I am having the same problem as Allan. I checked out Mahout from trunk,
tried to create term frequency vectors from a Lucene index, and ran into
this:
09/10/27 17:36:10 INFO lucene.Driver: Output File:
/Users/shoeseal/DATA/luc2tvec.out
09/10/27 17:36:11 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
09/10/27 17:36:11 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.NullPointerException
at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:109)
at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:1)
at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:40)
at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:200)
I am running this from Eclipse (Snow Leopard with JDK 6), on an index that
has a field with stored term vectors.
My input parameters for Driver are:
--dir <path>/smallidx/ --output <path>/luc2tvec.out --idField id_field
--field field_with_TV --dictOut <path>/luc2tvec.dict --max 50 --weight tf
Luke shows the following info on the fields I am using:
id_field is indexed, stored, omit norms
field_with_TV is indexed, tokenized, stored, term vector
I can run the LuceneIterableTest test fine, but when I run the Driver on my
index I get into trouble. Are there any possible reasons for this behavior
besides not having an index field with stored term vectors?
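For what it's worth, the NPE at LuceneIterable.java:109 looks consistent with
getTermFreqVector() returning null for a document whose field has no stored
term vector. Below is a minimal sketch of that failure mode and a guard that
would skip such documents instead of dereferencing null — plain Java with no
Lucene dependency; the getTermFreqVector stand-in, the process helper, and the
toy index are all hypothetical, not Mahout's actual code:

```java
import java.util.HashMap;
import java.util.Map;

public class TermVectorGuard {
    // Hypothetical stand-in for Lucene's IndexReader.getTermFreqVector(doc, field):
    // returns null when the document's field was indexed without a term vector.
    static Map<String, Integer> getTermFreqVector(
            int doc, Map<Integer, Map<String, Integer>> index) {
        return index.get(doc);
    }

    static String process() {
        Map<Integer, Map<String, Integer>> index = new HashMap<>();
        index.put(0, Map.of("mahout", 2, "lucene", 1)); // has a term vector
        index.put(1, null);                             // indexed without a term vector
        int written = 0, skipped = 0;
        for (int doc = 0; doc < 2; doc++) {
            Map<String, Integer> tv = getTermFreqVector(doc, index);
            if (tv == null) { skipped++; continue; } // guard: avoids the NPE
            written++;                               // a real writer would emit the vector
        }
        return written + " written, " + skipped + " skipped";
    }

    public static void main(String[] args) {
        System.out.println(process()); // → 1 written, 1 skipped
    }
}
```

Iterating without that null check, as the stack trace suggests, blows up on the
first document that lacks the vector.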
Thanks.
- sushil
Grant Ingersoll-6 wrote:
>
>
> On Jul 2, 2009, at 12:09 PM, Allan Roberto Avendano Sudario wrote:
>
>> Regards,
>> This is the entire exception message:
>>
>>
>> java -cp $JAVACLASSPATH org.apache.mahout.utils.vectors.Driver --dir
>> /home/hadoop/Desktop/<urls>/index --field content --dictOut
>> /home/hadoop/Desktop/dictionary/dict.txt --output
>> /home/hadoop/Desktop/dictionary/out.txt --max 50 --norm 2
>>
>>
>> 09/07/02 09:35:47 INFO vectors.Driver: Output File:
>> /home/hadoop/Desktop/dictionary/out.txt
>> 09/07/02 09:35:47 INFO util.NativeCodeLoader: Loaded the native-hadoop
>> library
>> 09/07/02 09:35:47 INFO zlib.ZlibFactory: Successfully loaded & initialized
>> native-zlib library
>> 09/07/02 09:35:47 INFO compress.CodecPool: Got brand-new compressor
>> Exception in thread "main" java.lang.NullPointerException
>> at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:111)
>> at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:82)
>> at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:25)
>> at org.apache.mahout.utils.vectors.Driver.main(Driver.java:204)
>>
>>
>> Well, I used a Nutch crawl index, is that correct? Hmm... I have changed
>> to the contenc field, but nothing happened.
>> Possibly the Nutch crawl doesn't have term vectors indexed.
>
> This would be my guess. A small edit to Nutch code would probably
> allow it. Just find where it creates a new Field and add in the TV
> stuff.
>
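To make Grant's suggestion concrete, here is a hedged sketch of what adding
"the TV stuff" at indexing time looks like with the Lucene 2.9-era API that
Mahout trunk targeted then — the class name, sample text, and the round-trip
check are illustrative, not the actual Nutch indexing code:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TermVectorDemo {
    static boolean hasTermVector() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        // The crucial part: Field.TermVector.YES stores the term vector,
        // which is what the Mahout Driver reads back.
        doc.add(new Field("content", "mahout vectors from a lucene index",
                Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        writer.addDocument(doc);
        writer.close();
        // Verify the vector survived the round trip.
        IndexReader reader = IndexReader.open(dir, true);
        TermFreqVector tfv = reader.getTermFreqVector(0, "content");
        reader.close();
        return tfv != null;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hasTermVector() ? "term vector stored" : "no term vector");
    }
}
```

Without the Field.TermVector.YES argument, getTermFreqVector() returns null for
that field, which matches the NPEs reported above.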
--
View this message in context:
http://www.nabble.com/Creating-Vectors-from-Text-tp24298643p26087537.html
Sent from the Mahout User List mailing list archive at Nabble.com.