Re: Problem with crawling using the latest 1.0 trunk

Andrzej Bialecki Tue, 03 Mar 2009 05:28:17 -0800

Andrzej Bialecki wrote:

Sami Siren wrote:
Sami Siren wrote:
Sami Siren wrote:
I can see this error also. not sure yet what's going wrong...
It's NUTCH-703 (hadoop upgrade) that broke the indexing. Any ideas what changed in hadoop that might have caused this?
Found the hostile hadoop commit: http://svn.apache.org/viewvc?view=rev&revision=736239
Any ideas how to proceed? Naturally I won't be starting the release process before this is resolved.
I'll work on this now, we'll see if there's a solution ... In the worst case we could downgrade to 0.19.0, but there were some unpleasant bugs there - so I'll try to find a solution so that we can keep 0.19.1.

For now I tracked it down to the missing field options in LuceneWriter - basically, the only field options it has are the following:


fieldIndex: {segment=NO, digest=NO, boost=NO}
fieldStore: {segment=YES, boost=YES, digest=YES}
fieldVector: {segment=NO, digest=NO, boost=NO}

So if the LuceneWriter processes e.g. "site", it comes back with Store.NO and Index.NO, which indeed doesn't make sense.
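
To illustrate the lookup described above, here is a minimal sketch in plain Java. It is not the actual LuceneWriter code: the class name, the enums and the method names are hypothetical stand-ins, and the only things taken from this thread are the three populated fields and the observation that an unknown field comes back as Store.NO / Index.NO.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the field-option lookup; NOT the real LuceneWriter.
class FieldOptionsSketch {

    // Simplified placeholders for Lucene's Field.Store / Field.Index options.
    public enum Store { YES, NO }
    public enum Index { NO, TOKENIZED, UN_TOKENIZED }

    private final Map<String, Store> fieldStore = new HashMap<>();
    private final Map<String, Index> fieldIndex = new HashMap<>();

    public FieldOptionsSketch() {
        // Only these three fields have options, as in the dump above.
        for (String name : new String[] {"segment", "digest", "boost"}) {
            fieldStore.put(name, Store.YES);
            fieldIndex.put(name, Index.NO);
        }
    }

    // Assumption: a field with no configured option falls back to NO.
    public Store storeFor(String field) {
        return fieldStore.getOrDefault(field, Store.NO);
    }

    public Index indexFor(String field) {
        return fieldIndex.getOrDefault(field, Index.NO);
    }

    // A field that is neither stored nor indexed contributes nothing to the index.
    public boolean isNeitherStoredNorIndexed(String field) {
        return storeFor(field) == Store.NO && indexFor(field) == Index.NO;
    }

    public static void main(String[] args) {
        FieldOptionsSketch options = new FieldOptionsSketch();
        for (String field : new String[] {"segment", "site"}) {
            System.out.println(field + ": store=" + options.storeFor(field)
                    + ", index=" + options.indexFor(field));
        }
    }
}
```

With only those three entries present, any other field such as "site" ends up neither stored nor indexed, which is the combination that doesn't make sense.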

Please note that LuceneWriter is properly initialized with the current JobConf - so I'm not sure where these defaults should come from ...? The code in LuceneWriter can take this information from NutchDocument.metadata[lucene.*] properties, but they are not populated by any indexing plugin ...


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
