Re: Problem with crawling using the latest 1.0 trunk

Doğacan Güney Tue, 03 Mar 2009 07:52:58 -0800

Hi,

On Tue, Mar 3, 2009 at 15:27, Andrzej Bialecki wrote:
> Andrzej Bialecki wrote:
>>
>> Sami Siren wrote:
>>>
>>> Sami Siren wrote:
>>>>
>>>> Sami Siren wrote:
>>>>>
>>>>> I can see this error also. not sure yet what's going wrong...
>>>>
>>>> it's NUTCH-703 (hadoop upgrade) that broke the indexing. any ideas what
>>>> changed in hadoop that might have caused this?
>>>
>>> found the hostile hadoop commit:
>>> http://svn.apache.org/viewvc?view=rev&revision=736239
>>>
>>> any ideas how to proceed? Naturally I won't be starting the release
>>> process before this is resolved.
>>
>> I'll work on this now, we'll see if there's a solution ... In the worst
>> case we could downgrade to 0.19.0, but there were some unpleasant bugs there
>> - so I'll try to find a solution so that we can keep 0.19.1 .
>
> For now I tracked it down to the missing field options in LuceneWriter -
> basically, the only field options it has are the following:
>
> fieldIndex: {segment=NO, digest=NO, boost=NO}
> fieldStore: {segment=YES, boost=YES, digest=YES}
> fieldVector: {segment=NO, digest=NO, boost=NO}
>
> So if the LuceneWriter processes e.g. "site", it comes back with Store.NO
> and Index.NO, which indeed doesn't make sense.
>


The field options are added in the index-basic plugin, in the
addIndexBackendOptions method.
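
To make the failure mode concrete, here is a minimal, self-contained sketch
of what seems to be happening. It is not the actual LuceneWriter code: the
class FieldOptionsSketch, its enums, and all method names are hypothetical,
and the "fall back to NO when nothing is registered" default is an
assumption based on the behavior Andrzej describes. It only shows that a
field such as "site" ends up with Store.NO and Index.NO when no plugin has
registered options for it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the per-field option tables; NOT Nutch's API.
class FieldOptionsSketch {
    enum Store { YES, NO }
    enum Index { NO, TOKENIZED, UN_TOKENIZED }

    private final Map<String, Store> fieldStore = new HashMap<>();
    private final Map<String, Index> fieldIndex = new HashMap<>();

    // What a plugin would be expected to call for each field it emits.
    void addFieldOptions(String field, Store store, Index index) {
        fieldStore.put(field, store);
        fieldIndex.put(field, index);
    }

    // Assumption: an unregistered field falls back to NO.
    Store storeFor(String field) {
        return fieldStore.getOrDefault(field, Store.NO);
    }

    Index indexFor(String field) {
        return fieldIndex.getOrDefault(field, Index.NO);
    }

    // A field that is neither stored nor indexed cannot be written usefully.
    boolean isUsable(String field) {
        return storeFor(field) != Store.NO || indexFor(field) != Index.NO;
    }

    // Reproduces the three entries reported in the quoted message.
    static FieldOptionsSketch withReportedOptions() {
        FieldOptionsSketch opts = new FieldOptionsSketch();
        opts.addFieldOptions("segment", Store.YES, Index.NO);
        opts.addFieldOptions("digest", Store.YES, Index.NO);
        opts.addFieldOptions("boost", Store.YES, Index.NO);
        return opts;
    }
}
```

With only the three reported entries, isUsable("site") is false; once
options for "site" are registered, it becomes true. So the question is why
the registrations from the plugin never reach the writer.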

> Please note that LuceneWriter is properly initialized with the current
> JobConf - so I'm not sure where these defaults should come from ...? The
> code in LuceneWriter can take this information from
> NutchDocument.metadata[lucene.*] properties, but they are not populated by
> any indexing plugin ...
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>



-- 
Doğacan Güney
