On Fri, Nov 12, 2004 at 11:15:18PM +0100, Andrzej Bialecki wrote:
> Andrzej Bialecki wrote:
> >Hi,
> >
> >I just committed a high-level API for working with segment data. The 
> >classes are located in net.nutch.segment.* package.
> 
> I just realized that SegmentReader doesn't work now with segments 
> created by Fetcher in -noparse mode. I'll fix it in a day or two.
> 
> However, I have a similar issue for the new version of SegmentMergeTool, 
> but here I'm not sure how to react if I discover mixed-mode segments on 
> input, i.e. some segments created in full-parse mode, and some created 
> in -noparse mode... Should the tool in such case do one of the following:
> 
> * assume that you don't want the parse data, and you will re-create it 
> anyway for all data in the output segment. This means that it should 
> merge all input from both "fetcher" and "fetcher_output" into a single 
> output "fetcher", and at the same time discard all data in parse_data 
> and parse_text.

One thorny issue is how to deal with the various FetcherOutput states.
Before parsing was separated from fetching, a failed parse
was logged as NOT_FOUND. Now it will be marked as CANT_PARSE.
We may have to increase VERSION in FetcherOutput from 4 to 5,
so that an "old" ./fetcher can be easily distinguished from a new ./fetcher
and ./fetcher_output. I did not do that because I did not feel compelled
to at the time.
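
To illustrate why a version bump would help, here is a minimal sketch of a
reader that uses a leading version byte to decide how far a NOT_FOUND
status can be trusted. Everything in it is hypothetical: the class name,
the record layout (version byte, length, status string) and the string
status values are assumptions made for the example, not the actual
FetcherOutput format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a versioned record; NOT the real FetcherOutput layout.
class FetcherVersionSketch {
    static final byte OLD_VERSION = 4; // failed parses were logged as NOT_FOUND
    static final byte NEW_VERSION = 5; // failed parses are marked CANT_PARSE

    // Serialize as: 1 version byte, 4-byte length, UTF-8 status string.
    static byte[] write(byte version, String status) {
        byte[] s = status.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(1 + 4 + s.length)
                .put(version).putInt(s.length).put(s).array();
    }

    // Read a record back and interpret the status in light of its version.
    static String describe(byte[] record) {
        ByteBuffer in = ByteBuffer.wrap(record);
        byte version = in.get();
        byte[] s = new byte[in.getInt()];
        in.get(s);
        String status = new String(s, StandardCharsets.UTF_8);

        if (version > NEW_VERSION) {
            throw new IllegalArgumentException("unsupported version " + version);
        }
        // In old records NOT_FOUND may hide a failed parse, so flag it.
        if (version <= OLD_VERSION && status.equals("NOT_FOUND")) {
            return "NOT_FOUND or CANT_PARSE (ambiguous)";
        }
        return status;
    }

    public static void main(String[] args) {
        System.out.println(describe(write(OLD_VERSION, "NOT_FOUND")));
        System.out.println(describe(write(NEW_VERSION, "NOT_FOUND")));
    }
}
```

Without the bump, both records would carry version 4 and the reader
could not tell the two cases apart.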

John

> 
> * assume you want only parsed segments, and skip all non-parsed 
> segments, issuing a warning
> 
> * assume you want only parsed segments, and run ParseSegment if 
> parse_text is missing.
> 
> 
> Any suggestions?
> 
> -- 
> Best regards,
> Andrzej Bialecki
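
The three options listed above could be sketched as a policy switch, shown
here only to make the trade-off concrete. The class, the enum and the
plan() helper are hypothetical and are not part of SegmentMergeTool; each
segment is represented by a name and a flag saying whether it already has
parse data (for example, whether parse_text is present).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical planner for mixed-mode input; not the real SegmentMergeTool.
class MixedModeMergeSketch {
    enum Policy {
        DISCARD_PARSE_DATA, // merge fetch data only, drop parse_data/parse_text
        SKIP_UNPARSED,      // keep parsed segments, warn about the rest
        PARSE_MISSING       // run a parse step where parse_text is missing
    }

    // segments maps a segment name to true if it already contains parse data.
    static List<String> plan(Map<String, Boolean> segments, Policy policy) {
        List<String> steps = new ArrayList<>();
        // Input is "mixed" only if both parsed and unparsed segments occur.
        boolean mixed = segments.containsValue(true) && segments.containsValue(false);
        for (Map.Entry<String, Boolean> e : segments.entrySet()) {
            String name = e.getKey();
            boolean parsed = e.getValue();
            if (!mixed) {
                steps.add("merge " + name); // uniform input needs no policy
                continue;
            }
            switch (policy) {
                case DISCARD_PARSE_DATA:
                    steps.add("merge " + name + " (fetch data only)");
                    break;
                case SKIP_UNPARSED:
                    steps.add(parsed ? "merge " + name
                                     : "skip " + name + " (warning: not parsed)");
                    break;
                case PARSE_MISSING:
                    steps.add(parsed ? "merge " + name
                                     : "parse then merge " + name);
                    break;
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        Map<String, Boolean> segments = new LinkedHashMap<>();
        segments.put("seg1", true);
        segments.put("seg2", false);
        for (Policy p : Policy.values()) {
            System.out.println(p + ": " + plan(segments, p));
        }
    }
}
```

Making the policy an explicit argument would leave the choice to whoever
runs the tool, instead of having it guess.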

