Doug Cutting wrote:
Andrzej,

> I just fixed a few problems related to SegmentReader, but I think more remain.

> First, on open, SegmentReader scanned through the entire fetcher output file in order to find the total number of entries, which is expensive for large segments. I optimized this to use the index to skip to a point near the end, adding a new MapFile.finalKey() method. But for truncated segments whose index is incomplete this must still scan. Do we really need to know the number of entries?

I don't think this is required for most operations. IIRC this value is used mostly when displaying informative data about the segment. In the case of SegmentMergeTool it is interesting for the user to know how many entries are on the input and how many on the output - but of course we can calculate the input numbers only after processing the segments...
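For concreteness, here is a toy model of why finalKey() with an intact index is so much cheaper than a full scan. This is an editor's sketch, not Nutch's actual MapFile code; the interval matches Nutch's default, but the method names and counting are made up for illustration:

```java
// Toy model (not Nutch's MapFile): with an index entry recorded every
// INDEX_INTERVAL records, finalKey() can seek to the last indexed record
// and scan forward, instead of reading the whole data file from the start.
public class FinalKey {
    static final int INDEX_INTERVAL = 128; // Nutch's default index interval

    // Records read to reach the final key when the index is complete:
    // only everything after the last indexed record.
    static int withIndex(int total) {
        int lastIndexed = ((total - 1) / INDEX_INTERVAL) * INDEX_INTERVAL;
        return total - lastIndexed;
    }

    // Records read when the index is missing or fully truncated:
    // the whole file must be scanned.
    static int withoutIndex(int total) {
        return total;
    }

    public static void main(String[] args) {
        System.out.println(withIndex(1_000_000));    // 64
        System.out.println(withoutIndex(1_000_000)); // 1000000
    }
}
```

For a million-entry segment that is at most 128 records read instead of a million, which is why only the truncated-index case is still painful.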



> Then I noticed that IndexReader uses SegmentReader.get() rather than SegmentReader.next() to sequentially access entries. This is normally okay: even though SegmentReader.get() is capable of random access, it optimizes for sequential access patterns using the index. But when a MapFile has no index, or its index is truncated, this optimization fails, and every call to get() results in a scan of the file. Thus enumerating entries with get() can be quadratic in the size of the file! I was seeing segments with truncated indexes take more than 50 times as long to index! So I changed IndexReader to use SegmentReader.next() instead of SegmentReader.get(). Do you see any problem with this?

No problem, that's a good catch.
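To see where the quadratic cost comes from, a toy cost model helps. Again this is not Nutch code, and the segment size is an arbitrary example; it only counts how many records each strategy touches:

```java
// Toy model (not Nutch code): records touched when enumerating an N-entry
// file via repeated get() calls with no usable index (each call falls back
// to scanning from the start) versus a single sequential next() cursor.
public class ScanCost {
    // get() with no index: the call for entry i scans entries 0..i,
    // so the total over all N calls is N*(N+1)/2 -- quadratic.
    static long costWithGet(long n) {
        return n * (n + 1) / 2;
    }

    // next(): one linear pass over the file -- N records total.
    static long costWithNext(long n) {
        return n;
    }

    public static void main(String[] args) {
        long n = 100_000; // hypothetical segment size
        System.out.println("get():  " + costWithGet(n));  // 5000050000
        System.out.println("next(): " + costWithNext(n)); // 100000
    }
}
```

For 100,000 entries the repeated scans touch about five billion records versus 100,000 for the sequential pass, which more than accounts for a 50x slowdown in practice (real get() calls are not all worst-case).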


> Also, I note that SegmentSplitter and SegmentMerger both use the same idiom of (a) finding the length of the file; then (b) using SegmentReader.get() to "randomly" access its contents. Is there any reason not to change these to use SegmentReader.next() instead? SegmentReader.get() should really only be used when access is truly random, e.g. with query results, as it can be very slow with truncated files.

SegmentSplitter should be fixed to use next(). However, in the case of SegmentMergeTool's last phase (copying only the unique records), it loops through the documents in the temporary Lucene index, and these can potentially point to random entries. On the other hand, the algorithm could be reworked to loop through the segment data sequentially and run a TermQuery against the temp index to check whether each record has been deleted. I need to check which version is faster.



> Finally, perhaps we should add code to repair MapFile indexes. These

I already added this - MapFile.fix().

However, currently when we open a MapFile the readIndex() method just reports that the index is truncated, and doesn't throw an exception. IMO it should either throw an exception, or automatically run the fix() method.

There are many Nutch processes whose performance depends strongly on correct MapFile indexes. IMHO it's better to fix these errors as soon as possible, instead of relying on things continuing to work in spite of them...
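The proposed policy could look something like the following sketch. The names here are hypothetical stand-ins (the real methods are MapFile's readIndex() and the new MapFile.fix()); it only models the decision of throwing versus auto-repairing on open:

```java
// Sketch of the policy proposed above (hypothetical names, not the real
// Nutch API): on open, a truncated index is either a hard error or is
// repaired immediately, instead of being silently tolerated.
public class IndexPolicy {
    static final boolean AUTO_FIX = true; // hypothetical switch

    // Returns the number of usable index entries after applying the policy.
    static int readIndex(int expected, int found) {
        if (found < expected) {
            if (!AUTO_FIX) {
                throw new IllegalStateException("truncated MapFile index: "
                        + found + "/" + expected + " entries");
            }
            return expected; // stand-in for MapFile.fix() rebuilding the index
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(readIndex(8192, 4096)); // 8192: repaired on open
    }
}
```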

> record the key of every 128th entry. But, since you've increased the I/O buffer size to 128k, this buffer is only flushed every 8k entries. So random access among the last 8k index entries of a truncated file will always require a scan of, on average, (8k*128)/2 = 512k entries rather than 128 entries, 4096 times slower! So either we should keep the write buffer size small for MapFile indexes, or we should attempt to repair the index data structure in memory with a single scan, rather than scanning on each access. Perhaps both. Decreasing the write buffer size for indexes is the easiest thing to do. I don't see a reason to have this larger than 1k, since it's only touched every 128 entries. A 1k buffer would hold 64 ArrayFile index entries, so we'd still have to scan 8k entries. Maybe we should just flush the index each time something is added to it?

> Does this all make sense?

Yes, of course. I think that adding a flush() operation when appending to the index would solve this nicely.
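Doug's arithmetic above can be checked with a quick back-of-the-envelope model. The 16-byte index entry size is an assumption (it makes the numbers come out as he describes); everything else follows from it:

```java
// Back-of-the-envelope model of the buffer-size arithmetic (the 16-byte
// index entry size is an assumption): an index entry is written every
// INDEX_INTERVAL records, and index entries sit in the write buffer until
// BUFFER_BYTES of them accumulate. Anything still buffered at truncation
// time is lost, leaving that tail of the file unindexed.
public class IndexLag {
    static final int INDEX_INTERVAL = 128; // records per index entry
    static final int ENTRY_BYTES    = 16;  // assumed size of one index entry

    // Records whose index entries may be lost if the writer dies before
    // the index buffer is flushed.
    static long unindexedRecords(int bufferBytes) {
        return (long) (bufferBytes / ENTRY_BYTES) * INDEX_INTERVAL;
    }

    public static void main(String[] args) {
        System.out.println(unindexedRecords(128 * 1024)); // 1048576
        System.out.println(unindexedRecords(1024));       // 8192
        System.out.println(unindexedRecords(ENTRY_BYTES)); // 128
    }
}
```

With a 128k buffer up to 1M trailing records are unindexed (an average scan of 512k, as Doug computes), a 1k buffer brings that down to 8k, and flushing on every append bounds the unindexed tail at a single 128-record interval.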


--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
