Andrzej,

I just fixed a few problems related to SegmentReader, but I think more remain.

First, on open, SegmentReader scanned through the entire fetcher output file to find the total number of entries, which is expensive for large segments. I optimized this to use the index to skip to a point near the end, adding a new MapFile.finalKey() method. But for truncated segments whose index is incomplete it must still scan. Do we really need to know the number of entries?
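
Roughly, the idea behind finalKey() is this (a sketch only, not the actual patch; readIndex(), positions[], count, newKey() and data.next(key) are stand-ins for whatever the real MapFile.Reader internals look like):

  // Sketch: use the in-memory index to jump near the end of the data file,
  // then scan the short tail instead of the whole file.
  public synchronized WritableComparable finalKey() throws IOException {
    readIndex();                          // load the (key, position) index
    if (count > 0)
      data.seek(positions[count-1]);      // jump to the last indexed entry
    WritableComparable key = newKey();
    while (data.next(key)) {}             // scan the tail (normally < 128 entries)
    return key;                           // the last key successfully read
  }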

Then I noticed that IndexReader uses SegmentReader.get() rather than SegmentReader.next() to access entries sequentially. This is normally okay: even though SegmentReader.get() is capable of random access, it optimizes for sequential access patterns using the index. But when a MapFile has no index, or its index is truncated, that optimization fails, and every call to get() results in a scan of the file. Thus enumerating entries with get() can be quadratic in the size of the file! I was seeing segments with truncated indexes take more than 50 times as long to index. So I changed IndexReader to use SegmentReader.next() instead of SegmentReader.get(). Do you see any problem with this?
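
In other words, the change is essentially this (sketched with simplified argument lists; indexEntry() is a hypothetical stand-in for whatever IndexReader does with each entry):

  // Before: "random" access by entry number.  With a truncated index each
  // call to get() degenerates into a scan, so the whole loop is quadratic.
  for (long i = 0; i < end; i++) {        // end == number of entries in the segment
    reader.get(i, fetcherOutput, content, parseText, parseData);
    indexEntry(fetcherOutput, content, parseText, parseData);
  }

  // After: one sequential pass over the data file, no index needed at all.
  while (reader.next(fetcherOutput, content, parseText, parseData)) {
    indexEntry(fetcherOutput, content, parseText, parseData);
  }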

Also, I note that SegmentSplitter and SegmentMerger both use the same idiom of (a) finding the length of the file; then (b) using SegmentReader.get() to "randomly" access its contents. Is there any reason not to change these to use SegmentReader.next() instead? SegmentReader.get() should really only be used when access is truly random, e.g. with query results, as it can be very slow with truncated files.

Finally, perhaps we should add code to repair MapFile indexes. These record the key of every 128th entry. But, since you've increased the I/O buffer size to 128k, the index buffer is only flushed every 8k index entries, i.e. roughly every 1M data entries. So random access within the portion of a truncated file that those unflushed index entries would have covered will always require a scan of, on average, half of 8k*128 entries, about 512k, rather than at most 128: roughly 4096 times slower! So either we should keep the write buffer small for MapFile indexes, or we should attempt to repair the index data structure in memory with a single scan, rather than scanning on each access. Perhaps both. Decreasing the write buffer size for indexes is the easiest thing to do. I don't see a reason to make it larger than 1k, since the index is only written to every 128 entries. A 1k buffer would hold 64 ArrayFile index entries, so after truncation we could still have to scan 64*128 = 8k entries. Maybe we should just flush the index each time something is added to it?
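
For concreteness, the arithmetic (the 16-bytes-per-index-entry figure is an assumption, a small key plus a long position; the interval and buffer sizes are the ones discussed above):

  // Back-of-the-envelope numbers for the paragraph above.
  public class IndexBufferMath {
    public static void main(String[] args) {
      int indexInterval   = 128;                        // one index entry per 128 data entries
      int indexEntryBytes = 16;                         // assumed size of a (key, position) pair

      int bigBuffer     = 128 * 1024;                   // current index write buffer
      int lostIndexKeys = bigBuffer / indexEntryBytes;  // up to 8192 index entries unflushed
      int uncovered     = lostIndexKeys * indexInterval;            // ~1M data entries uncovered
      System.out.println("uncovered data entries: " + uncovered);
      System.out.println("average scan per access: " + uncovered / 2);  // ~512k, vs. at most 128

      int smallBuffer = 1024;                           // proposed index write buffer
      int lostSmall   = smallBuffer / indexEntryBytes;  // 64 index entries unflushed
      System.out.println("worst-case scan with 1k buffer: " + lostSmall * indexInterval);  // 8192
    }
  }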

Does this all make sense?

Doug

