Doug Cutting wrote: Do we really need to know the number of entries?
I don't think this is required for most operations.
So, should we remove it from places where it's not required?
SegmentSplitter should be fixed to use next().
I'm glad you agree.
Finally, perhaps we should add code to repair MapFile indexes. These
I already added this - MapFile.fix().
Sorry I missed that! That's good to have. Thanks!
However, currently when we open a MapFile, the readIndex() method just reports a truncated index and doesn't throw an Exception. IMO it should either throw an Exception or automatically run the fix() method.
There are many Nutch processes whose performance strongly depends on correct indexes of MapFiles. IMHO it's better to fix these errors asap instead of relying on the fact that they would still work...
We should not automatically attempt to modify a file when it's only been opened for read. So fixing the index in memory on open might be acceptable. But, better yet, as discussed below, let's make it so that indexes are always nearly complete, then the need to fix them perfectly will not be great. Right now, with a big index buffer and no flushes, when a process aborts the index is usually empty. If we instead make sure the index covers the vast majority of the file, then performance should be good, no?
Yes, of course. I think that adding a flush() operation when appending to the index would solve this nicely.
Right, I think we should rather flush the index whenever a data buffer (128k by default) is written, thus more or less syncing the files. The buffer for the index should also be 1k rather than 128k, but that alone won't fix the problem. We don't need to precisely sync the data and index; we can simply check after each append to see if a 128k boundary has been passed and, if it has, flush the index. Does that sound good to you?
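Something like the following is roughly what I have in mind. This is only a sketch, not the actual MapFile.Writer code: the class name BoundaryFlushingWriter, its fields, and the record layout are all made up for illustration, and it writes an index entry for every key just to keep things short.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical illustration only; not the real MapFile.Writer.
class BoundaryFlushingWriter {
    private final DataOutputStream data;
    private final DataOutputStream index;
    private final long boundary;   // data buffer size, e.g. 128 * 1024
    private long dataPos = 0;      // bytes appended to the data file so far
    private long lastBlock = 0;    // dataPos / boundary at the last index flush
    private int indexFlushes = 0;  // counts boundary-triggered flushes

    BoundaryFlushingWriter(OutputStream dataOut, OutputStream indexOut,
                           int dataBufferSize, int indexBufferSize) {
        this.data = new DataOutputStream(
            new BufferedOutputStream(dataOut, dataBufferSize));
        this.index = new DataOutputStream(
            new BufferedOutputStream(indexOut, indexBufferSize));
        this.boundary = dataBufferSize;
    }

    void append(byte[] key, byte[] value) throws IOException {
        // Index entry: key length, key, position of the record in the data file.
        index.writeInt(key.length);
        index.write(key);
        index.writeLong(dataPos);

        // Data record: key length, key, value length, value.
        data.writeInt(key.length);
        data.write(key);
        data.writeInt(value.length);
        data.write(value);
        dataPos += 8L + key.length + value.length;

        // No precise sync: just check whether this append crossed a boundary.
        long block = dataPos / boundary;
        if (block > lastBlock) {
            index.flush();
            lastBlock = block;
            indexFlushes++;
        }
    }

    int indexFlushes() {
        return indexFlushes;
    }

    void close() throws IOException {
        data.close();
        index.close();
    }
}
```

With this, a process that aborts loses at most the index entries for roughly the last data buffer, rather than the whole index.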
Cheers.
Doug
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
