Andrzej Bialecki wrote:
[EMAIL PROTECTED] wrote:
Interesting to know. However, I never had such luck: I got an unexpected EOF exception every time.
Yeah, that's the symptom of missing index.
I thought I'd fixed this some time ago. One might still get an EOFException when iterating through entries from a truncated segment, but no longer when opening it. So it should always be possible to read all the entries that were flushed: an index file should always be present, and EOF on the index file should be trapped, generating only a warning.
You are right - I checked the code once again. There should only be a warning, unless the index file was missing. But in that case there is an IOException thrown, not an EOFException - perhaps Stefan had this in mind...
I recently added a MapFile.fix() method to restore a missing index file if there is at least partial data in the "data" file. There is a frontend tool for fixing partial segments, which I plan to commit today/tomorrow.
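To illustrate the idea of restoring an index from partial data, here is a minimal Java sketch. It does not use the real MapFile code: the IndexRebuilder class, its record layout (a writeUTF key followed by a writeUTF value) and the sample() helper are all assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is NOT the real MapFile.fix().
class IndexRebuilder {

    // Scans the data sequentially and indexes every interval-th complete record.
    static Map<String, Long> rebuild(byte[] data, int interval) {
        Map<String, Long> index = new LinkedHashMap<>();
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int count = 0;
        try {
            while (in.available() > 0) {
                long offset = data.length - in.available(); // start of this record
                String key = in.readUTF();
                in.readUTF(); // skip the value; throws EOFException if truncated
                if (count % interval == 0) {
                    index.put(key, offset);
                }
                count++;
            }
        } catch (EOFException e) {
            // Truncated tail: keep the entries for the complete records.
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return index;
    }

    // Builds five small records: (a,1) (b,2) (c,3) (d,4) (e,5), 6 bytes each.
    static byte[] sample() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            String[] keys = {"a", "b", "c", "d", "e"};
            for (int i = 0; i < keys.length; i++) {
                out.writeUTF(keys[i]);
                out.writeUTF(Integer.toString(i + 1));
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] truncated = Arrays.copyOf(sample(), 27); // cut inside the last record
        System.out.println(rebuild(truncated, 2));      // prints {a=0, c=12}
    }
}
```

The last record is cut off in the middle, so it is not indexed. Every entry that was written completely remains reachable through the rebuilt index.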
Actually, it is possible to make it more resilient to crashes. First, set MapFile.Writer.setIndexInterval() to a smaller value (the default is 128; most likely it should be read from the config). Second, make the BufferedRandomAccessFile.flushBuffer() method public, so that SequenceFile.Writer can call it after each index append. This way the index is not only always written promptly (as if it were unbuffered) but also more frequently, resulting in smaller "chunks" of possibly lost data.
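The effect of a smaller index interval combined with a flush at each index append can be sketched with a toy writer. This is only a simulation under stated assumptions: IntervalFlushDemo and durableRecords() are hypothetical names, the "disk" is an in-memory byte array, and a crash is simulated by simply never flushing the tail of the buffer.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; none of these names come from Nutch.
class IntervalFlushDemo {

    // Writes `total` 8-byte records, then "crashes" without a final flush.
    // Returns how many records reached the simulated disk.
    static int durableRecords(int total, int indexInterval) {
        ByteArrayOutputStream dataDisk = new ByteArrayOutputStream();
        ByteArrayOutputStream indexDisk = new ByteArrayOutputStream();
        DataOutputStream data =
            new DataOutputStream(new BufferedOutputStream(dataDisk, 64 * 1024));
        DataOutputStream index =
            new DataOutputStream(new BufferedOutputStream(indexDisk, 64 * 1024));
        try {
            for (int i = 0; i < total; i++) {
                if (i % indexInterval == 0) {
                    index.writeLong(i);      // record number
                    index.writeLong(i * 8L); // byte offset of that record
                    index.flush();           // flush right after the index append...
                    data.flush();            // ...together with the data written so far
                }
                data.writeLong(i);           // stays in the buffer until the next flush
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // No close() and no final flush: this is the simulated crash.
        return dataDisk.size() / 8;
    }

    public static void main(String[] args) {
        System.out.println(durableRecords(1000, 128)); // prints 896
        System.out.println(durableRecords(1000, 16));  // prints 992
    }
}
```

With an interval of 128 the simulated crash loses 104 of the 1000 records; with an interval of 16 it loses 8. The lost "chunk" is never larger than one index interval.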
Are you certain that the index is the problem?
No. That was a fairly old email, since then I've been reading the code ;-)
Now I know that it's not critical for the index file to be complete, it just somewhat degrades the performance if it's not complete - nice design!
However, regarding the buffering: when I was testing the SegmentMergeTool, it made a noticeable difference to use slightly larger buffers than the default of 1 hardware page - all arguments about FS caching notwithstanding... I ended up using buffers of 128kB-1024kB in size. However, for such large buffers it sometimes makes sense to call flushBuffer() explicitly (e.g. after finishing processing at some stage from which it's possible to recover later, even if the process crashes right after that). This method is private now, but I'm thinking of making it public for this reason.
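The same trade-off can be shown with the standard java.io classes instead of BufferedRandomAccessFile. The sketch below is an assumption-laden stand-in: CheckpointedWriter and demo() are hypothetical names, and BufferedOutputStream.flush() plays the role that a public flushBuffer() would play. Note that flush() only hands the bytes to the operating system, which protects against a process crash but not necessarily against power loss.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; not the Nutch BufferedRandomAccessFile.
class CheckpointedWriter {
    static final int BUFFER_SIZE = 128 * 1024; // large buffer, as in the text

    // Returns the file size seen at four moments:
    // before the checkpoint flush, after it, before close, after close.
    static long[] demo() {
        try {
            Path file = Files.createTempFile("segment", ".data");
            try {
                long beforeFlush, afterFlush, beforeClose;
                byte[] record = new byte[100];
                try (OutputStream out =
                         new BufferedOutputStream(Files.newOutputStream(file), BUFFER_SIZE)) {
                    for (int i = 0; i < 10; i++) {
                        out.write(record);          // 1000 bytes, all still buffered
                    }
                    beforeFlush = Files.size(file);
                    out.flush();                    // explicit checkpoint
                    afterFlush = Files.size(file);
                    for (int i = 0; i < 10; i++) {
                        out.write(record);          // buffered again, lost on a crash
                    }
                    beforeClose = Files.size(file);
                }
                long afterClose = Files.size(file); // close() flushed the rest
                return new long[] {beforeFlush, afterFlush, beforeClose, afterClose};
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(demo()));
    }
}
```

Only the data written before the explicit flush would survive a crash that happens before close(), which is why a flush at each recoverable stage is worth having when the buffer is this large.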
Perhaps instead one could just trap EOF in MapFile.Reader.next() to generate a warning and return null?
Whether we should return null or throw an exception depends on how serious we think this error condition is. I tend to agree with you that an EOFException here is not a tragedy, but a more or less "normal" condition, so we could return null + a warning.
Now, the question is whether we have any other code that depends on next() not returning null...
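A minimal sketch of the "return null + a warning" behaviour, written against plain java.io rather than the real MapFile.Reader. The TolerantReader class, its writeUTF record format and the readAll() helper are assumptions for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical stand-in for a record reader; NOT the real MapFile.Reader.
class TolerantReader {
    private static final Logger LOG = Logger.getLogger("TolerantReader");
    private final DataInputStream in;

    TolerantReader(byte[] data) {
        this.in = new DataInputStream(new ByteArrayInputStream(data));
    }

    // Returns the next record, or null at a clean end or at a truncated tail.
    String next() throws IOException {
        if (in.available() == 0) {
            return null; // clean end of data, nothing to report
        }
        try {
            return in.readUTF();
        } catch (EOFException e) {
            LOG.warning("Truncated record at end of data; stopping iteration");
            return null; // treated as a "normal" condition
        }
    }

    // Callers must stop on null, which is exactly the dependency in question.
    static List<String> readAll(byte[] data) {
        List<String> records = new ArrayList<>();
        TolerantReader reader = new TolerantReader(data);
        try {
            String record;
            while ((record = reader.next()) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF("alpha");
        out.writeUTF("beta");
        out.writeUTF("gamma");
        byte[] truncated = Arrays.copyOf(bytes.toByteArray(), 17); // cut inside "gamma"
        System.out.println(readAll(truncated)); // prints [alpha, beta]
    }
}
```

Any caller written like readAll() copes with the change. A caller that assumes next() never returns null would need a check added.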
-- Best regards, Andrzej Bialecki
-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
_______________________________________________
Nutch-general mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/nutch-general
