[EMAIL PROTECTED] wrote:

The segment that you are currently crawling will be lost.

Not really - you get a partial segment, which may or may not be usable.


Interesting to know. However, I never had such luck; every time I got an unexpected EOF exception.

Yeah, that's the symptom of missing index.

Maybe this would be one of the useful improvements to make Nutch more error resistant.

Actually, it is possible to make it more resilient to crashes in two steps. First, set MapFile.Writer.setIndexInterval() to a smaller value (the default is 128; most likely it should be read from the config). Second, make the BufferedRandomAccessFile.flushBuffer() method public, so that SequenceFile.Writer can call it after each index append. This way the index is not only written out promptly (as if it were unbuffered) but also more frequently, resulting in smaller "chunks" of possibly lost data.


The cost of this is slightly increased memory use (the index file is loaded fully into memory by MapFile.Reader); the other factors (increased disk usage for the index file, decreased write performance of the index file because of buffer thrashing) are probably negligible. The advantage is that you should be able to read more valid entries from corrupted files.
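
To illustrate the idea, here is a minimal, self-contained sketch. It does not use the Nutch classes: a java.io.BufferedOutputStream stands in for BufferedRandomAccessFile, a ByteArrayOutputStream stands in for the index file on disk, and the class name, method name, 16-byte index entry format and 100-byte record size are all made up for the example. It only shows how flushing after each index append, and a smaller index interval, change how much of the index survives a crash.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class SparseIndexSketch {
    // Toy index entry format (an assumption): 8-byte key plus 8-byte data offset.
    static final int INDEX_ENTRY_BYTES = 16;

    /**
     * Appends `entries` records, writing one index entry every `indexInterval`
     * records, then simulates a crash by never flushing or closing the stream.
     * Returns how many index entries reached the underlying store.
     */
    public static int indexEntriesSurvivingCrash(int entries, int indexInterval,
                                                 boolean flushAfterIndexAppend) {
        // Stands in for the index file on disk.
        ByteArrayOutputStream indexStore = new ByteArrayOutputStream();
        // Stands in for the buffered index writer (8 KB buffer).
        DataOutputStream index =
                new DataOutputStream(new BufferedOutputStream(indexStore, 8192));
        try {
            long dataOffset = 0;
            for (long key = 0; key < entries; key++) {
                if (key % indexInterval == 0) {
                    index.writeLong(key);
                    index.writeLong(dataOffset);
                    if (flushAfterIndexAppend) {
                        index.flush(); // push the entry out of the buffer right away
                    }
                }
                dataOffset += 100; // pretend each record takes 100 bytes of data
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Simulated crash: whatever is still in the buffer is lost.
        return indexStore.size() / INDEX_ENTRY_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(indexEntriesSurvivingCrash(1000, 128, false)); // 0
        System.out.println(indexEntriesSurvivingCrash(1000, 128, true));  // 8
        System.out.println(indexEntriesSurvivingCrash(1000, 16, true));   // 63
    }
}
```

In this toy model, without the flush none of the 8 index entries leave the buffer before the crash, which corresponds to the missing-index symptom above. With the flush all 8 survive, and with an interval of 16 there are 63 entries, so the stretch of data past the last index entry is at most 15 records instead of 127.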


Thanks for the hint; maybe we should add this to the wiki as well.

Feel free to update it, if you wish.

--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)



_______________________________________________
Nutch-general mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-general
