Gal Nitzan wrote:
Hi,

Well, mergesegs is still very slow for me:

[EMAIL PROTECTED] nutch]# tail -f nutch-mergesegs-kunzon.com.log
050919 171351  Processed 120000 records (1146.5918 rec/s)
050919 171408  Processed 140000 records (1158.2788 rec/s)
050919 171428  Processed 160000 records (1019.8358 rec/s)
050919 171451  Processed 180000 records (879.2368 rec/s)
050919 171510  Processed 200000 records (1054.9636 rec/s)
050919 171528  Processed 220000 records (1069.2328 rec/s)
050919 171547  Processed 240000 records (1099.868 rec/s)
050919 171832  - creating next subindex...
050919 174512  Processed 260000 records (11.328647 rec/s)
050919 200315  Processed 280000 records (2.4145627 rec/s)

It falls to 2.4 records per second...

Can somebody help me, please? 400K records is only the beginning; what will happen when it is 4M?

>050917 043332 - data in segment index/segments/20050916014401 is corrupt, using only 128115 entries.

This is the real reason for the slowdown. Technically speaking, a partially corrupted MapFile is still readable and usable, but random access to it is orders of magnitude slower.

The fix is simple: delete the "index" file in each subdirectory of the 20050916014401 segment, then run "nutch segread -fix 20050916014401". After that, re-run mergesegs; it should now work at full speed.
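
The steps above can be sketched as a small shell helper. This is only an illustration: the function name remove_segment_indexes is made up here, and the segment path index/segments/20050916014401 is taken from the warning line, so adjust it to wherever your segments actually live. It is also an assumption that the nutch script is on your PATH and accepts the segment path as given. Consider backing up the segment before deleting anything.

```shell
# Hypothetical helper: delete the "index" file found directly inside
# each subdirectory of the given segment directory.
remove_segment_indexes() {
  segment_dir=$1
  for sub in "$segment_dir"/*/; do
    if [ -f "${sub}index" ]; then
      rm -f -- "${sub}index"
    fi
  done
  return 0
}

# Example usage (paths are assumptions, not taken from a real setup):
#   remove_segment_indexes index/segments/20050916014401
#   nutch segread -fix index/segments/20050916014401
```

Only files named exactly "index" are removed; the data files next to them are left alone so that the -fix run has something to rebuild from.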

NB: if any other segments give you this warning, do the same for them before you run mergesegs.
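
To find every segment that produced this warning, one option is to scan the log for the "is corrupt" message. The sketch below assumes the warning always has the same shape as the line quoted above ("data in segment <path> is corrupt, ...") and that segment paths contain no spaces; the function name list_corrupt_segments is invented for this example.

```shell
# Hypothetical helper: print the unique segment paths that a log file
# reports as corrupt, one per line.
list_corrupt_segments() {
  log_file=$1
  sed -n 's/.*data in segment \([^ ]*\) is corrupt.*/\1/p' "$log_file" | sort -u
}

# Example usage (the log file name is an assumption):
#   list_corrupt_segments nutch-mergesegs-kunzon.com.log
```

Each path it prints is a candidate for the same treatment: delete the "index" files, then run segread -fix on that segment.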

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general