Hi, I am facing an interesting problem. I am crawling in iterative cycles, and it works fine until one of the fetch cycles is prematurely terminated due to a timeout, which results in this message being written to the log file: [Aborting with 3 hung threads.] (I am using 3 threads). Let's say that this fetch retrieved only 101 pages (out of 500) before it was terminated.
The problem is that I then see only 101 pages in the merged index, no matter how many pages were fetched in previous cycles. It seems to me that it is not possible to build a healthy merged index if one of the fetches timed out. If I open the index with Luke, it shows that the total number of documents is only 101. Here are the details. My script looks like the following example:

--- start ---
#!/bin/bash
d=crawl.test
bin/nutch generate $d/crawldb $d/segments -topN 500
s=`ls -d $d/segments/2* | tail -1`
bin/nutch fetch $s
bin/nutch updatedb $d/crawldb $s
bin/nutch invertlinks $d/linkdb $d/segments
bin/nutch index $d/indexes $d/crawldb $d/linkdb $s
bin/nutch dedup $d/indexes
bin/nutch merge $d/index $d/indexes
--- end ---

So once the fetch operation is terminated, the rest of the tasks (updatedb, indexing, ...) are executed anyway. It also seems to me that in this case it does not matter whether I execute merge at the end of every cycle or just once after the desired crawl depth is reached. Can anybody explain to me what I am doing wrong?

Thanks,
Lukas
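
P.S. One workaround I have been considering, though I have not verified it, is to skip the post-processing steps when the fetch does not finish cleanly. This is only a sketch; it assumes that bin/nutch fetch returns a nonzero exit code when it aborts with hung threads, which I am not sure is actually the case:

--- sketch ---
#!/bin/bash
d=crawl.test
bin/nutch generate $d/crawldb $d/segments -topN 500
s=`ls -d $d/segments/2* | tail -1`

# Assumption (unverified): a fetch aborted by the timeout exits with a nonzero status.
if bin/nutch fetch $s; then
  bin/nutch updatedb $d/crawldb $s
  bin/nutch invertlinks $d/linkdb $d/segments
  bin/nutch index $d/indexes $d/crawldb $d/linkdb $s
  bin/nutch dedup $d/indexes
  bin/nutch merge $d/index $d/indexes
else
  # Skip updatedb/index/merge so a partially fetched segment does not end up in the index.
  echo "fetch of segment $s did not finish cleanly, skipping updatedb/index/merge" >&2
fi
--- end sketch ---

If the assumption about the exit code is wrong, I suppose the same idea could be implemented by checking the fetch log for the "Aborting with ... hung threads" message instead, but I would prefer to understand the root cause first.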
