Hello, I have a fairly big crawl (50 000 documents) that I'd like to re-parse without actually having to re-fetch it. I tried going segment by segment. Let's say we have the following segment:
nutch/crawl-xyz/20080516162726

It contains the following directories:

nutch/crawl-xyz/20080516162726/content
nutch/crawl-xyz/20080516162726/crawl_fetch
nutch/crawl-xyz/20080516162726/crawl_generate
nutch/crawl-xyz/20080516162726/crawl_parse
nutch/crawl-xyz/20080516162726/parse_data
nutch/crawl-xyz/20080516162726/parse_text

I first renamed the last three directories (crawl_parse, parse_data and parse_text) to something else, the idea being to trick the ./bin/nutch parse command into regenerating them. I then launched it:

./bin/nutch parse ./crawl-xyz/segments/20080516162752

It seemed to work: the three directories were reconstructed. But when I checked the contents of the crawl_parse directory, the generated part-00000 file was ridiculously small, 1k, compared to the 350k of the original one. I guess I did something wrong...

My objective is actually fairly simple: force the execution of one homemade parse plugin (it implements HtmlParseFilter) on all the stored fetched data without, as I said, re-fetching everything. I know how to take care of the rest to reconstruct the index. Is this actually possible?

Thank you and have a good weekend!
David
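
P.S. In case it helps, here is roughly the sequence of commands I ran on the segment (<SEG> stands for the segment directory; the .bak names are just an example, I actually renamed the directories to something arbitrary). The content and crawl_fetch directories were left untouched.

  # move the existing parse output aside so ./bin/nutch parse will recreate it
  mv <SEG>/crawl_parse <SEG>/crawl_parse.bak
  mv <SEG>/parse_data  <SEG>/parse_data.bak
  mv <SEG>/parse_text  <SEG>/parse_text.bak

  # re-parse the already-fetched content stored in the segment
  ./bin/nutch parse <SEG>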
