All, I want to get Nutch to index a file system. My first approach was to NFS-mount the file system and let Nutch crawl through the hierarchy over HTTP via Apache. This turned out to be fairly slow, at roughly 3,000 fetches per hour. My next approach was to go via file:/// and generate a list of files to be crawled. This file list is fairly big (~200,000 entries), and with the current 0.8.1 release of Nutch the fetcher just freezes right at the end of a crawl. Other strategies, such as splitting the file list into smaller parts (~20,000 entries each) and subsequently merging the indexes, still fail for the same reason.
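For reference, here is a rough sketch of how a file list like this can be generated and split into chunks. It is only an illustration, not the exact script in use: the function names, the seeds-NNNN.txt naming and the default chunk size of 20,000 are assumptions made for this example.

```python
import os
from pathlib import Path


def list_file_urls(root):
    """Yield a file:// URL for every regular file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # as_uri() needs an absolute path and percent-encodes special characters
            yield Path(dirpath, name).resolve().as_uri()


def _flush(chunk, out_dir, index):
    """Write one chunk of URLs to a numbered seed file, one URL per line."""
    path = out_dir / f"seeds-{index:04d}.txt"  # hypothetical naming scheme
    path.write_text("\n".join(chunk) + "\n", encoding="utf-8")
    return path


def write_seed_chunks(urls, out_dir, chunk_size=20000):
    """Split an iterable of URLs into seed files of at most chunk_size entries."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, chunk = [], []
    for url in urls:
        chunk.append(url)
        if len(chunk) == chunk_size:
            written.append(_flush(chunk, out_dir, len(written)))
            chunk = []
    if chunk:  # remaining entries that did not fill a whole chunk
        written.append(_flush(chunk, out_dir, len(written)))
    return written
```

Calling write_seed_chunks(list_file_urls("/some/mount"), "seeds") would leave one text file per chunk in the seeds directory, each of which could then be used as the URL list for a separate crawl. This assumes the file protocol plugin is enabled and the URL filters let file: URLs through.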
Is anybody in the community doing an extensive crawl of a file system with Nutch? If so, what's your setup? Cheers, Bruno
