Hi,

My name is Mohamed, and I'm working on a project to integrate Nutch with Heritrix. I converted the ARC files produced by Heritrix into Nutch segments using ArcSegmentCreator:
    $ ./bin/nutch org.apache.nutch.tools.arc.ArcSegmentCreator <ArcFiles> <ArcCrawlDir/segments>

However, the command prints the following messages:

    Ignoring position: 22878
    Ignoring position: 36616
    Ignoring position: 152183
    Ignoring position: 167752
    Ignoring position: 293285
    Ignoring position: 334078
    ...
    Ignoring position: 54757983
    Ignoring position: 54891832

In ArcCrawlDir I do find all the expected files:

    /nutch-1.0/ArcCrawlDir/segments/20090527165114$ ls -R
    .:
    content  crawl_fetch  crawl_parse  parse_data  parse_text

    ./content:
    part-00000

    ./content/part-00000:
    data  index

    ./crawl_fetch:
    part-00000

    ./crawl_fetch/part-00000:
    data  index

    ./crawl_parse:
    part-00000

    ./parse_data:
    part-00000

    ./parse_data/part-00000:
    data  index

    ./parse_text:
    part-00000

    ./parse_text/part-00000:
    data  index

But the size of this directory is only 37.1 KB, while the ARC file is 60 MB, which suggests that the segment content is empty.

Please, I need your help.

Thanks,

--
-=MBB=-
