Hi Harry,

Yes indeed. It appears to work for me too. Thank you!
nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/crawl/segments/20100415221103
LinkDb: adding segment: file:/crawl/segments/20100415221122
LinkDb: adding segment: file:/crawl/segments/20100415221141
LinkDb: adding segment: file:/crawl/segments/20100415221032
LinkDb: adding segment: file:/crawl/segments/20100415221019
LinkDb: adding segment: file:/crawl/segments/20100415221046
LinkDb: done

nutch index crawl/indexes crawl/crawldb/ crawl/linkdb crawl/segments/20100415221103 crawl/segments/20100415221122 crawl/segments/20100415221141 crawl/segments/20100415221032 crawl/segments/20100415221019 crawl/segments/20100415221046
Indexer: starting
Indexer: done

1) Now that I have verified that Harry's Nutch 1.1 workaround completes for my small test crawl: how do I scale the index step above when the data grows more than 50x and I can no longer fit all the segments onto a single command line? Maybe this is a non-issue; hopefully it will be fixed before Nutch 1.1 Release Candidate #1 is voted in.

2) Additionally, I have now lost my ability to peer into the data structures. Both Luke 0.9.9.1 and Luke 1.0.1 report: "No valid directory at the location, try another location."

Oh well, any suggestions on #1 or #2 would be appreciated. Thanks again!

-m.

On Fri, 2010-04-16 at 08:44 +0800, Harry Nutch wrote:
> I am new to Nutch and still trying to figure out the code flow. However,
> as a workaround to issue #1, after the crawl finishes you could run the
> linkdb and index commands separately from Cygwin.
>
> $ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
>
> $ bin/nutch index crawl/indexes crawl/crawldb/ crawl/linkdb
> crawl/segments/20100415163946 crawl/segments/20100415164106
>
> This seems to work for me. You may have already tried this workaround,
> but just in case.
>
> -Harry
>
> On Fri, Apr 16, 2010 at 3:34 AM, matthew a.
> grisius <mgris...@comcast.net> wrote:
> >
> > Two observations using the Nutch 1.1 nightly build
> > nutch-2010-04-14_04-00-47:
> >
> > 1) Previously I was using Nutch 1.0 to crawl successfully, but had
> > problems w/ parse-pdf. I decided to try Nutch 1.1 w/ parse-tika, which
> > appears to parse all of the 'problem' pdfs that parse-pdf could not
> > handle. The crawldb and segments directories are created and appear to
> > be valid. However, the overall crawl does not finish now:
> >
> > nutch crawl urls/urls -dir crawl -depth 10
> > ...
> > Fetcher: done
> > CrawlDb update: starting
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/segments/20100415015102]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: true
> > CrawlDb update: URL filtering: true
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: done
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: starting
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: 0 records selected for fetching, exiting ...
> > Exception in thread "main" java.lang.NullPointerException
> >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:133)
> >
> > Nutch 1.0 would complete like this:
> >
> > nutch crawl urls/urls -dir crawl -depth 10
> > ...
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=7 - no more URLs to fetch.
> > LinkDb: starting
> > LinkDb: linkdb: crawl/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225731
> > LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225644
> > LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225749
> > LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225808
> > LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225713
> > LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225937
> > LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225656
> > LinkDb: done
> > Indexer: starting
> > Indexer: done
> > Dedup: starting
> > Dedup: adding indexes in: crawl/indexes
> > Dedup: done
> > merging indexes to: crawl/index
> > Adding file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/indexes/part-00000
> > done merging
> > crawl finished: crawl
> >
> > Any ideas?
> >
> > 2) If there is a 'space' in any component dir then $NUTCH_OPTS is
> > invalid and causes this problem:
> >
> > m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin> nutch crawl urls/urls -dir crawl -depth 10 -topN 10
> > NUTCH_OPTS: -Dhadoop.log.dir=/home/mag/Desktop/untitled folder/nutch-2010-04-14_04-00-47/logs -Dhadoop.log.file=hadoop.log -Djava.library.path=/home/mag/Desktop/untitled folder/nutch-2010-04-14_04-00-47/lib/native/Linux-i386-32
> > Exception in thread "main" java.lang.NoClassDefFoundError: folder/nutch-2010-04-14_04-00-47/logs
> > Caused by: java.lang.ClassNotFoundException: folder.nutch-2010-04-14_04-00-47.logs
> >         at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
> >         at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
> >         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> >         at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
> >         at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
> > Could not find the main class: folder/nutch-2010-04-14_04-00-47/logs. Program will exit.
> > m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin>
> >
> > Obviously the workaround is to rename 'untitled folder' to
> > 'untitledFolderWithNoSpaces'.
> >
> > Thanks, any help would be appreciated w/ issue #1 above.
> >
> > -m.
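P.S. On my question #1 above: the segment list doesn't actually have to be typed by hand — a shell glob can build the argument list for the index step. A minimal sketch, assuming the crawl/segments layout shown earlier (the mkdir lines only stand in for a real crawl so the example is self-contained; the echoed command is a dry run, not a real invocation):

```shell
#!/bin/sh
# Build the segment argument list with a glob instead of typing each
# segment by hand. The mkdir lines fake a crawl layout for this demo.
CRAWL=crawl
mkdir -p "$CRAWL/segments/20100415221103" "$CRAWL/segments/20100415221122"

# The glob expands to every segment directory under crawl/segments,
# and "$@" passes them all through untouched.
set -- "$CRAWL"/segments/*
echo "segments found: $#"

# Dry run: echo the index command that would be executed.
echo bin/nutch index "$CRAWL/indexes" "$CRAWL/crawldb" "$CRAWL/linkdb" "$@"
```

For a crawl so large that even the expanded glob exceeds the kernel's argument-length limit, indexing each segment in a loop into its own output directory and then combining the results afterwards (Nutch 1.x ships an index merge command, `bin/nutch merge`, which appears in the 1.0 log above as "merging indexes to: crawl/index") should sidestep the command-line limit entirely.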
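P.P.S. On the 'space in the path' problem (#2 in the quoted message): the error output points at ordinary shell word splitting — the launcher script expands `$NUTCH_OPTS` unquoted, so the path is split at the space and java receives `folder/nutch-2010-04-14_04-00-47/logs` as a separate word, which it then treats as the main class. A small self-contained sketch of the effect, using the same path from the error above:

```shell
#!/bin/sh
# Why a space in the install path breaks the launcher script:
# unquoted variable expansion word-splits the path at the space.
LOG_DIR="/home/mag/Desktop/untitled folder/nutch-2010-04-14_04-00-47/logs"

# Unquoted: the shell splits the value into two words, so java sees
# "folder/nutch-2010-04-14_04-00-47/logs" as a class name.
set -- -Dhadoop.log.dir=$LOG_DIR
echo "unquoted word count: $#"   # prints 2

# Quoted: the option survives as a single word. Quoting the expansions
# inside the script is the real fix, rather than renaming the directory.
set -- "-Dhadoop.log.dir=$LOG_DIR"
echo "quoted word count: $#"     # prints 1
```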