Hi Harry,

Yes indeed. It appears to work for me too. Thank you!

nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/crawl/segments/20100415221103
LinkDb: adding segment: file:/crawl/segments/20100415221122
LinkDb: adding segment: file:/crawl/segments/20100415221141
LinkDb: adding segment: file:/crawl/segments/20100415221032
LinkDb: adding segment: file:/crawl/segments/20100415221019
LinkDb: adding segment: file:/crawl/segments/20100415221046
LinkDb: done

nutch index crawl/indexes crawl/crawldb/  crawl/linkdb
crawl/segments/20100415221103 crawl/segments/20100415221122
crawl/segments/20100415221141 crawl/segments/20100415221032
crawl/segments/20100415221019 crawl/segments/20100415221046
Indexer: starting
Indexer: done

1) After verifying that Harry's Nutch 1.1 workaround completes the work
for my small test crawl: how do I scale the index step above when the
data grows more than 50x and I can no longer fit all the segments onto
a single command line? Maybe this is a non-issue; hopefully it will be
fixed before the Nutch 1.1 Release Candidate #1 is voted in.
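One direction I may try (untested sketch; it assumes the crawl/segments
layout shown in the log above): let the shell enumerate the segment
directories instead of typing each timestamped path by hand.

```shell
# Sketch: expand the segment list with a glob instead of listing each
# timestamped directory manually. Paths assume the layout shown above.
# The echo makes this a dry run; drop it to actually run the indexer.
SEGMENTS=$(ls -d crawl/segments/* 2>/dev/null)
echo bin/nutch index crawl/indexes crawl/crawldb/ crawl/linkdb $SEGMENTS
```

This sidesteps hand-typing, though a very large crawl could still
overflow the shell's argument-length limit.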

2) Additionally, I have now lost the ability to peer into the data
structures. Both Luke 0.9.9.1 and Luke 1.0.1 report:

"No valid directory at the location, try another location."
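One thing I have not tried yet (just a guess): the standalone index
step seems to write one Lucene index per reduce task under
crawl/indexes, rather than the single merged crawl/index that 'nutch
crawl' produces, so the top-level directory itself would not be a
valid Lucene index for Luke.

```shell
# Guess: "nutch index" leaves per-part Lucene indexes (part-00000, ...)
# under crawl/indexes, so Luke must be pointed one level deeper.
ls crawl/indexes
# If a part-00000 directory is listed, try opening
# crawl/indexes/part-00000 in Luke instead of crawl/indexes itself.
```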

Oh well, any suggestions on #1 or #2 would be appreciated. Thanks again!

-m.


On Fri, 2010-04-16 at 08:44 +0800, Harry Nutch wrote:
> I am new to Nutch and still trying to figure out the code flow.
> However, as a workaround to issue #1, after the crawl finishes you
> could run the linkdb and index commands separately from cygwin.
> 
> $bin/nutch invertlinks crawl/linkdb -dir crawl/segments
> 
> $ bin/nutch index crawl/indexes crawl/crawldb/  crawl/linkdb
> crawl/segments/20100415163946  crawl/segments/20100415164106
> 
> This seems to work for me. You may have already tried this workaround, but
> just in case.
> 
> -Harry
> 
> On Fri, Apr 16, 2010 at 3:34 AM, matthew a. grisius 
> <mgris...@comcast.net>wrote:
> 
> > Two observations using the nutch 1.1 nightly build
> > nutch-2010-04-14_04-00-47:
> >
> > 1) Previously I was using nutch 1.0 to crawl successfully, but had
> > problems w/ parse-pdf. I decided to try nutch 1.1 w/ parse-tika, which
> > appears to parse all of the 'problem' pdfs that parse-pdf could not
> > handle. The crawldb and segments directories are created and appear to
> > be valid. However, the overall crawl does not finish now:
> >
> > nutch crawl urls/urls -dir crawl -depth 10
> > ...
> > Fetcher: done
> > CrawlDb update: starting
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/segments/20100415015102]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: true
> > CrawlDb update: URL filtering: true
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: done
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: starting
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: 0 records selected for fetching, exiting ...
> > Exception in thread "main" java.lang.NullPointerException
> >        at org.apache.nutch.crawl.Crawl.main(Crawl.java:133)
> >
> > Nutch 1.0 would complete like this:
> >
> > nutch crawl urls/urls -dir crawl -depth 10
> > ...
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=7 - no more URLs to fetch.
> > LinkDb: starting
> > LinkDb: linkdb: crawl/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment:
> > file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225731
> > LinkDb: adding segment:
> > file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225644
> > LinkDb: adding segment:
> > file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225749
> > LinkDb: adding segment:
> > file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225808
> > LinkDb: adding segment:
> > file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225713
> > LinkDb: adding segment:
> > file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225937
> > LinkDb: adding segment:
> > file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225656
> > LinkDb: done
> > Indexer: starting
> > Indexer: done
> > Dedup: starting
> > Dedup: adding indexes in: crawl/indexes
> > Dedup: done
> > merging indexes to: crawl/index
> > Adding
> > file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/indexes/part-00000
> > done merging
> > crawl finished: crawl
> >
> > Any ideas?
> >
> >
> > 2) if there is a 'space' in any component dir then $NUTCH_OPTS is
> > invalid and causes this problem:
> >
> > m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin> nutch
> > crawl urls/urls -dir crawl -depth 10 -topN 10
> > NUTCH_OPTS:  -Dhadoop.log.dir=/home/mag/Desktop/untitled
> > folder/nutch-2010-04-14_04-00-47/logs -Dhadoop.log.file=hadoop.log
> > -Djava.library.path=/home/mag/Desktop/untitled
> > folder/nutch-2010-04-14_04-00-47/lib/native/Linux-i386-32
> > Exception in thread "main" java.lang.NoClassDefFoundError:
> > folder/nutch-2010-04-14_04-00-47/logs
> > Caused by: java.lang.ClassNotFoundException:
> > folder.nutch-2010-04-14_04-00-47.logs
> >        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
> >        at java.security.AccessController.doPrivileged(Native Method)
> >        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
> >        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
> >        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> >        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
> >        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
> > Could not find the main class: folder/nutch-2010-04-14_04-00-47/logs.
> > Program will exit.
> > m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin>
> >
> > Obviously the workaround is to rename 'untitled folder' to
> > 'untitledFolderWithNoSpaces'.
> >
> > Thanks, any help would be appreciated with issue #1 above.
> >
> > -m.
> >
> >
> >
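
Regarding the NUTCH_OPTS "space in a directory name" issue quoted
above: the failure looks like plain shell word splitting. The unquoted
$NUTCH_OPTS expansion in the launcher script breaks the -D options
apart at the space, which is how "folder/nutch-.../logs" ends up
parsed as a class name. A minimal demonstration of the mechanism
(hypothetical path; this is not a patch to bin/nutch):

```shell
# A path containing a space splits into two words when expanded
# unquoted, so one -D option becomes two separate arguments.
LOGDIR="/home/mag/Desktop/untitled folder/logs"   # hypothetical path
set -- -Dhadoop.log.dir=$LOGDIR                   # unquoted: splits
echo "$# args"                                    # prints: 2 args
set -- "-Dhadoop.log.dir=$LOGDIR"                 # quoted: stays whole
echo "$# args"                                    # prints: 1 args
```

So beyond renaming the directory, quoting the expansions inside the
launcher script should also avoid the problem.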
