Thomas:

I am using depth 4. There must be some problem: crawling
http://www.cs.stanford.edu at depth 4 usually takes
more than 30 minutes.



On Wed, 30 Mar 2005 08:22:46 -0800 (PST), thomas delnoij
<[EMAIL PROTECTED]> wrote:
> Eric,
> 
> What crawl depth are you using? Maybe you can try
> increasing the depth.
> 
> Please reply to the mailing list, so others can help as
> well.
> 
> Rgrds, Thomas
> 
> 
> --- Eric Money <[EMAIL PROTECTED]> wrote:
> 
> > Thomas:
> >
> > I've also tried that. But when I crawl
> > www.cs.stanford.edu, it still finishes in less than
> > 1 min and returns 0 results. Maybe the problem
> > is not in nutch-urlfilter.txt? Thanks.
> >
> >
> >
> > On Tue, 29 Mar 2005 23:17:35 -0800 (PST), thomas delnoij
> > <[EMAIL PROTECTED]> wrote:
> > > Eric,
> > >
> > > > 2. add "+.\(msword|ps)$" into nutch-urlfilter.txt
> > >
> > > I think this should be +.\(doc|ps)$
> > >
> > > Rgrds, Thomas
> > >
> > > >
> > > > Then I tried to crawl www.stanford.edu with depth 4.
> > > > Surprisingly, the crawl finished within 1 min, but the
> > > > total result is 0. What might be the problem?
> > > >
> > > > Here is the crawl log, if you want to take a look:
> > > > =======================================
> > > > run java in /usr
> > > > expr: syntax error
> > > > 050329 143057 No NutchFileSystem indicated, so defaulting to local fs.
> > > > 050329 143057 loading file:/Users/newuser/nutch-0.6/conf/nutch-default.xml
> > > > 050329 143059 loading file:/Users/newuser/nutch-0.6/conf/crawl-tool.xml
> > > > 050329 143059 loading file:/Users/newuser/nutch-0.6/conf/nutch-site.xml
> > > > 050329 143059 crawl started in: crawl.03294
> > > > 050329 143059 rootUrlFile = urls
> > > > 050329 143059 threads = 10
> > > > 050329 143059 depth = 4
> > > > 050329 143100 Created webdb at LocalFS,/Users/newuser/nutch-0.6/crawl.03294/db
> > > > 050329 143100 Starting URL processing
> > > > 050329 143100 Using URL filter: net.nutch.net.RegexURLFilter
> > > > 050329 143100 found resource regex-urlfilter.txt at file:/Users/newuser/nutch-0.6/conf/regex-urlfilter.txt
> > > > ...050329 143100 Added 0 pages
> > > > 050329 143100 FetchListTool started
> > > > 050329 143100 Overall processing: Sorted 0 entries in 0.0 seconds.
> > > > 050329 143100 Overall processing: Sorted NaN entries/second
> > > > 050329 143100 FetchListTool completed
> > > > 050329 143100 Plugins: looking in: /Users/newuser/nutch-0.6/plugins
> > > > 050329 143100 not including: /Users/newuser/nutch-0.6/plugins/clustering-carrot2
> > > > 050329 143100 not including: /Users/newuser/nutch-0.6/plugins/creativecommons
> > > > 050329 143100 parsing: /Users/newuser/nutch-0.6/plugins/index-basic/plugin.xml
> > > > 050329 143101 not including: /Users/newuser/nutch-0.6/plugins/index-more
> > > > 050329 143101 not including: /Users/newuser/nutch-0.6/plugins/language-identifier
> > > > 050329 143101 not including: /Users/newuser/nutch-0.6/plugins/ontology
> > > > 050329 143101 not including: /Users/newuser/nutch-0.6/plugins/parse-ext
> > > > 050329 143101 parsing: /Users/newuser/nutch-0.6/plugins/parse-html/plugin.xml
> > > > 050329 143101 not including: /Users/newuser/nutch-0.6/plugins/parse-mp3
> > > > 050329 143101 not including: /Users/newuser/nutch-0.6/plugins/parse-msword
> > > > 050329 143101 parsing: /Users/newuser/nutch-0.6/plugins/parse-pdf/plugin.xml
> > > > 050329 143101 not including: /Users/newuser/nutch-0.6/plugins/parse-rtf
> > > > 050329 143101 parsing: /Users/newuser/nutch-0.6/plugins/parse-text/plugin.xml
> > > > 050329 143101 not including: /Users/newuser/nutch-0.6/plugins/protocol-file
> > > > 050329 143101 not including: /Users/newuser/nutch-0.6/plugins/protocol-ftp
> > > > 050329 143101 parsing: /Users/newuser/nutch-0.6/plugins/protocol-http/plugin.xml
> > > > 050329 143101 parsing: /Users/newuser/nutch-0.6/plugins/query-basic/plugin.xml
> > > > 050329 143101 parsing: /Users/newuser/nutch-0.6/plugins/query-site/plugin.xml
> > > > 050329 143101 parsing: /Users/newuser/nutch-0.6/plugins/query-url/plugin.xml
> > > > 050329 143101 logging at INFO
> > > > 050329 143102 Updating /Users/newuser/nutch-0.6/crawl.03294/db
> > > > 050329 143102 Updating for /Users/newuser/nutch-0.6/crawl.03294/segments/20050329143100
> > > > 050329 143103 Finishing update
> > > > 050329 143103 Update finished
> > > > 050329 143103 FetchListTool started
> > > > 050329 143103 Overall processing: Sorted 0 entries in 0.0 seconds.
> > > > 050329 143103 Overall processing: Sorted NaN entries/second
> > > > 050329 143103 FetchListTool completed
> > > > 050329 143103 logging at INFO
> > > > 050329 143104 Updating /Users/newuser/nutch-0.6/crawl.03294/db
> > > > 050329 143105 Updating for /Users/newuser/nutch-0.6/crawl.03294/segments/20050329143103
> > > > 050329 143105 Finishing update
> > > > 050329 143105 Update finished
> > > > 050329 143105 FetchListTool started
> > > > 050329 143105 Overall processing: Sorted 0 entries in 0.0 seconds.
> > > > 050329 143105 Overall processing: Sorted NaN entries/second
> > > > 050329 143105 FetchListTool completed
> > > > 050329 143105 logging at INFO
> > > > 050329 143106 Updating /Users/newuser/nutch-0.6/crawl.03294/db
> > > > 050329 143107 Updating for /Users/newuser/nutch-0.6/crawl.03294/segments/20050329143105
> > > > 050329 143107 Finishing update
> > > > 050329 143107 Update finished
> > > > 050329 143107 FetchListTool started
> > > > 050329 143107 Overall processing: Sorted 0 entries in 0.0 seconds.
> > > > 050329 143107 Overall processing: Sorted NaN entries/second
> > > > 050329 143107 FetchListTool completed
> > > > 050329 143107 logging at INFO
> > > > 050329 143108 Updating /Users/newuser/nutch-0.6/crawl.03294/db
> > > > 050329 143109 Updating for /Users/newuser/nutch-0.6/crawl.03294/segments/20050329143107
> > > > 050329 143109 Finishing update
> > > > 050329 143109 Update finished
> > > > 050329 143109 FetchListTool started
> > > > 050329 143109 Overall processing: Sorted 0 entries in 0.0 seconds.
> > > > 050329 143109 Overall processing: Sorted NaN entries/second
> > > > 050329 143109 FetchListTool completed
> > > > 050329 143110 logging at INFO
> > > > 050329 143111 indexing segment: /Users/newuser/nutch-0.6/crawl.03294/segments/20050329143109
> > > > 050329 143111 * Opening segment 20050329143109
> > > > 050329 143111 * Indexing segment 20050329143109
> > > > 050329 143111 * Optimizing index...
> > > > 050329 143111 * Moving index to NFS if needed...
> > > > 050329 143111 DONE indexing segment 20050329143109: total 1 records in
> >
> === message truncated ===
> 
>
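
For anyone reading this thread later: the "Added 0 pages" line in the
quoted log is consistent with the seed URL being rejected by the URL
filter before anything is fetched. Below is a minimal sketch of a
first-match-wins "+"/"-" regex filter of the kind the thread is
discussing. It is written in Python, not Nutch's Java, and it is not
Nutch's actual code. The rules, the example.edu URLs, and the function
names are hypothetical, and Python's re module may behave differently
from the regex engine Nutch uses. The sketch also checks whether the
two spellings of the suffix rule compile at all: in Python's re, "\("
is a literal parenthesis, so ".\(doc|ps)$" leaves the ")" unmatched.

```python
import re


def load_rules(lines):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first matching rule decides; a URL matching no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


def compiles(pattern):
    """Return True if Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Hypothetical rule set, for illustration only.
RULES = load_rules([
    r"-\.(gif|jpg|png)$",                      # skip images
    r"+\.(doc|ps)$",                           # accept .doc and .ps files
    r"+^http://([a-z0-9]*\.)*example\.edu/",   # accept one domain
])

if __name__ == "__main__":
    for url in ("http://www.example.edu/paper.ps",
                "http://www.example.edu/logo.gif",
                "http://www.other.org/"):
        print(url, accepts(RULES, url))
    for pattern in (r".\(doc|ps)$", r"\.(doc|ps)$"):
        print(pattern, compiles(pattern))
```

With a filter shaped like this, a seed URL that matches no "+" rule is
dropped, so the crawl has nothing to fetch. Checking the seed URL
against the rules by hand is a quick way to rule that cause in or out.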
