Thomas:

I am using depth 4. There must be some other problem: crawling http://www.cs.stanford.edu at depth 4 usually takes more than 30 minutes.

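To rule the URL filter in or out, the rule quoted further down can be checked outside Nutch. What follows is only a sketch in Python, not Nutch code. It assumes that `+`/`-` prefixed rules are tried in order, that the first match decides, and that a URL matching no rule is dropped. It uses Python's `re`, which may differ in details from the Java regex engine that Nutch uses. The helpers `accepts` and `compiles` and the rule list are hypothetical names made up for this example.

```python
import re


def compiles(pattern):
    """Report whether `pattern` is a valid regular expression for Python's re."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def accepts(rules, url):
    """Hypothetical first-match-wins filter.

    `rules` is a list of strings such as '+regex' or '-regex'. The first rule
    whose pattern is found in `url` decides; a URL matching no rule is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


# The pattern as quoted in the thread: '\(' is a literal parenthesis, so the
# closing ')' has no partner and Python rejects the pattern.
print(compiles(r".\(doc|ps)$"))

# A variant that escapes the dot instead and keeps the group intact.
print(compiles(r"\.(doc|ps)$"))

# With only an extension rule in place, a document URL passes but the
# seed URL itself matches nothing and is rejected.
rules = [r"+\.(doc|ps)$"]
print(accepts(rules, "http://www.cs.stanford.edu/paper.ps"))
print(accepts(rules, "http://www.cs.stanford.edu/"))
```

If the real filter file behaved like the last case, the seed URL would never be injected, which would be consistent with the "Added 0 pages" line in the log. This does not prove that the filter is the cause.
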
On Wed, 30 Mar 2005 08:22:46 -0800 (PST), thomas delnoij <[EMAIL PROTECTED]> wrote:
> Eric,
>
> what crawl depth are you using? Maybe you can try to
> increase the depth.
>
> Please answer to the mail list, so others can help as
> well.
>
> Rgrds, Thomas
>
> --- Eric Money <[EMAIL PROTECTED]> wrote:
> > Thomas:
> >
> > I've also tried that way. But still when I crawl
> > www.cs.stanford.edu, it only lasts for less than 1 min,
> > and returns 0 results, maybe the problem is not in the
> > nutch-urlfilter.txt? Thanks.
> >
> > On Tue, 29 Mar 2005 23:17:35 -0800 (PST), thomas delnoij
> > <[EMAIL PROTECTED]> wrote:
> > > Eric,
> > >
> > > > 2. add "+.\(msword|ps)$" into nutch-urlfiter.txt
> > >
> > > I think this should be +.\(doc|ps)$
> > >
> > > Rgrds, Thomas
> > >
> > > > Then I try to crawl www.standford.edu with depth 4,
> > > > it's amazing that the crawl finished within 1 min,
> > > > but the total result is 0, what might be the problem?
> > > >
> > > > here is the crawling info, if you wanna take a look:
> > > > =======================================
> > > > run java in /usr
> > > > expr: syntax error
> > > > 050329 143057 No NutchFileSystem indicated, so defaulting to local fs.
> > > > 050329 143057 loading file:/Users/newuser/nutch-0.6/conf/nutch-default.xml
> > > > 050329 143059 loading file:/Users/newuser/nutch-0.6/conf/crawl-tool.xml
> > > > 050329 143059 loading file:/Users/newuser/nutch-0.6/conf/nutch-site.xml
> > > > 050329 143059 crawl started in: crawl.03294
> > > > 050329 143059 rootUrlFile = urls
> > > > 050329 143059 threads = 10
> > > > 050329 143059 depth = 4
> > > > 050329 143100 Created webdb at LocalFS,/Users/newuser/nutch-0.6/crawl.03294/db
> > > > 050329 143100 Starting URL processing
> > > > 050329 143100 Using URL filter: net.nutch.net.RegexURLFilter
> > > > 050329 143100 found resource regex-urlfilter.txt at file:/Users/newuser/nutch-0.6/conf/regex-urlfilter.txt
> > > > ...050329 143100 Added 0 pages
> > > > 050329 143100 FetchListTool started
> > > > 050329 143100 Overall processing: Sorted 0 entries in 0.0 seconds.
> > > > 050329 143100 Overall processing: Sorted NaN entries/second
> > > > 050329 143100 FetchListTool completed
> > > > 050329 143100 Plugins: looking in: /Users/newuser/nutch-0.6/plugins
> > > > 050329 143100 not including: /Users/newuser/nutch-0.6/plugins/clustering-carrot2
> > > > 050329 143100 not including: /Users/newuser/nutch-0.6/plugins/creativecommons
> > > > 050329 143100 parsing: /Users/newuser/nutch-0.6/plugins/index-basic/plugin.xml
> > > > 050329 143101 not including: /Users/newuser/nutch-0.6/plugins/index-more
> > > > 050329 143101 not including: /Users/newuser/nutch-0.6/plugins/language-identifier
> > > > 050329 143101 not including: /Users/newuser/nutch-0.6/plugins/ontology
> > > > 050329 143101 not including: /Users/newuser/nutch-0.6/plugins/parse-ext
> > > > 050329 143101 parsing: /Users/newuser/nutch-0.6/plugins/parse-html/plugin.xml
> > > > 050329 143101 not including: /Users/newuser/nutch-0.6/plugins/parse-mp3
> > > > 050329 143101 not including: /Users/newuser/nutch-0.6/plugins/parse-msword
> > > > 050329 143101 parsing: /Users/newuser/nutch-0.6/plugins/parse-pdf/plugin.xml
> > > > 050329 143101 not including: /Users/newuser/nutch-0.6/plugins/parse-rtf
> > > > 050329 143101 parsing: /Users/newuser/nutch-0.6/plugins/parse-text/plugin.xml
> > > > 050329 143101 not including: /Users/newuser/nutch-0.6/plugins/protocol-file
> > > > 050329 143101 not including: /Users/newuser/nutch-0.6/plugins/protocol-ftp
> > > > 050329 143101 parsing: /Users/newuser/nutch-0.6/plugins/protocol-http/plugin.xml
> > > > 050329 143101 parsing: /Users/newuser/nutch-0.6/plugins/query-basic/plugin.xml
> > > > 050329 143101 parsing: /Users/newuser/nutch-0.6/plugins/query-site/plugin.xml
> > > > 050329 143101 parsing: /Users/newuser/nutch-0.6/plugins/query-url/plugin.xml
> > > > 050329 143101 logging at INFO
> > > > 050329 143102 Updating /Users/newuser/nutch-0.6/crawl.03294/db
> > > > 050329 143102 Updating for /Users/newuser/nutch-0.6/crawl.03294/segments/20050329143100
> > > > 050329 143103 Finishing update
> > > > 050329 143103 Update finished
> > > > 050329 143103 FetchListTool started
> > > > 050329 143103 Overall processing: Sorted 0 entries in 0.0 seconds.
> > > > 050329 143103 Overall processing: Sorted NaN entries/second
> > > > 050329 143103 FetchListTool completed
> > > > 050329 143103 logging at INFO
> > > > 050329 143104 Updating /Users/newuser/nutch-0.6/crawl.03294/db
> > > > 050329 143105 Updating for /Users/newuser/nutch-0.6/crawl.03294/segments/20050329143103
> > > > 050329 143105 Finishing update
> > > > 050329 143105 Update finished
> > > > 050329 143105 FetchListTool started
> > > > 050329 143105 Overall processing: Sorted 0 entries in 0.0 seconds.
> > > > 050329 143105 Overall processing: Sorted NaN entries/second
> > > > 050329 143105 FetchListTool completed
> > > > 050329 143105 logging at INFO
> > > > 050329 143106 Updating /Users/newuser/nutch-0.6/crawl.03294/db
> > > > 050329 143107 Updating for /Users/newuser/nutch-0.6/crawl.03294/segments/20050329143105
> > > > 050329 143107 Finishing update
> > > > 050329 143107 Update finished
> > > > 050329 143107 FetchListTool started
> > > > 050329 143107 Overall processing: Sorted 0 entries in 0.0 seconds.
> > > > 050329 143107 Overall processing: Sorted NaN entries/second
> > > > 050329 143107 FetchListTool completed
> > > > 050329 143107 logging at INFO
> > > > 050329 143108 Updating /Users/newuser/nutch-0.6/crawl.03294/db
> > > > 050329 143109 Updating for /Users/newuser/nutch-0.6/crawl.03294/segments/20050329143107
> > > > 050329 143109 Finishing update
> > > > 050329 143109 Update finished
> > > > 050329 143109 FetchListTool started
> > > > 050329 143109 Overall processing: Sorted 0 entries in 0.0 seconds.
> > > > 050329 143109 Overall processing: Sorted NaN entries/second
> > > > 050329 143109 FetchListTool completed
> > > > 050329 143110 logging at INFO
> > > > 050329 143111 indexing segment: /Users/newuser/nutch-0.6/crawl.03294/segments/20050329143109
> > > > 050329 143111 * Opening segment 20050329143109
> > > > 050329 143111 * Indexing segment 20050329143109
> > > > 050329 143111 * Optimizing index...
> > > > 050329 143111 * Moving index to NFS if needed...
> > > > 050329 143111 DONE indexing segment 20050329143109: total 1 records in
> > > === message truncated ===
