Gentlemen:

On 05/03/06, Richard Braman <[EMAIL PROTECTED]> wrote:
> This sounds like your crawl didn't get anything. I have seen that
> happen when the url wasn't added right, or the filter was bad. Pipe the
> crawl to crawl.log and look in there. It should show some pages being
> fetched. If none are being fetched, something is definitely wrong with
> your filter or url file.

060305 182159 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-default.xml
060305 182200 parsing file:/home/hdiwan/nutch-0.7.1/conf/crawl-tool.xml
060305 182200 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-site.xml
060305 182200 No FS indicated, using default:local
060305 182200 crawl started in: crawl
060305 182200 rootUrlFile = urls
060305 182200 threads = 15
060305 182200 depth = 2
060305 182200 Created webdb at LocalFS,/home/hdiwan/SpectraSearch/crawl/db
060305 182200 Starting URL processing
060305 182200 Plugins: looking in: /home/hdiwan/nutch-0.7.1/build/plugins
060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/nutch-extensionpoints/plugin.xml
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-file
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-ftp
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-http
060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-httpclient/plugin.xml
060305 182200 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.httpclient.Http
060305 182200 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.httpclient.Http
060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/parse-html/plugin.xml
060305 182200 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-js
060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/parse-text/plugin.xml
060305 182200 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-pdf
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-rss
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-msword
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-ext
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/index-basic
060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/index-more/plugin.xml
060305 182200 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.more.MoreIndexingFilter
060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-basic/plugin.xml
060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-more/plugin.xml
060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.more.TypeQueryFilter
060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.more.DateQueryFilter
060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-site/plugin.xml
060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-url/plugin.xml
060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-regex
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-prefix
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/creativecommons
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/language-identifier
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/clustering-carrot2
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/ontology
060305 182200 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060305 182200 Added 15 pages
060305 182200 Processing pagesByURL: Sorted 15 instructions in 0.0070 seconds.
060305 182200 Processing pagesByURL: Sorted 2142.8571428571427 instructions/second
060305 182200 Processing pagesByURL: Merged to new DB containing 15 records in 0.0040 seconds
060305 182200 Processing pagesByURL: Merged 3750.0 records/second
060305 182200 Processing pagesByMD5: Sorted 15 instructions in 0.0040 seconds.
060305 182200 Processing pagesByMD5: Sorted 3750.0 instructions/second
060305 182200 Processing pagesByMD5: Merged to new DB containing 15 records in 0.0020 seconds
060305 182200 Processing pagesByMD5: Merged 7500.0 records/second
060305 182200 Processing linksByMD5: Copied file (4096 bytes) in 0.0050 secs.
060305 182200 Processing linksByURL: Copied file (4096 bytes) in 0.0030 secs.
060305 182200 FetchListTool started
060305 182201 Processing pagesByURL: Sorted 15 instructions in 0.0030 seconds.
060305 182201 Processing pagesByURL: Sorted 5000.0 instructions/second
060305 182201 Processing pagesByURL: Merged to new DB containing 15 records in 0.0040 seconds
060305 182201 Processing pagesByURL: Merged 3750.0 records/second
060305 182201 Processing pagesByMD5: Sorted 15 instructions in 0.0040 seconds.
060305 182201 Processing pagesByMD5: Sorted 3750.0 instructions/second
060305 182201 Processing pagesByMD5: Merged to new DB containing 15 records in 0.0030 seconds
060305 182201 Processing pagesByMD5: Merged 5000.0 records/second
060305 182201 Processing linksByMD5: Copied file (4096 bytes) in 0.0010 secs.
060305 182201 Processing linksByURL: Copied file (4096 bytes) in 0.0020 secs.
060305 182201 Processing /home/hdiwan/SpectraSearch/crawl/segments/20060305182200/fetchlist.unsorted: Sorted 15 entries in 0.0030 seconds.
060305 182201 Processing /home/hdiwan/SpectraSearch/crawl/segments/20060305182200/fetchlist.unsorted: Sorted 5000.0 entries/second
060305 182201 Overall processing: Sorted 15 entries in 0.0030 seconds.
060305 182201 Overall processing: Sorted 2.0E-4 entries/second
060305 182201 FetchListTool completed
060305 182201 logging at INFO
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/04/vacation.html
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/stand_up_speak_up.html
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/murder_in_samarkand.html
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/26/automating_photographic_workfl.html
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/punching_at_the_sun_march_17.html
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/directtv_videoondemand.html
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/05/to_the_dear_neighbour_of_mine.html
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/bsc_has_no_properties_really.html
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/creative_commons_salon_march_8.html
060305 182201 http.proxy.host = null
060305 182201 http.proxy.port = 8118
060305 182201 http.timeout = 10000
060305 182201 http.content.limit = -1
060305 182201 http.agent = Spectra/200602 (Spectra; http://hasan.wits2020.net/typo/public; [EMAIL PROTECTED])
060305 182201 http.auth.ntlm.username =
060305 182201 fetcher.server.delay = 1000
060305 182201 http.max.delays = 100
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/02/gmail_whinging.html
060305 182201 Configured Client
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/nobody_likes_me.html
060305 182202 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/01/virtual_hosts_suck_pt_2.html
060305 182202 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/01/i_hate_hosting_providers.html
060305 182202 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/spectras_challenge.html
060305 182202 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/28/opml.html
060305 182203 Updating /home/hdiwan/SpectraSearch/crawl/db
060305 182203 Updating for /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
060305 182203 Processing document 0
060305 182203 Finishing update
060305 182203 Processing pagesByURL: Sorted 15 instructions in 0.0030 seconds.
060305 182203 Processing pagesByURL: Sorted 5000.0 instructions/second
060305 182203 Processing pagesByURL: Merged to new DB containing 15 records in 0.0040 seconds
060305 182203 Processing pagesByURL: Merged 3750.0 records/second
060305 182203 Processing pagesByMD5: Sorted 15 instructions in 0.0070 seconds.
060305 182203 Processing pagesByMD5: Sorted 2142.8571428571427 instructions/second
060305 182203 Processing pagesByMD5: Merged to new DB containing 15 records in 0.0070 seconds
060305 182203 Processing pagesByMD5: Merged 2142.8571428571427 records/second
060305 182203 Processing linksByMD5: Copied file (4096 bytes) in 0.0020 secs.
060305 182203 Processing linksByURL: Copied file (4096 bytes) in 0.0020 secs.
060305 182203 Update finished
060305 182203 FetchListTool started
060305 182203 Overall processing: Sorted 0 entries in 0.0 seconds.
060305 182203 Overall processing: Sorted NaN entries/second
060305 182203 FetchListTool completed
060305 182203 logging at INFO
060305 182204 Updating /home/hdiwan/SpectraSearch/crawl/db
060305 182204 Updating for /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
060305 182204 Finishing update
060305 182204 Update finished
060305 182204 Updating /home/hdiwan/SpectraSearch/crawl/segments from /home/hdiwan/SpectraSearch/crawl/db
060305 182204 reading /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
060305 182204 reading /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
060305 182204 Sorting pages by url...
060305 182204 Getting updated scores and anchors from db...
060305 182204 Sorting updates by segment...
060305 182204 Updating segments...
060305 182204 updating /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
060305 182204 Done updating /home/hdiwan/SpectraSearch/crawl/segments from /home/hdiwan/SpectraSearch/crawl/db
060305 182204 indexing segment: /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
060305 182205 * Opening segment 20060305182200
060305 182205 * Indexing segment 20060305182200
060305 182205 * Optimizing index...
060305 182205 * Moving index to NFS if needed...
060305 182205 DONE indexing segment 20060305182200: total 15 records in 0.031 s (Infinity rec/s).
060305 182205 done indexing
060305 182205 indexing segment: /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
060305 182205 * Opening segment 20060305182203
060305 182205 * Indexing segment 20060305182203
060305 182205 * Optimizing index...
060305 182205 * Moving index to NFS if needed...
060305 182205 DONE indexing segment 20060305182203: total 0 records in 0.075 s (NaN rec/s).
060305 182205 done indexing
060305 182205 Reading url hashes...
060305 182205 Sorting url hashes...
060305 182205 Deleting url duplicates...
060305 182205 Deleted 0 url duplicates.
060305 182205 Reading content hashes...
060305 182205 Sorting content hashes...
060305 182205 Deleting content duplicates...
060305 182205 Deleted 0 content duplicates.
060305 182205 Duplicate deletion complete locally. Now returning to NFS...
060305 182205 DeleteDuplicates complete
060305 182205 Merging segment indexes...
060305 182205 crawl finished: crawl
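
Richard's check (look in the log for pages being fetched) can be done mechanically. What follows is only a rough Python sketch: the function name summarize and the SAMPLE list are made up for illustration, the sample entries are copied from the log above, and the "date time message" entry layout is assumed from how that log looks.

```python
import re

# One log entry: six-digit date, six-digit time, then the message.
ENTRY = re.compile(r"^\d{6} \d{6} (.*)$")


def summarize(lines):
    """Return (number of 'fetching' entries, names of plugins not included)."""
    fetched = 0
    skipped_plugins = []
    for line in lines:
        match = ENTRY.match(line.strip())
        if not match:
            continue  # not a timestamped entry
        message = match.group(1)
        if message.startswith("fetching "):
            fetched += 1
        elif message.startswith("not including: "):
            # Keep only the last path component, i.e. the plugin directory name.
            skipped_plugins.append(message.rsplit("/", 1)[-1])
    return fetched, skipped_plugins


# A few entries copied from the log in this message.
SAMPLE = [
    "060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-regex",
    "060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/04/vacation.html",
    "060305 182202 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/28/opml.html",
    "060305 182203 Processing document 0",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

To run it over a whole file, pass the open file instead of SAMPLE, for example summarize(open("crawl.log")).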
That's the entire log. Hope it helps! My crawl-urlfilter.txt:

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# accept hosts in any domain
+^http://([a-z0-9]*\.)*/

# skip everything else
-.

So, why isn't it fetching anything, if that is indeed the case?

--
Cheers,
Hasan Diwan <[EMAIL PROTECTED]>
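
The comments in the filter file describe a first-match rule: patterns are tried in file order, the first one that matches decides, and a URL that matches nothing is ignored. Below is a minimal Python sketch of that rule, not Nutch's own code. The names RULES and decide are invented for illustration; the patterns are copied from the file in this message, except the line shown as [EMAIL PROTECTED], which is left out because its original text is not recoverable here. It also assumes the patterns are applied with unanchored "find" semantics (Python's re.search) and that Python's regex syntax agrees with Java's for these particular patterns.

```python
import re

# (sign, pattern) pairs in file order, copied from the crawl-urlfilter.txt
# in this message. The line shown as [EMAIL PROTECTED] is omitted (unknown).
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$"),
    ("+", r"^http://([a-z0-9]*\.)*/"),
    ("-", r"."),
]


def decide(url, rules=RULES):
    """Return (accepted, deciding_pattern) using first-match semantics.

    accepted is True only when the first matching rule has a '+' sign.
    If no rule matches, the URL is ignored and the pattern is None.
    """
    for sign, pattern in rules:
        # Assumption: unanchored search, so '^' and '$' in a pattern do the anchoring.
        if re.search(pattern, url):
            return sign == "+", pattern
    return False, None


if __name__ == "__main__":
    # One of the URLs from the log in this message.
    url = "http://hasan.wits2020.net/~hdiwan/blog/2006/02/28/opml.html"
    print(decide(url))
```

Under those assumptions the sample URL does not match the '+' pattern, because that pattern requires the '/' to come directly after a '.', and so the URL falls through to the final "-." rule. Note that the log above lists urlfilter-regex under "not including", so this sketch says nothing about whether the file was actually consulted during this run.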
