Hey Hasan,

Crawling seems OK. Can you please try running org.apache.nutch.searcher.NutchBean [your-query-string] from the shell/cmd?
I guess it works.

/Jack

You fetched one single website, I think.

On 3/6/06, Hasan Diwan <[EMAIL PROTECTED]> wrote:
> Gentlemen:
> On 05/03/06, Richard Braman <[EMAIL PROTECTED]> wrote:
> > This sounds like your crawl didn't get anything. I have seen that
> > happen when the url wasn't added right, or the filter was bad. Pipe the
> > crawl to crawl.log and look in there. It should show some pages being
> > fetched. If none are being fetched, something is definitely wrong with
> > your filter or url file.
> 060305 182159 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-default.xml
> 060305 182200 parsing file:/home/hdiwan/nutch-0.7.1/conf/crawl-tool.xml
> 060305 182200 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-site.xml
> 060305 182200 No FS indicated, using default:local
> 060305 182200 crawl started in: crawl
> 060305 182200 rootUrlFile = urls
> 060305 182200 threads = 15
> 060305 182200 depth = 2
> 060305 182200 Created webdb at LocalFS,/home/hdiwan/SpectraSearch/crawl/db
> 060305 182200 Starting URL processing
> 060305 182200 Plugins: looking in: /home/hdiwan/nutch-0.7.1/build/plugins
> 060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/nutch-extensionpoints/plugin.xml
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-file
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-ftp
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-http
> 060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-httpclient/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.httpclient.Http
> 060305 182200 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.httpclient.Http
> 060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/parse-html/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-js
> 060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/parse-text/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-pdf
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-rss
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-msword
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-ext
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/index-basic
> 060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/index-more/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.more.MoreIndexingFilter
> 060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-basic/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
> 060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-more/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.more.TypeQueryFilter
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.more.DateQueryFilter
> 060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-site/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
> 060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-url/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-regex
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-prefix
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/creativecommons
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/language-identifier
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/clustering-carrot2
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/ontology
> 060305 182200 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
> 060305 182200 Added 15 pages
> 060305 182200 Processing pagesByURL: Sorted 15 instructions in 0.0070 seconds.
> 060305 182200 Processing pagesByURL: Sorted 2142.8571428571427 instructions/second
> 060305 182200 Processing pagesByURL: Merged to new DB containing 15 records in 0.0040 seconds
> 060305 182200 Processing pagesByURL: Merged 3750.0 records/second
> 060305 182200 Processing pagesByMD5: Sorted 15 instructions in 0.0040 seconds.
> 060305 182200 Processing pagesByMD5: Sorted 3750.0 instructions/second
> 060305 182200 Processing pagesByMD5: Merged to new DB containing 15 records in 0.0020 seconds
> 060305 182200 Processing pagesByMD5: Merged 7500.0 records/second
> 060305 182200 Processing linksByMD5: Copied file (4096 bytes) in 0.0050 secs.
> 060305 182200 Processing linksByURL: Copied file (4096 bytes) in 0.0030 secs.
> 060305 182200 FetchListTool started
> 060305 182201 Processing pagesByURL: Sorted 15 instructions in 0.0030 seconds.
> 060305 182201 Processing pagesByURL: Sorted 5000.0 instructions/second
> 060305 182201 Processing pagesByURL: Merged to new DB containing 15 records in 0.0040 seconds
> 060305 182201 Processing pagesByURL: Merged 3750.0 records/second
> 060305 182201 Processing pagesByMD5: Sorted 15 instructions in 0.0040 seconds.
> 060305 182201 Processing pagesByMD5: Sorted 3750.0 instructions/second
> 060305 182201 Processing pagesByMD5: Merged to new DB containing 15 records in 0.0030 seconds
> 060305 182201 Processing pagesByMD5: Merged 5000.0 records/second
> 060305 182201 Processing linksByMD5: Copied file (4096 bytes) in 0.0010 secs.
> 060305 182201 Processing linksByURL: Copied file (4096 bytes) in 0.0020 secs.
> 060305 182201 Processing /home/hdiwan/SpectraSearch/crawl/segments/20060305182200/fetchlist.unsorted: Sorted 15 entries in 0.0030 seconds.
> 060305 182201 Processing /home/hdiwan/SpectraSearch/crawl/segments/20060305182200/fetchlist.unsorted: Sorted 5000.0 entries/second
> 060305 182201 Overall processing: Sorted 15 entries in 0.0030 seconds.
> 060305 182201 Overall processing: Sorted 2.0E-4 entries/second
> 060305 182201 FetchListTool completed
> 060305 182201 logging at INFO
> 060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/04/vacation.html
> 060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/stand_up_speak_up.html
> 060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/murder_in_samarkand.html
> 060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/26/automating_photographic_workfl.html
> 060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/punching_at_the_sun_march_17.html
> 060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/directtv_videoondemand.html
> 060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/05/to_the_dear_neighbour_of_mine.html
> 060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/bsc_has_no_properties_really.html
> 060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/creative_commons_salon_march_8.html
> 060305 182201 http.proxy.host = null
> 060305 182201 http.proxy.port = 8118
> 060305 182201 http.timeout = 10000
> 060305 182201 http.content.limit = -1
> 060305 182201 http.agent = Spectra/200602 (Spectra; http://hasan.wits2020.net/typo/public; [EMAIL PROTECTED])
> 060305 182201 http.auth.ntlm.username =
> 060305 182201 fetcher.server.delay = 1000
> 060305 182201 http.max.delays = 100
> 060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/02/gmail_whinging.html
> 060305 182201 Configured Client
> 060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/nobody_likes_me.html
> 060305 182202 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/01/virtual_hosts_suck_pt_2.html
> 060305 182202 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/01/i_hate_hosting_providers.html
> 060305 182202 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/spectras_challenge.html
> 060305 182202 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/28/opml.html
> 060305 182203 Updating /home/hdiwan/SpectraSearch/crawl/db
> 060305 182203 Updating for /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182203 Processing document 0
> 060305 182203 Finishing update
> 060305 182203 Processing pagesByURL: Sorted 15 instructions in 0.0030 seconds.
> 060305 182203 Processing pagesByURL: Sorted 5000.0 instructions/second
> 060305 182203 Processing pagesByURL: Merged to new DB containing 15 records in 0.0040 seconds
> 060305 182203 Processing pagesByURL: Merged 3750.0 records/second
> 060305 182203 Processing pagesByMD5: Sorted 15 instructions in 0.0070 seconds.
> 060305 182203 Processing pagesByMD5: Sorted 2142.8571428571427 instructions/second
> 060305 182203 Processing pagesByMD5: Merged to new DB containing 15 records in 0.0070 seconds
> 060305 182203 Processing pagesByMD5: Merged 2142.8571428571427 records/second
> 060305 182203 Processing linksByMD5: Copied file (4096 bytes) in 0.0020 secs.
> 060305 182203 Processing linksByURL: Copied file (4096 bytes) in 0.0020 secs.
> 060305 182203 Update finished
> 060305 182203 FetchListTool started
> 060305 182203 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060305 182203 Overall processing: Sorted NaN entries/second
> 060305 182203 FetchListTool completed
> 060305 182203 logging at INFO
> 060305 182204 Updating /home/hdiwan/SpectraSearch/crawl/db
> 060305 182204 Updating for /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
> 060305 182204 Finishing update
> 060305 182204 Update finished
> 060305 182204 Updating /home/hdiwan/SpectraSearch/crawl/segments from /home/hdiwan/SpectraSearch/crawl/db
> 060305 182204 reading /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182204 reading /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
> 060305 182204 Sorting pages by url...
> 060305 182204 Getting updated scores and anchors from db...
> 060305 182204 Sorting updates by segment...
> 060305 182204 Updating segments...
> 060305 182204 updating /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182204 Done updating /home/hdiwan/SpectraSearch/crawl/segments from /home/hdiwan/SpectraSearch/crawl/db
> 060305 182204 indexing segment: /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182205 * Opening segment 20060305182200
> 060305 182205 * Indexing segment 20060305182200
> 060305 182205 * Optimizing index...
> 060305 182205 * Moving index to NFS if needed...
> 060305 182205 DONE indexing segment 20060305182200: total 15 records in 0.031 s (Infinity rec/s).
> 060305 182205 done indexing
> 060305 182205 indexing segment: /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
> 060305 182205 * Opening segment 20060305182203
> 060305 182205 * Indexing segment 20060305182203
> 060305 182205 * Optimizing index...
> 060305 182205 * Moving index to NFS if needed...
> 060305 182205 DONE indexing segment 20060305182203: total 0 records in 0.075 s (NaN rec/s).
> 060305 182205 done indexing
> 060305 182205 Reading url hashes...
> 060305 182205 Sorting url hashes...
> 060305 182205 Deleting url duplicates...
> 060305 182205 Deleted 0 url duplicates.
> 060305 182205 Reading content hashes...
> 060305 182205 Sorting content hashes...
> 060305 182205 Deleting content duplicates...
> 060305 182205 Deleted 0 content duplicates.
> 060305 182205 Duplicate deletion complete locally. Now returning to NFS...
> 060305 182205 DeleteDuplicates complete
> 060305 182205 Merging segment indexes...
> 060305 182205 crawl finished: crawl
>
> That's the entire log. Hope it helps! My crawl-urlfilter.txt:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # accept hosts in any domain
> +^http://([a-z0-9]*\.)*/
>
> # skip everything else
> -.
> So, why isn't it fetching anything, if that is indeed the case?
> --
> Cheers,
> Hasan Diwan <[EMAIL PROTECTED]>
>

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
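
The first-match rule described in the comments of the quoted crawl-urlfilter.txt can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `accepts` is made up here, Python's `re.search` stands in for whatever regex engine Nutch uses, and the rule shown as "[EMAIL PROTECTED]" in the archive is left out because its original pattern cannot be recovered from the post. The hypothetical `FIXED_RULES` variant follows the file's own hint to put a domain name into the accept pattern.

```python
import re

# Rules copied from the quoted crawl-urlfilter.txt, in file order.
# The line the archive shows as "[EMAIL PROTECTED]" is omitted (unrecoverable).
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$"),
    ("+", r"^http://([a-z0-9]*\.)*/"),
    ("-", r"."),
]

# Hypothetical variant: same rules, but the accept pattern names a domain,
# as the file's comment about MY.DOMAIN.NAME suggests.
FIXED_RULES = RULES[:2] + [("+", r"^http://([a-z0-9]*\.)*wits2020.net/")] + RULES[3:]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is '+', False otherwise."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no pattern matched: the URL is ignored


url = "http://hasan.wits2020.net/~hdiwan/blog/2006/02/28/opml.html"
print(accepts(url))               # prints False
print(accepts(url, FIXED_RULES))  # prints True
```

Under these assumptions the accept pattern as posted needs a "/" directly after a dot-terminated host label, so an ordinary URL falls through to the final "-." rule. Whether this file was consulted at all is a separate question: the log lists urlfilter-regex under "not including".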
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
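
Richard's suggestion to pipe the crawl to crawl.log and look for pages being fetched can be done with a short script instead of by eye. The sketch below is a minimal example, not part of Nutch: the name `summarize` and the `SAMPLE` excerpt are illustrative, and it assumes each log line has the "YYMMDD HHMMSS message" shape seen in the quoted output, optionally prefixed by email quote markers.

```python
import re

# Matches lines such as "> 060305 182201 fetching http://..." and captures
# the message part after the two six-digit timestamp fields.
LINE = re.compile(r"^(?:>\s*)*\d{6} \d{6} (.*)$")


def summarize(log_text):
    """Return (number of 'fetching' lines, names of plugins not included)."""
    fetched = 0
    skipped_plugins = []
    for raw in log_text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue
        message = match.group(1)
        if message.startswith("fetching "):
            fetched += 1
        elif message.startswith("not including: "):
            # Keep only the last path component, e.g. "urlfilter-regex".
            skipped_plugins.append(message.rsplit("/", 1)[-1])
    return fetched, skipped_plugins


# A few lines taken from the log quoted in this thread.
SAMPLE = """\
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-regex
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/index-basic
> 060305 182200 Added 15 pages
> 060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/04/vacation.html
> 060305 182202 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/28/opml.html
"""

fetched, skipped = summarize(SAMPLE)
print(fetched, skipped)
```

To run it on a real file, read crawl.log into a string and pass that to `summarize` in place of `SAMPLE`.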
