Hey Hasan

The crawl looks OK. Can you please try running
org.apache.nutch.searcher.NutchBean [your-query-string]
from the shell/cmd?

I'd guess the search will work.

/Jack

I think you only fetched pages from a single website.
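
For what it's worth, a rough way to check how many pages and hosts a crawl touched is to count the "fetching" entries in crawl.log. The Python sketch below is only an illustration: the name fetched_hosts and the sample text are made up, and it assumes each entry looks like the "fetching <url>" lines in the quoted log (possibly wrapped onto the next line, with or without a "> " quote marker).

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches "fetching <url>", tolerating a line wrap and a mail quote marker
# between the keyword and the URL (an assumption about the log layout).
FETCH_RE = re.compile(r"fetching\s+(?:>\s*)?(https?://\S+)")


def fetched_hosts(log_text):
    """Return a Counter mapping host name -> number of 'fetching' entries."""
    return Counter(urlsplit(url).hostname for url in FETCH_RE.findall(log_text))


if __name__ == "__main__":
    # Two entries copied from the quoted log, one wrapped and one unwrapped.
    sample = (
        "060305 182201 fetching\n"
        "http://hasan.wits2020.net/~hdiwan/blog/2006/03/04/vacation.html\n"
        "060305 182202 fetching "
        "http://hasan.wits2020.net/~hdiwan/blog/2006/02/28/opml.html\n"
    )
    hosts = fetched_hosts(sample)
    print(sum(hosts.values()), "fetch entries across", len(hosts), "host(s)")
    for host, count in hosts.most_common():
        print(host, count)
```

Running it over the whole crawl.log instead of the sample would show at a glance whether anything outside the seed host was fetched.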

On 3/6/06, Hasan Diwan <[EMAIL PROTECTED]> wrote:
> Gentlemen:
> On 05/03/06, Richard Braman <[EMAIL PROTECTED]> wrote:
> > This sounds like your crawl didn't get anything.  I have seen that
> > happen when the URL wasn't added right, or the filter was bad.  Pipe the
> > crawl to crawl.log and look in there.  It should show some pages being
> > fetched.  If none are being fetched, something is definitely wrong with
> > your filter or URL file.
> 060305 182159 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-default.xml
> 060305 182200 parsing file:/home/hdiwan/nutch-0.7.1/conf/crawl-tool.xml
> 060305 182200 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-site.xml
> 060305 182200 No FS indicated, using default:local
> 060305 182200 crawl started in: crawl
> 060305 182200 rootUrlFile = urls
> 060305 182200 threads = 15
> 060305 182200 depth = 2
> 060305 182200 Created webdb at LocalFS,/home/hdiwan/SpectraSearch/crawl/db
> 060305 182200 Starting URL processing
> 060305 182200 Plugins: looking in: /home/hdiwan/nutch-0.7.1/build/plugins
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/nutch-extensionpoints/plugin.xml
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/protocol-file
> 060305 182200 not including: 
> /home/hdiwan/nutch-0.7.1/build/plugins/protocol-ftp
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/protocol-http
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/protocol-httpclient/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.protocol.Protocol
> class=org.apache.nutch.protocol.httpclient.Http
> 060305 182200 impl: point=org.apache.nutch.protocol.Protocol
> class=org.apache.nutch.protocol.httpclient.Http
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/parse-html/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.parse.Parser
> class=org.apache.nutch.parse.html.HtmlParser
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-js
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/parse-text/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.parse.Parser
> class=org.apache.nutch.parse.text.TextParser
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-pdf
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-rss
> 060305 182200 not including: 
> /home/hdiwan/nutch-0.7.1/build/plugins/parse-msword
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-ext
> 060305 182200 not including: 
> /home/hdiwan/nutch-0.7.1/build/plugins/index-basic
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/index-more/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.indexer.IndexingFilter
> class=org.apache.nutch.indexer.more.MoreIndexingFilter
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/query-basic/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.basic.BasicQueryFilter
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/query-more/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.more.TypeQueryFilter
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.more.DateQueryFilter
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/query-site/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.site.SiteQueryFilter
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/query-url/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.url.URLQueryFilter
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-regex
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-prefix
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/creativecommons
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/language-identifier
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/clustering-carrot2
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/ontology
> 060305 182200 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
> 060305 182200 Added 15 pages
> 060305 182200 Processing pagesByURL: Sorted 15 instructions in 0.0070 seconds.
> 060305 182200 Processing pagesByURL: Sorted 2142.8571428571427
> instructions/second
> 060305 182200 Processing pagesByURL: Merged to new DB containing 15
> records in 0.0040 seconds
> 060305 182200 Processing pagesByURL: Merged 3750.0 records/second
> 060305 182200 Processing pagesByMD5: Sorted 15 instructions in 0.0040 seconds.
> 060305 182200 Processing pagesByMD5: Sorted 3750.0 instructions/second
> 060305 182200 Processing pagesByMD5: Merged to new DB containing 15
> records in 0.0020 seconds
> 060305 182200 Processing pagesByMD5: Merged 7500.0 records/second
> 060305 182200 Processing linksByMD5: Copied file (4096 bytes) in 0.0050 secs.
> 060305 182200 Processing linksByURL: Copied file (4096 bytes) in 0.0030 secs.
> 060305 182200 FetchListTool started
> 060305 182201 Processing pagesByURL: Sorted 15 instructions in 0.0030 seconds.
> 060305 182201 Processing pagesByURL: Sorted 5000.0 instructions/second
> 060305 182201 Processing pagesByURL: Merged to new DB containing 15
> records in 0.0040 seconds
> 060305 182201 Processing pagesByURL: Merged 3750.0 records/second
> 060305 182201 Processing pagesByMD5: Sorted 15 instructions in 0.0040 seconds.
> 060305 182201 Processing pagesByMD5: Sorted 3750.0 instructions/second
> 060305 182201 Processing pagesByMD5: Merged to new DB containing 15
> records in 0.0030 seconds
> 060305 182201 Processing pagesByMD5: Merged 5000.0 records/second
> 060305 182201 Processing linksByMD5: Copied file (4096 bytes) in 0.0010 secs.
> 060305 182201 Processing linksByURL: Copied file (4096 bytes) in 0.0020 secs.
> 060305 182201 Processing
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182200/fetchlist.unsorted:
> Sorted 15 entries in 0.0030 seconds.
> 060305 182201 Processing
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182200/fetchlist.unsorted:
> Sorted 5000.0 entries/second
> 060305 182201 Overall processing: Sorted 15 entries in 0.0030 seconds.
> 060305 182201 Overall processing: Sorted 2.0E-4 entries/second
> 060305 182201 FetchListTool completed
> 060305 182201 logging at INFO
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/04/vacation.html
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/stand_up_speak_up.html
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/murder_in_samarkand.html
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/02/26/automating_photographic_workfl.html
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/punching_at_the_sun_march_17.html
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/directtv_videoondemand.html
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/05/to_the_dear_neighbour_of_mine.html
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/bsc_has_no_properties_really.html
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/creative_commons_salon_march_8.html
> 060305 182201 http.proxy.host = null
> 060305 182201 http.proxy.port = 8118
> 060305 182201 http.timeout = 10000
> 060305 182201 http.content.limit = -1
> 060305 182201 http.agent = Spectra/200602 (Spectra;
> http://hasan.wits2020.net/typo/public; [EMAIL PROTECTED])
> 060305 182201 http.auth.ntlm.username =
> 060305 182201 fetcher.server.delay = 1000
> 060305 182201 http.max.delays = 100
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/02/gmail_whinging.html
> 060305 182201 Configured Client
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/nobody_likes_me.html
> 060305 182202 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/01/virtual_hosts_suck_pt_2.html
> 060305 182202 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/01/i_hate_hosting_providers.html
> 060305 182202 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/spectras_challenge.html
> 060305 182202 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/02/28/opml.html
> 060305 182203 Updating /home/hdiwan/SpectraSearch/crawl/db
> 060305 182203 Updating for
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182203 Processing document 0
> 060305 182203 Finishing update
> 060305 182203 Processing pagesByURL: Sorted 15 instructions in 0.0030 seconds.
> 060305 182203 Processing pagesByURL: Sorted 5000.0 instructions/second
> 060305 182203 Processing pagesByURL: Merged to new DB containing 15
> records in 0.0040 seconds
> 060305 182203 Processing pagesByURL: Merged 3750.0 records/second
> 060305 182203 Processing pagesByMD5: Sorted 15 instructions in 0.0070 seconds.
> 060305 182203 Processing pagesByMD5: Sorted 2142.8571428571427
> instructions/second
> 060305 182203 Processing pagesByMD5: Merged to new DB containing 15
> records in 0.0070 seconds
> 060305 182203 Processing pagesByMD5: Merged 2142.8571428571427 records/second
> 060305 182203 Processing linksByMD5: Copied file (4096 bytes) in 0.0020 secs.
> 060305 182203 Processing linksByURL: Copied file (4096 bytes) in 0.0020 secs.
> 060305 182203 Update finished
> 060305 182203 FetchListTool started
> 060305 182203 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060305 182203 Overall processing: Sorted NaN entries/second
> 060305 182203 FetchListTool completed
> 060305 182203 logging at INFO
> 060305 182204 Updating /home/hdiwan/SpectraSearch/crawl/db
> 060305 182204 Updating for
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
> 060305 182204 Finishing update
> 060305 182204 Update finished
> 060305 182204 Updating /home/hdiwan/SpectraSearch/crawl/segments from
> /home/hdiwan/SpectraSearch/crawl/db
> 060305 182204  reading 
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182204  reading 
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
> 060305 182204 Sorting pages by url...
> 060305 182204 Getting updated scores and anchors from db...
> 060305 182204 Sorting updates by segment...
> 060305 182204 Updating segments...
> 060305 182204  updating 
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182204 Done updating /home/hdiwan/SpectraSearch/crawl/segments
> from /home/hdiwan/SpectraSearch/crawl/db
> 060305 182204 indexing segment:
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182205 * Opening segment 20060305182200
> 060305 182205 * Indexing segment 20060305182200
> 060305 182205 * Optimizing index...
> 060305 182205 * Moving index to NFS if needed...
> 060305 182205 DONE indexing segment 20060305182200: total 15 records
> in 0.031 s (Infinity rec/s).
> 060305 182205 done indexing
> 060305 182205 indexing segment:
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
> 060305 182205 * Opening segment 20060305182203
> 060305 182205 * Indexing segment 20060305182203
> 060305 182205 * Optimizing index...
> 060305 182205 * Moving index to NFS if needed...
> 060305 182205 DONE indexing segment 20060305182203: total 0 records in
> 0.075 s (NaN rec/s).
> 060305 182205 done indexing
> 060305 182205 Reading url hashes...
> 060305 182205 Sorting url hashes...
> 060305 182205 Deleting url duplicates...
> 060305 182205 Deleted 0 url duplicates.
> 060305 182205 Reading content hashes...
> 060305 182205 Sorting content hashes...
> 060305 182205 Deleting content duplicates...
> 060305 182205 Deleted 0 content duplicates.
> 060305 182205 Duplicate deletion complete locally.  Now returning to NFS...
> 060305 182205 DeleteDuplicates complete
> 060305 182205 Merging segment indexes...
> 060305 182205 crawl finished: crawl
>
> That's the entire log. Hope it helps! My crawl-urlfilter.txt:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # accept hosts in any domain
> +^http://([a-z0-9]*\.)*/
>
> # skip everything else
> -.
> So, why isn't it fetching anything, if that is indeed the case?
> --
> Cheers,
> Hasan Diwan <[EMAIL PROTECTED]>
>
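
In case it helps to test patterns offline, here is a minimal Python sketch of the first-match-wins rule described in the comments of the quoted crawl-urlfilter.txt. Assumptions: Python's re.search stands in for Nutch's own regex engine, which may differ in details; load_rules and accepts are hypothetical helper names; the "probable queries" rule is left out because the archive replaced it with [EMAIL PROTECTED]; and the wits2020.net pattern is only a hypothetical edit of the kind the "change MY.DOMAIN.NAME" comment asks for.

```python
import re

# Rules copied from the quoted filter file (obfuscated rule omitted).
POSTED = r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
+^http://([a-z0-9]*\.)*/
-.
"""

# Hypothetical variant with a domain name filled in after the subdomain part.
EDITED = POSTED.replace(
    r"+^http://([a-z0-9]*\.)*/",
    r"+^http://([a-z0-9]*\.)*wits2020\.net/",
)


def load_rules(text):
    """Parse '+regex' / '-regex' lines into (include, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first matching rule decides; a URL matching no rule is ignored."""
    for include, regex in rules:
        if regex.search(url):
            return include
    return False


if __name__ == "__main__":
    url = "http://hasan.wits2020.net/~hdiwan/blog/2006/02/28/opml.html"
    print("as posted:", accepts(load_rules(POSTED), url))
    print("with domain:", accepts(load_rules(EDITED), url))
```

Under those assumptions the accept pattern as posted does not match the seed host, because nothing stands between the optional subdomain labels and the slash. Note also that the quoted log lists urlfilter-regex under "not including", so this file may not have been applied in that run at all.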


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
