Gentlemen:
On 05/03/06, Richard Braman <[EMAIL PROTECTED]> wrote:
> This sounds like your crawl didn't get anything.  I have seen that
> happen when the url wasn't added right, or the filter was bad.  Pipe the
> crawl to crawl.log and look in there.  It should show some pages being
> fetched.  If none are being fetched, something is definitely wrong with
> your filter or url file.
060305 182159 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-default.xml
060305 182200 parsing file:/home/hdiwan/nutch-0.7.1/conf/crawl-tool.xml
060305 182200 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-site.xml
060305 182200 No FS indicated, using default:local
060305 182200 crawl started in: crawl
060305 182200 rootUrlFile = urls
060305 182200 threads = 15
060305 182200 depth = 2
060305 182200 Created webdb at LocalFS,/home/hdiwan/SpectraSearch/crawl/db
060305 182200 Starting URL processing
060305 182200 Plugins: looking in: /home/hdiwan/nutch-0.7.1/build/plugins
060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/nutch-extensionpoints/plugin.xml
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-file
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-ftp
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-http
060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-httpclient/plugin.xml
060305 182200 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.httpclient.Http
060305 182200 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.httpclient.Http
060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/parse-html/plugin.xml
060305 182200 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-js
060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/parse-text/plugin.xml
060305 182200 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-pdf
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-rss
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-msword
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-ext
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/index-basic
060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/index-more/plugin.xml
060305 182200 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.more.MoreIndexingFilter
060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-basic/plugin.xml
060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-more/plugin.xml
060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.more.TypeQueryFilter
060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.more.DateQueryFilter
060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-site/plugin.xml
060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060305 182200 parsing: /home/hdiwan/nutch-0.7.1/build/plugins/query-url/plugin.xml
060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-regex
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-prefix
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/creativecommons
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/language-identifier
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/clustering-carrot2
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/ontology
060305 182200 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060305 182200 Added 15 pages
060305 182200 Processing pagesByURL: Sorted 15 instructions in 0.0070 seconds.
060305 182200 Processing pagesByURL: Sorted 2142.8571428571427 instructions/second
060305 182200 Processing pagesByURL: Merged to new DB containing 15 records in 0.0040 seconds
060305 182200 Processing pagesByURL: Merged 3750.0 records/second
060305 182200 Processing pagesByMD5: Sorted 15 instructions in 0.0040 seconds.
060305 182200 Processing pagesByMD5: Sorted 3750.0 instructions/second
060305 182200 Processing pagesByMD5: Merged to new DB containing 15 records in 0.0020 seconds
060305 182200 Processing pagesByMD5: Merged 7500.0 records/second
060305 182200 Processing linksByMD5: Copied file (4096 bytes) in 0.0050 secs.
060305 182200 Processing linksByURL: Copied file (4096 bytes) in 0.0030 secs.
060305 182200 FetchListTool started
060305 182201 Processing pagesByURL: Sorted 15 instructions in 0.0030 seconds.
060305 182201 Processing pagesByURL: Sorted 5000.0 instructions/second
060305 182201 Processing pagesByURL: Merged to new DB containing 15 records in 0.0040 seconds
060305 182201 Processing pagesByURL: Merged 3750.0 records/second
060305 182201 Processing pagesByMD5: Sorted 15 instructions in 0.0040 seconds.
060305 182201 Processing pagesByMD5: Sorted 3750.0 instructions/second
060305 182201 Processing pagesByMD5: Merged to new DB containing 15 records in 0.0030 seconds
060305 182201 Processing pagesByMD5: Merged 5000.0 records/second
060305 182201 Processing linksByMD5: Copied file (4096 bytes) in 0.0010 secs.
060305 182201 Processing linksByURL: Copied file (4096 bytes) in 0.0020 secs.
060305 182201 Processing /home/hdiwan/SpectraSearch/crawl/segments/20060305182200/fetchlist.unsorted: Sorted 15 entries in 0.0030 seconds.
060305 182201 Processing /home/hdiwan/SpectraSearch/crawl/segments/20060305182200/fetchlist.unsorted: Sorted 5000.0 entries/second
060305 182201 Overall processing: Sorted 15 entries in 0.0030 seconds.
060305 182201 Overall processing: Sorted 2.0E-4 entries/second
060305 182201 FetchListTool completed
060305 182201 logging at INFO
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/04/vacation.html
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/stand_up_speak_up.html
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/murder_in_samarkand.html
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/26/automating_photographic_workfl.html
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/punching_at_the_sun_march_17.html
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/directtv_videoondemand.html
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/05/to_the_dear_neighbour_of_mine.html
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/bsc_has_no_properties_really.html
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/creative_commons_salon_march_8.html
060305 182201 http.proxy.host = null
060305 182201 http.proxy.port = 8118
060305 182201 http.timeout = 10000
060305 182201 http.content.limit = -1
060305 182201 http.agent = Spectra/200602 (Spectra; http://hasan.wits2020.net/typo/public; [EMAIL PROTECTED])
060305 182201 http.auth.ntlm.username =
060305 182201 fetcher.server.delay = 1000
060305 182201 http.max.delays = 100
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/02/gmail_whinging.html
060305 182201 Configured Client
060305 182201 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/nobody_likes_me.html
060305 182202 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/01/virtual_hosts_suck_pt_2.html
060305 182202 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/01/i_hate_hosting_providers.html
060305 182202 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/spectras_challenge.html
060305 182202 fetching http://hasan.wits2020.net/~hdiwan/blog/2006/02/28/opml.html
060305 182203 Updating /home/hdiwan/SpectraSearch/crawl/db
060305 182203 Updating for /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
060305 182203 Processing document 0
060305 182203 Finishing update
060305 182203 Processing pagesByURL: Sorted 15 instructions in 0.0030 seconds.
060305 182203 Processing pagesByURL: Sorted 5000.0 instructions/second
060305 182203 Processing pagesByURL: Merged to new DB containing 15 records in 0.0040 seconds
060305 182203 Processing pagesByURL: Merged 3750.0 records/second
060305 182203 Processing pagesByMD5: Sorted 15 instructions in 0.0070 seconds.
060305 182203 Processing pagesByMD5: Sorted 2142.8571428571427 instructions/second
060305 182203 Processing pagesByMD5: Merged to new DB containing 15 records in 0.0070 seconds
060305 182203 Processing pagesByMD5: Merged 2142.8571428571427 records/second
060305 182203 Processing linksByMD5: Copied file (4096 bytes) in 0.0020 secs.
060305 182203 Processing linksByURL: Copied file (4096 bytes) in 0.0020 secs.
060305 182203 Update finished
060305 182203 FetchListTool started
060305 182203 Overall processing: Sorted 0 entries in 0.0 seconds.
060305 182203 Overall processing: Sorted NaN entries/second
060305 182203 FetchListTool completed
060305 182203 logging at INFO
060305 182204 Updating /home/hdiwan/SpectraSearch/crawl/db
060305 182204 Updating for /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
060305 182204 Finishing update
060305 182204 Update finished
060305 182204 Updating /home/hdiwan/SpectraSearch/crawl/segments from /home/hdiwan/SpectraSearch/crawl/db
060305 182204  reading /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
060305 182204  reading /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
060305 182204 Sorting pages by url...
060305 182204 Getting updated scores and anchors from db...
060305 182204 Sorting updates by segment...
060305 182204 Updating segments...
060305 182204  updating /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
060305 182204 Done updating /home/hdiwan/SpectraSearch/crawl/segments from /home/hdiwan/SpectraSearch/crawl/db
060305 182204 indexing segment: /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
060305 182205 * Opening segment 20060305182200
060305 182205 * Indexing segment 20060305182200
060305 182205 * Optimizing index...
060305 182205 * Moving index to NFS if needed...
060305 182205 DONE indexing segment 20060305182200: total 15 records in 0.031 s (Infinity rec/s).
060305 182205 done indexing
060305 182205 indexing segment: /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
060305 182205 * Opening segment 20060305182203
060305 182205 * Indexing segment 20060305182203
060305 182205 * Optimizing index...
060305 182205 * Moving index to NFS if needed...
060305 182205 DONE indexing segment 20060305182203: total 0 records in 0.075 s (NaN rec/s).
060305 182205 done indexing
060305 182205 done indexing
060305 182205 Reading url hashes...
060305 182205 Sorting url hashes...
060305 182205 Deleting url duplicates...
060305 182205 Deleted 0 url duplicates.
060305 182205 Reading content hashes...
060305 182205 Sorting content hashes...
060305 182205 Deleting content duplicates...
060305 182205 Deleted 0 content duplicates.
060305 182205 Duplicate deletion complete locally.  Now returning to NFS...
060305 182205 DeleteDuplicates complete
060305 182205 Merging segment indexes...
060305 182205 crawl finished: crawl

That's the entire log. Hope it helps! My crawl-urlfilter.txt:
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in any domain
+^http://([a-z0-9]*\.)*/

# skip everything else
-.
So, why isn't it fetching anything, if that is indeed the case?
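
For reference, here is a minimal sketch of the first-match rule described in the filter file's comments, applied to one of the URLs from the log. It is only an illustration, not Nutch's own code: it assumes a rule matches when its pattern is found anywhere in the URL (Python's re.search stands in for whatever matcher Nutch uses), it leaves out the "certain characters" rule, and the HYPOTHETICAL list with wits2020.net filled in is my own variant, not something from the posted file.

```python
import re

# Rules copied from the crawl-urlfilter.txt above, in file order.
# The "certain characters" rule is left out of this sketch.
RULES = [
    r"-^(file|ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$",
    r"+^http://([a-z0-9]*\.)*/",
    r"-.",
]

# Hypothetical variant: same rules, but with a domain name in the '+' rule.
HYPOTHETICAL = RULES[:2] + [r"+^http://([a-z0-9]*\.)*wits2020.net/"] + RULES[3:]


def accepts(rules, url):
    """Return True if the first rule whose pattern matches url is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        # Assumption: "matches" means the pattern is found anywhere in the URL.
        if re.search(pattern, url):
            return sign == "+"
    return False  # no pattern matched, so the URL is ignored


url = "http://hasan.wits2020.net/~hdiwan/blog/2006/03/04/vacation.html"
print(accepts(RULES, url))         # False
print(accepts(HYPOTHETICAL, url))  # True
```

Under those assumptions the '+' rule as posted never matches this URL, because it expects a '/' right after the last dot of the host name, so the URL falls through to the final '-.' rule. Whether the filter is actually consulted in this run is a separate question, since the log lists urlfilter-regex under "not including".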
--
Cheers,
Hasan Diwan <[EMAIL PROTECTED]>
