It seems that the robots.txt at libraries.mit.edu has a lot of restrictions.
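
One way to test whether robots.txt rules block a given URL is sketched below, using Python's standard urllib.robotparser. The rules in SAMPLE_ROBOTS are hypothetical and are NOT the actual contents of the libraries.mit.edu robots.txt; the helper name is_allowed is also made up for illustration. The agent name PHFAWS and the URL come from the http.agent and "fetching" lines in the quoted log. Nutch has its own robots.txt handling, so this is only a rough cross-check, not a reproduction of what Nutch does.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical rules for illustration only; replace with the text of the
# real robots.txt you want to check.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /archives/
"""

if __name__ == "__main__":
    url = ("http://libraries.mit.edu/archives/research/collections/"
           "collections-mc/mc1.html")
    # Prints whether the sample rules would let the crawler fetch the URL.
    print(is_allowed(SAMPLE_ROBOTS, "PHFAWS", url))
```

To check the live file instead, RobotFileParser also offers set_url() and read(), which download the robots.txt before can_fetch() is called.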

Alex.

 

-----Original Message-----
From: Chip Calhoun <ccalh...@aip.org>
To: user <user@nutch.apache.org>; 'markus.jel...@openindex.io' 
<markus.jel...@openindex.io>
Sent: Tue, Dec 20, 2011 7:28 am
Subject: RE: Can't crawl a domain; can't figure out why.


I just compared this against a similar crawl of a completely different domain 
which I know works, and you're right on both counts. The parser doesn't parse a 
file, and nothing is sent to the solrindexer. I tried a crawl with more 
documents and found that while I can get documents from mit.edu, I get 
absolutely nothing from libraries.mit.edu. I get the same effect using Nutch 1.3
as well.

I don't think we're dealing with truncated files. I'm willing to believe it's a 
parse error, but how could I tell? I've spoken with some helpful people from 
MIT, and they don't see a reason why this wouldn't work.

Chip

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Monday, December 19, 2011 5:01 PM
To: user@nutch.apache.org
Subject: Re: Can't crawl a domain; can't figure out why.

Nothing peculiar. Looks like Nutch 1.4, right? But you also didn't mention the 
domain you can't crawl. libraries.mit.edu seems to work, although the indexer 
doesn't seem to send a document in and the parser doesn't mention parsing that 
file.

Either the file throws a parse error, or it is truncated, or ...

> I'm trying to crawl pages from a number of domains, and one of these 
> domains has been giving me trouble. The really irritating thing is 
> that it did work at least once, which led me to believe that I'd 
> solved the problem. I can't think of anything at this point but to 
> paste my log of a failed crawl and solrindex and hope that someone can 
> think of anything I've overlooked. Does anything look strange here?
> 
> Thanks,
> Chip
> 
> 2011-12-19 16:31:01,010 WARN  crawl.Crawl - solrUrl is not set, 
> indexing will be skipped... 2011-12-19 16:31:01,404 INFO  crawl.Crawl 
> - crawl started in: mit-c-crawl 2011-12-19 16:31:01,420 INFO  
> crawl.Crawl - rootUrlDir = mit-c-urls 2011-12-19 16:31:01,420 INFO  
> crawl.Crawl - threads = 10
> 2011-12-19 16:31:01,420 INFO  crawl.Crawl - depth = 1
> 2011-12-19 16:31:01,420 INFO  crawl.Crawl - solrUrl=null
> 2011-12-19 16:31:01,420 INFO  crawl.Crawl - topN = 500000
> 2011-12-19 16:31:01,420 INFO  crawl.Injector - Injector: starting at
> 2011-12-19 16:31:01 2011-12-19 16:31:01,420 INFO  crawl.Injector -
> Injector: crawlDb: mit-c-crawl/crawldb 2011-12-19 16:31:01,420 INFO 
> crawl.Injector - Injector: urlDir: mit-c-urls 2011-12-19 16:31:01,436 
> INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
> 2011-12-19 16:31:02,854 INFO  plugin.PluginRepository - Plugins: 
> looking
> in: C:\Apache\apache-nutch-1.4\runtime\local\plugins 2011-12-19
> 16:31:02,917 INFO  plugin.PluginRepository - Plugin Auto-activation mode:
> [true] 2011-12-19 16:31:02,917 INFO  plugin.PluginRepository - Registered
> Plugins: 2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -          
>      the nutch core extension points (nutch-extensionpoints) 2011-12-19
> 16:31:02,917 INFO  plugin.PluginRepository -                Basic URL
> Normalizer (urlnormalizer-basic) 2011-12-19 16:31:02,917 INFO 
> plugin.PluginRepository -                Html Parse Plug-in (parse-html)
> 2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -               
> Basic Indexing Filter (index-basic) 2011-12-19 16:31:02,917 INFO 
> plugin.PluginRepository -                Http / Https Protocol Plug-in
> (protocol-httpclient) 2011-12-19 16:31:02,917 INFO 
> plugin.PluginRepository -                HTTP Framework (lib-http)
> 2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -               
> Regex URL Filter (urlfilter-regex) 2011-12-19 16:31:02,917 INFO 
> plugin.PluginRepository -                Pass-through URL Normalizer
> (urlnormalizer-pass) 2011-12-19 16:31:02,917 INFO  plugin.PluginRepository
> -                Http Protocol Plug-in (protocol-http) 2011-12-19
> 16:31:02,917 INFO  plugin.PluginRepository -                Regex URL
> Normalizer (urlnormalizer-regex) 2011-12-19 16:31:02,917 INFO 
> plugin.PluginRepository -                Tika Parser Plug-in (parse-tika)
> 2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -               
> OPIC Scoring Plug-in (scoring-opic) 2011-12-19 16:31:02,917 INFO 
> plugin.PluginRepository -                CyberNeko HTML Parser
> (lib-nekohtml) 2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -    
>            Anchor Indexing Filter (index-anchor) 2011-12-19 16:31:02,917
> INFO  plugin.PluginRepository -                URL Meta Indexing Filter
> (urlmeta) 2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -         
>       Regex URL Filter Framework (lib-regex-filter) 2011-12-19
> 16:31:02,917 INFO  plugin.PluginRepository - Registered Extension-Points:
> 2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -               
> Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer) 2011-12-19
> 16:31:02,917 INFO  plugin.PluginRepository -                Nutch Protocol
> (org.apache.nutch.protocol.Protocol) 2011-12-19 16:31:02,917 INFO 
> plugin.PluginRepository -                Nutch Segment Merge Filter
> (org.apache.nutch.segment.SegmentMergeFilter) 2011-12-19 16:31:02,917 INFO
>  plugin.PluginRepository -                Nutch URL Filter
> (org.apache.nutch.net.URLFilter) 2011-12-19 16:31:02,917 INFO 
> plugin.PluginRepository -                Nutch Indexing Filter
> (org.apache.nutch.indexer.IndexingFilter) 2011-12-19 16:31:02,917 INFO 
> plugin.PluginRepository -                HTML Parse Filter
> (org.apache.nutch.parse.HtmlParseFilter) 2011-12-19 16:31:02,917 INFO 
> plugin.PluginRepository -                Nutch Content Parser
> (org.apache.nutch.parse.Parser) 2011-12-19 16:31:02,917 INFO 
> plugin.PluginRepository -                Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter) 2011-12-19 16:31:02,964 INFO 
> regex.RegexURLNormalizer - can't find rules for scope 'inject', using 
> default 2011-12-19 16:31:05,722 INFO  crawl.Injector - Injector: 
> Merging injected urls into crawl db. 2011-12-19 16:31:07,014 WARN 
> util.NativeCodeLoader - Unable to load native-hadoop library for your 
> platform... using builtin-java classes where applicable 2011-12-19
> 16:31:07,897 INFO  crawl.Injector - Injector: finished at 2011-12-19 
> 16:31:07, elapsed: 00:00:06 2011-12-19 16:31:07,913 INFO  
> crawl.Generator
> - Generator: starting at 2011-12-19 16:31:07 2011-12-19 16:31:07,913 
> INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> 2011-12-19 16:31:07,913 INFO  crawl.Generator - Generator: filtering: 
> true
> 2011-12-19 16:31:07,913 INFO  crawl.Generator - Generator: normalizing:
> true 2011-12-19 16:31:07,913 INFO  crawl.Generator - Generator: topN:
> 500000 2011-12-19 16:31:07,913 INFO  crawl.Generator - Generator:
> jobtracker is 'local', generating exactly one partition. 2011-12-19
> 16:31:09,157 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl:
> org.apache.nutch.crawl.DefaultFetchSchedule 2011-12-19 16:31:09,157 
> INFO crawl.AbstractFetchSchedule - defaultInterval=2592000 2011-12-19
> 16:31:09,157 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
> 2011-12-19 16:31:09,157 INFO  regex.RegexURLNormalizer - can't find 
> rules for scope 'partition', using default 2011-12-19 16:31:09,189 
> INFO crawl.FetchScheduleFactory - Using FetchSchedule impl:
> org.apache.nutch.crawl.DefaultFetchSchedule 2011-12-19 16:31:09,189 
> INFO crawl.AbstractFetchSchedule - defaultInterval=2592000 2011-12-19
> 16:31:09,189 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
> 2011-12-19 16:31:09,189 INFO  regex.RegexURLNormalizer - can't find 
> rules for scope 'generate_host_count', using default 2011-12-19 
> 16:31:10,071 INFO  crawl.Generator - Generator: Partitioning selected 
> urls for politeness. 2011-12-19 16:31:11,080 INFO  crawl.Generator - 
> Generator:
> segment: mit-c-crawl/segments/20111219163111 2011-12-19 16:31:12,309 
> INFO regex.RegexURLNormalizer - can't find rules for scope 
> 'partition', using default 2011-12-19 16:31:13,223 INFO  crawl.Generator - 
> Generator:
> finished at 2011-12-19 16:31:13, elapsed: 00:00:05 2011-12-19 
> 16:31:13,239 INFO  fetcher.Fetcher - Fetcher: starting at 2011-12-19 
> 16:31:13
> 2011-12-19 16:31:13,239 INFO  fetcher.Fetcher - Fetcher: segment:
> mit-c-crawl/segments/20111219163111 2011-12-19 16:31:14,515 INFO 
> fetcher.Fetcher - Using queue mode : byHost 2011-12-19 16:31:14,515 
> INFO fetcher.Fetcher - Fetcher: threads: 10 2011-12-19 16:31:14,515 
> INFO fetcher.Fetcher - Fetcher: time-out divisor: 2 2011-12-19 
> 16:31:14,515 INFO  fetcher.Fetcher - QueueFeeder finished: total 1 
> records + hit by time limit :0 2011-12-19 16:31:14,531 INFO  
> fetcher.Fetcher - Using queue mode : byHost 2011-12-19 16:31:14,531 
> INFO  fetcher.Fetcher - Using queue mode : byHost 2011-12-19 
> 16:31:14,531 INFO  fetcher.Fetcher - fetching 
> http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html
> 2011-12-19 16:31:14,531 INFO  fetcher.Fetcher - Using queue mode :
> byHost 2011-12-19 16:31:14,531 INFO  fetcher.Fetcher - -finishing 
> thread FetcherThread, activeThreads=1 2011-12-19 16:31:14,531 INFO 
> fetcher.Fetcher - Using queue mode : byHost 2011-12-19 16:31:14,531 
> INFO fetcher.Fetcher - -finishing thread FetcherThread, 
> activeThreads=1
> 2011-12-19 16:31:14,531 INFO  fetcher.Fetcher - -finishing thread 
> FetcherThread, activeThreads=1 2011-12-19 16:31:14,531 INFO 
> fetcher.Fetcher - Using queue mode : byHost 2011-12-19 16:31:14,531 
> INFO fetcher.Fetcher - Using queue mode : byHost 2011-12-19 
> 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, 
> activeThreads=1
> 2011-12-19 16:31:14,531 INFO  fetcher.Fetcher - -finishing thread 
> FetcherThread, activeThreads=1 2011-12-19 16:31:14,531 INFO 
> fetcher.Fetcher - Using queue mode : byHost 2011-12-19 16:31:14,531 
> INFO fetcher.Fetcher - Using queue mode : byHost 2011-12-19 
> 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, 
> activeThreads=1
> 2011-12-19 16:31:14,531 INFO  fetcher.Fetcher - -finishing thread 
> FetcherThread, activeThreads=1 2011-12-19 16:31:14,531 INFO 
> fetcher.Fetcher - Using queue mode : byHost 2011-12-19 16:31:14,531 
> INFO fetcher.Fetcher - Using queue mode : byHost 2011-12-19 
> 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, 
> activeThreads=1
> 2011-12-19 16:31:14,531 INFO  fetcher.Fetcher - Fetcher: throughput
> threshold: -1 2011-12-19 16:31:14,531 INFO  fetcher.Fetcher - 
> -finishing thread FetcherThread, activeThreads=1 2011-12-19 
> 16:31:14,531 INFO fetcher.Fetcher - Fetcher: throughput threshold 
> retries: 5 2011-12-19
> 16:31:14,562 INFO  httpclient.Http - http.proxy.host = null 2011-12-19
> 16:31:14,562 INFO  httpclient.Http - http.proxy.port = 8080 2011-12-19
> 16:31:14,562 INFO  httpclient.Http - http.timeout = 10000 2011-12-19
> 16:31:14,562 INFO  httpclient.Http - http.content.limit = -1 
> 2011-12-19
> 16:31:14,562 INFO  httpclient.Http - http.agent = PHFAWS/Nutch-1.3 
> (American Institute of Physics: Physics History Finding Aids Web Site; 
> http://www.aip.org/history/nbl/findingaids.html; ccalh...@aip.org)
> 2011-12-19 16:31:14,562 INFO  httpclient.Http - http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3 2011-12-19 16:31:14,799 INFO  
> fetcher.Fetcher
> - -finishing thread FetcherThread, activeThreads=0 2011-12-19 
> 16:31:15,539 INFO  fetcher.Fetcher - -activeThreads=0, spinWaiting=0,
> fetchQueues.totalSize=0 2011-12-19 16:31:15,539 INFO  fetcher.Fetcher 
> -
> -activeThreads=0
> 2011-12-19 16:31:16,390 INFO  fetcher.Fetcher - Fetcher: finished at
> 2011-12-19 16:31:16, elapsed: 00:00:03 2011-12-19 16:31:16,390 INFO 
> parse.ParseSegment - ParseSegment: starting at 2011-12-19 16:31:16
> 2011-12-19 16:31:16,390 INFO  parse.ParseSegment - ParseSegment: segment:
> mit-c-crawl/segments/20111219163111 2011-12-19 16:31:18,533 INFO 
> parse.ParseSegment - ParseSegment: finished at 2011-12-19 16:31:18,
> elapsed: 00:00:02 2011-12-19 16:31:18,549 INFO  crawl.CrawlDb - 
> CrawlDb
> update: starting at 2011-12-19 16:31:18 2011-12-19 16:31:18,549 INFO 
> crawl.CrawlDb - CrawlDb update: db: mit-c-crawl/crawldb 2011-12-19
> 16:31:18,549 INFO  crawl.CrawlDb - CrawlDb update: segments:
> [mit-c-crawl/segments/20111219163111] 2011-12-19 16:31:18,549 INFO 
> crawl.CrawlDb - CrawlDb update: additions allowed: true 2011-12-19
> 16:31:18,549 INFO  crawl.CrawlDb - CrawlDb update: URL normalizing: 
> true
> 2011-12-19 16:31:18,549 INFO  crawl.CrawlDb - CrawlDb update: URL
> filtering: true 2011-12-19 16:31:18,549 INFO  crawl.CrawlDb - CrawlDb
> update: 404 purging: false 2011-12-19 16:31:18,549 INFO  crawl.CrawlDb 
> - CrawlDb update: Merging segment data into db. 2011-12-19 
> 16:31:19,873 INFO  regex.RegexURLNormalizer - can't find rules for 
> scope 'crawldb', using default 2011-12-19 16:31:20,046 INFO  
> regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using 
> default 2011-12-19 16:31:20,204 INFO  crawl.FetchScheduleFactory - Using 
> FetchSchedule impl:
> org.apache.nutch.crawl.DefaultFetchSchedule 2011-12-19 16:31:20,204 
> INFO crawl.AbstractFetchSchedule - defaultInterval=2592000 2011-12-19
> 16:31:20,204 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
> 2011-12-19 16:31:20,771 INFO  crawl.CrawlDb - CrawlDb update: finished 
> at
> 2011-12-19 16:31:20, elapsed: 00:00:02 2011-12-19 16:31:20,787 INFO 
> crawl.LinkDb - LinkDb: starting at 2011-12-19 16:31:20 2011-12-19
> 16:31:20,787 INFO  crawl.LinkDb - LinkDb: linkdb: mit-c-crawl/linkdb
> 2011-12-19 16:31:20,787 INFO  crawl.LinkDb - LinkDb: URL normalize: 
> true
> 2011-12-19 16:31:20,787 INFO  crawl.LinkDb - LinkDb: URL filter: true
> 2011-12-19 16:31:20,787 INFO  crawl.LinkDb - LinkDb: adding segment:
> file:/C:/apache/apache-nutch-1.4/runtime/local/mit-c-crawl/segments/20111219163111
> 2011-12-19 16:31:22,898 INFO  crawl.LinkDb - LinkDb: finished 
> at
> 2011-12-19 16:31:22, elapsed: 00:00:02 2011-12-19 16:31:22,898 INFO 
> crawl.Crawl - crawl finished: mit-c-crawl 2011-12-19 16:32:08,061 INFO 
> solr.SolrIndexer - SolrIndexer: starting at 2011-12-19 16:32:08 
> 2011-12-19
> 16:32:08,093 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb:
> mit-c-crawl/crawldb 2011-12-19 16:32:08,093 INFO  
> indexer.IndexerMapReduce
> - IndexerMapReduce: linkdb: mit-c-crawl/linkdb 2011-12-19 16:32:08,093 
> INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment:
> mit-c-crawl/segments/20111219163111 2011-12-19 16:32:09,984 WARN 
> util.NativeCodeLoader - Unable to load native-hadoop library for your 
> platform... using builtin-java classes where applicable 2011-12-19
> 16:32:10,141 INFO  plugin.PluginRepository - Plugins: looking in:
> C:\Apache\apache-nutch-1.4\runtime\local\plugins 2011-12-19 
> 16:32:10,220 INFO  plugin.PluginRepository - Plugin Auto-activation 
> mode: [true]
> 2011-12-19 16:32:10,220 INFO  plugin.PluginRepository - Registered
> Plugins: 2011-12-19 16:32:10,220 INFO  plugin.PluginRepository -          
>      the nutch core extension points (nutch-extensionpoints) 2011-12-19
> 16:32:10,220 INFO  plugin.PluginRepository -                Basic URL
> Normalizer (urlnormalizer-basic) 2011-12-19 16:32:10,220 INFO 
> plugin.PluginRepository -                Html Parse Plug-in (parse-html)
> 2011-12-19 16:32:10,220 INFO  plugin.PluginRepository -               
> Basic Indexing Filter (index-basic) 2011-12-19 16:32:10,220 INFO 
> plugin.PluginRepository -                Http / Https Protocol Plug-in
> (protocol-httpclient) 2011-12-19 16:32:10,220 INFO 
> plugin.PluginRepository -                HTTP Framework (lib-http)
> 2011-12-19 16:32:10,220 INFO  plugin.PluginRepository -               
> Regex URL Filter (urlfilter-regex) 2011-12-19 16:32:10,220 INFO 
> plugin.PluginRepository -                Pass-through URL Normalizer
> (urlnormalizer-pass) 2011-12-19 16:32:10,220 INFO  plugin.PluginRepository
> -                Http Protocol Plug-in (protocol-http) 2011-12-19
> 16:32:10,220 INFO  plugin.PluginRepository -                Regex URL
> Normalizer (urlnormalizer-regex) 2011-12-19 16:32:10,220 INFO 
> plugin.PluginRepository -                Tika Parser Plug-in (parse-tika)
> 2011-12-19 16:32:10,220 INFO  plugin.PluginRepository -               
> OPIC Scoring Plug-in (scoring-opic) 2011-12-19 16:32:10,220 INFO 
> plugin.PluginRepository -                CyberNeko HTML Parser
> (lib-nekohtml) 2011-12-19 16:32:10,220 INFO  plugin.PluginRepository -    
>            Anchor Indexing Filter (index-anchor) 2011-12-19 16:32:10,220
> INFO  plugin.PluginRepository -                URL Meta Indexing Filter
> (urlmeta) 2011-12-19 16:32:10,220 INFO  plugin.PluginRepository -         
>       Regex URL Filter Framework (lib-regex-filter) 2011-12-19
> 16:32:10,220 INFO  plugin.PluginRepository - Registered Extension-Points:
> 2011-12-19 16:32:10,220 INFO  plugin.PluginRepository -               
> Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer) 2011-12-19
> 16:32:10,220 INFO  plugin.PluginRepository -                Nutch Protocol
> (org.apache.nutch.protocol.Protocol) 2011-12-19 16:32:10,220 INFO 
> plugin.PluginRepository -                Nutch Segment Merge Filter
> (org.apache.nutch.segment.SegmentMergeFilter) 2011-12-19 16:32:10,220 INFO
>  plugin.PluginRepository -                Nutch URL Filter
> (org.apache.nutch.net.URLFilter) 2011-12-19 16:32:10,220 INFO 
> plugin.PluginRepository -                Nutch Indexing Filter
> (org.apache.nutch.indexer.IndexingFilter) 2011-12-19 16:32:10,220 INFO 
> plugin.PluginRepository -                HTML Parse Filter
> (org.apache.nutch.parse.HtmlParseFilter) 2011-12-19 16:32:10,220 INFO 
> plugin.PluginRepository -                Nutch Content Parser
> (org.apache.nutch.parse.Parser) 2011-12-19 16:32:10,220 INFO 
> plugin.PluginRepository -                Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter) 2011-12-19 16:32:10,252 INFO 
> indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.basic.BasicIndexingFilter 2011-12-19 
> 16:32:10,283 INFO  anchor.AnchorIndexingFilter - Anchor deduplication 
> is: off
> 2011-12-19 16:32:10,283 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter 2011-12-19
> 16:32:10,283 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter 2011-12-19
> 16:32:11,276 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.basic.BasicIndexingFilter 2011-12-19 
> 16:32:11,276 INFO  anchor.AnchorIndexingFilter - Anchor deduplication 
> is: off
> 2011-12-19 16:32:11,276 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter 2011-12-19
> 16:32:11,276 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter 2011-12-19
> 16:32:11,402 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.basic.BasicIndexingFilter 2011-12-19 
> 16:32:11,402 INFO  anchor.AnchorIndexingFilter - Anchor deduplication 
> is: off
> 2011-12-19 16:32:11,402 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter 2011-12-19
> 16:32:11,402 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter 2011-12-19
> 16:32:11,544 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.basic.BasicIndexingFilter 2011-12-19 
> 16:32:11,544 INFO  anchor.AnchorIndexingFilter - Anchor deduplication 
> is: off
> 2011-12-19 16:32:11,544 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter 2011-12-19
> 16:32:11,544 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter 2011-12-19
> 16:32:11,686 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.basic.BasicIndexingFilter 2011-12-19 
> 16:32:11,686 INFO  anchor.AnchorIndexingFilter - Anchor deduplication 
> is: off
> 2011-12-19 16:32:11,686 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter 2011-12-19
> 16:32:11,686 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter 2011-12-19
> 16:32:11,906 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.basic.BasicIndexingFilter 2011-12-19 
> 16:32:11,906 INFO  anchor.AnchorIndexingFilter - Anchor deduplication 
> is: off
> 2011-12-19 16:32:11,906 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter 2011-12-19
> 16:32:11,906 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter 2011-12-19
> 16:32:11,985 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.basic.BasicIndexingFilter 2011-12-19 
> 16:32:11,985 INFO  anchor.AnchorIndexingFilter - Anchor deduplication 
> is: off
> 2011-12-19 16:32:11,985 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter 2011-12-19
> 16:32:11,985 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter 2011-12-19
> 16:32:12,111 INFO  solr.SolrMappingReader - source: content dest: 
> content
> 2011-12-19 16:32:12,111 INFO  solr.SolrMappingReader - source: site dest:
> site 2011-12-19 16:32:12,111 INFO  solr.SolrMappingReader - source: 
> title
> dest: title 2011-12-19 16:32:12,111 INFO  solr.SolrMappingReader - source:
> host dest: host 2011-12-19 16:32:12,111 INFO  solr.SolrMappingReader -
> source: segment dest: segment 2011-12-19 16:32:12,111 INFO 
> solr.SolrMappingReader - source: boost dest: boost 2011-12-19 
> 16:32:12,111 INFO  solr.SolrMappingReader - source: digest dest: 
> digest 2011-12-19
> 16:32:12,111 INFO  solr.SolrMappingReader - source: tstamp dest: 
> tstamp
> 2011-12-19 16:32:12,111 INFO  solr.SolrMappingReader - source: url dest:
> id 2011-12-19 16:32:12,111 INFO  solr.SolrMappingReader - source: url
> dest: url 2011-12-19 16:32:13,309 INFO  solr.SolrIndexer - SolrIndexer:
> finished at 2011-12-19 16:32:13, elapsed: 00:00:05

 
