It seems that robots.txt in libraries.mit.edu has a lot of restrictions.

Alex.

-----Original Message-----
From: Chip Calhoun <ccalh...@aip.org>
To: user <user@nutch.apache.org>; 'markus.jel...@openindex.io' <markus.jel...@openindex.io>
Sent: Tue, Dec 20, 2011 7:28 am
Subject: RE: Can't crawl a domain; can't figure out why.

I just compared this against a similar crawl of a completely different domain which I know works, and you're right on both counts. The parser doesn't parse a file, and nothing is sent to the solrindexer. I tried a crawl with more documents and found that while I can get documents from mit.edu, I get absolutely nothing from libraries.mit.edu. I get the same effect using Nutch 1.3 as well.

I don't think we're dealing with truncated files. I'm willing to believe it's a parse error, but how could I tell? I've spoken with some helpful people from MIT, and they don't see a reason why this wouldn't work.

Chip

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Monday, December 19, 2011 5:01 PM
To: user@nutch.apache.org
Subject: Re: Can't crawl a domain; can't figure out why.

Nothing peculiar, looks like Nutch 1.4 right? But you also didn't mention the domain you can't crawl. libraries.mit.edu seems to work, although the indexer doesn't seem to send a document in and the parser doesn't mention parsing that file.

Either the file throws a parse error or is truncated or ....

> I'm trying to crawl pages from a number of domains, and one of these
> domains has been giving me trouble. The really irritating thing is
> that it did work at least once, which led me to believe that I'd
> solved the problem. I can't think of anything at this point but to
> paste my log of a failed crawl and solrindex and hope that someone can
> think of anything I've overlooked. Does anything look strange here?
>
> Thanks,
> Chip
>
> 2011-12-19 16:31:01,010 WARN crawl.Crawl - solrUrl is not set, indexing will be skipped...
> 2011-12-19 16:31:01,404 INFO crawl.Crawl - crawl started in: mit-c-crawl
> 2011-12-19 16:31:01,420 INFO crawl.Crawl - rootUrlDir = mit-c-urls
> 2011-12-19 16:31:01,420 INFO crawl.Crawl - threads = 10
> 2011-12-19 16:31:01,420 INFO crawl.Crawl - depth = 1
> 2011-12-19 16:31:01,420 INFO crawl.Crawl - solrUrl=null
> 2011-12-19 16:31:01,420 INFO crawl.Crawl - topN = 500000
> 2011-12-19 16:31:01,420 INFO crawl.Injector - Injector: starting at 2011-12-19 16:31:01
> 2011-12-19 16:31:01,420 INFO crawl.Injector - Injector: crawlDb: mit-c-crawl/crawldb
> 2011-12-19 16:31:01,420 INFO crawl.Injector - Injector: urlDir: mit-c-urls
> 2011-12-19 16:31:01,436 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
> 2011-12-19 16:31:02,854 INFO plugin.PluginRepository - Plugins: looking in: C:\Apache\apache-nutch-1.4\runtime\local\plugins
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Registered Plugins:
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - HTTP Framework (lib-http)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - URL Meta Indexing Filter (urlmeta)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Registered Extension-Points:
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
> 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 2011-12-19 16:31:02,964 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> 2011-12-19 16:31:05,722 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
> 2011-12-19 16:31:07,014 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2011-12-19 16:31:07,897 INFO crawl.Injector - Injector: finished at 2011-12-19 16:31:07, elapsed: 00:00:06
> 2011-12-19 16:31:07,913 INFO crawl.Generator - Generator: starting at 2011-12-19 16:31:07
> 2011-12-19 16:31:07,913 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> 2011-12-19 16:31:07,913 INFO crawl.Generator - Generator: filtering: true
> 2011-12-19 16:31:07,913 INFO crawl.Generator - Generator: normalizing: true
> 2011-12-19 16:31:07,913 INFO crawl.Generator - Generator: topN: 500000
> 2011-12-19 16:31:07,913 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
> 2011-12-19 16:31:09,157 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2011-12-19 16:31:09,157 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2011-12-19 16:31:09,157 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2011-12-19 16:31:09,157 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
> 2011-12-19 16:31:09,189 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2011-12-19 16:31:09,189 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2011-12-19 16:31:09,189 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2011-12-19 16:31:09,189 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
> 2011-12-19 16:31:10,071 INFO crawl.Generator - Generator: Partitioning selected urls for politeness.
> 2011-12-19 16:31:11,080 INFO crawl.Generator - Generator: segment: mit-c-crawl/segments/20111219163111
> 2011-12-19 16:31:12,309 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
> 2011-12-19 16:31:13,223 INFO crawl.Generator - Generator: finished at 2011-12-19 16:31:13, elapsed: 00:00:05
> 2011-12-19 16:31:13,239 INFO fetcher.Fetcher - Fetcher: starting at 2011-12-19 16:31:13
> 2011-12-19 16:31:13,239 INFO fetcher.Fetcher - Fetcher: segment: mit-c-crawl/segments/20111219163111
> 2011-12-19 16:31:14,515 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 16:31:14,515 INFO fetcher.Fetcher - Fetcher: threads: 10
> 2011-12-19 16:31:14,515 INFO fetcher.Fetcher - Fetcher: time-out divisor: 2
> 2011-12-19 16:31:14,515 INFO fetcher.Fetcher - QueueFeeder finished: total 1 records + hit by time limit :0
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - fetching http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 16:31:14,531 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
> 2011-12-19 16:31:14,562 INFO httpclient.Http - http.proxy.host = null
> 2011-12-19 16:31:14,562 INFO httpclient.Http - http.proxy.port = 8080
> 2011-12-19 16:31:14,562 INFO httpclient.Http - http.timeout = 10000
> 2011-12-19 16:31:14,562 INFO httpclient.Http - http.content.limit = -1
> 2011-12-19 16:31:14,562 INFO httpclient.Http - http.agent = PHFAWS/Nutch-1.3 (American Institute of Physics: Physics History Finding Aids Web Site; http://www.aip.org/history/nbl/findingaids.html; ccalh...@aip.org)
> 2011-12-19 16:31:14,562 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2011-12-19 16:31:14,799 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
> 2011-12-19 16:31:15,539 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> 2011-12-19 16:31:15,539 INFO fetcher.Fetcher - -activeThreads=0
> 2011-12-19 16:31:16,390 INFO fetcher.Fetcher - Fetcher: finished at 2011-12-19 16:31:16, elapsed: 00:00:03
> 2011-12-19 16:31:16,390 INFO parse.ParseSegment - ParseSegment: starting at 2011-12-19 16:31:16
> 2011-12-19 16:31:16,390 INFO parse.ParseSegment - ParseSegment: segment: mit-c-crawl/segments/20111219163111
> 2011-12-19 16:31:18,533 INFO parse.ParseSegment - ParseSegment: finished at 2011-12-19 16:31:18, elapsed: 00:00:02
> 2011-12-19 16:31:18,549 INFO crawl.CrawlDb - CrawlDb update: starting at 2011-12-19 16:31:18
> 2011-12-19 16:31:18,549 INFO crawl.CrawlDb - CrawlDb update: db: mit-c-crawl/crawldb
> 2011-12-19 16:31:18,549 INFO crawl.CrawlDb - CrawlDb update: segments: [mit-c-crawl/segments/20111219163111]
> 2011-12-19 16:31:18,549 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
> 2011-12-19 16:31:18,549 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
> 2011-12-19 16:31:18,549 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
> 2011-12-19 16:31:18,549 INFO crawl.CrawlDb - CrawlDb update: 404 purging: false
> 2011-12-19 16:31:18,549 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
> 2011-12-19 16:31:19,873 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
> 2011-12-19 16:31:20,046 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
> 2011-12-19 16:31:20,204 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2011-12-19 16:31:20,204 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2011-12-19 16:31:20,204 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2011-12-19 16:31:20,771 INFO crawl.CrawlDb - CrawlDb update: finished at 2011-12-19 16:31:20, elapsed: 00:00:02
> 2011-12-19 16:31:20,787 INFO crawl.LinkDb - LinkDb: starting at 2011-12-19 16:31:20
> 2011-12-19 16:31:20,787 INFO crawl.LinkDb - LinkDb: linkdb: mit-c-crawl/linkdb
> 2011-12-19 16:31:20,787 INFO crawl.LinkDb - LinkDb: URL normalize: true
> 2011-12-19 16:31:20,787 INFO crawl.LinkDb - LinkDb: URL filter: true
> 2011-12-19 16:31:20,787 INFO crawl.LinkDb - LinkDb: adding segment: file:/C:/apache/apache-nutch-1.4/runtime/local/mit-c-crawl/segments/20111219163111
> 2011-12-19 16:31:22,898 INFO crawl.LinkDb - LinkDb: finished at 2011-12-19 16:31:22, elapsed: 00:00:02
> 2011-12-19 16:31:22,898 INFO crawl.Crawl - crawl finished: mit-c-crawl
> 2011-12-19 16:32:08,061 INFO solr.SolrIndexer - SolrIndexer: starting at 2011-12-19 16:32:08
> 2011-12-19 16:32:08,093 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: mit-c-crawl/crawldb
> 2011-12-19 16:32:08,093 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: mit-c-crawl/linkdb
> 2011-12-19 16:32:08,093 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: mit-c-crawl/segments/20111219163111
> 2011-12-19 16:32:09,984 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2011-12-19 16:32:10,141 INFO plugin.PluginRepository - Plugins: looking in: C:\Apache\apache-nutch-1.4\runtime\local\plugins
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Registered Plugins:
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - HTTP Framework (lib-http)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - URL Meta Indexing Filter (urlmeta)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Registered Extension-Points:
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
> 2011-12-19 16:32:10,220 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 2011-12-19 16:32:10,252 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2011-12-19 16:32:10,283 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2011-12-19 16:32:10,283 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2011-12-19 16:32:10,283 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
> 2011-12-19 16:32:11,276 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2011-12-19 16:32:11,276 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2011-12-19 16:32:11,276 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2011-12-19 16:32:11,276 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
> 2011-12-19 16:32:11,402 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2011-12-19 16:32:11,402 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2011-12-19 16:32:11,402 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2011-12-19 16:32:11,402 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
> 2011-12-19 16:32:11,544 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2011-12-19 16:32:11,544 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2011-12-19 16:32:11,544 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2011-12-19 16:32:11,544 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
> 2011-12-19 16:32:11,686 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2011-12-19 16:32:11,686 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2011-12-19 16:32:11,686 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2011-12-19 16:32:11,686 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
> 2011-12-19 16:32:11,906 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2011-12-19 16:32:11,906 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2011-12-19 16:32:11,906 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2011-12-19 16:32:11,906 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
> 2011-12-19 16:32:11,985 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2011-12-19 16:32:11,985 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2011-12-19 16:32:11,985 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2011-12-19 16:32:11,985 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
> 2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: content dest: content
> 2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: site dest: site
> 2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: title dest: title
> 2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: host dest: host
> 2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: segment dest: segment
> 2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: boost dest: boost
> 2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: digest dest: digest
> 2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
> 2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: url dest: id
> 2011-12-19 16:32:12,111 INFO solr.SolrMappingReader - source: url dest: url
> 2011-12-19 16:32:13,309 INFO solr.SolrIndexer - SolrIndexer: finished at 2011-12-19 16:32:13, elapsed: 00:00:05
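
One way to test the robots.txt theory from the top of this thread without running a full crawl might be a small offline check like the Python sketch below. It uses the standard library's `urllib.robotparser`, which is not what Nutch uses internally, so treat the result only as a rough indication. The rules in `SAMPLE_ROBOTS_TXT` are invented for illustration and are not the real contents of the robots.txt on libraries.mit.edu; the agent name `PHFAWS` and the URL are taken from the log above, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only. This is NOT the actual
# robots.txt served by libraries.mit.edu; paste the real file here
# before drawing any conclusion.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /archives/
"""

# Agent name and URL as they appear in the crawl log above.
AGENT = "PHFAWS"
URL = "http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html"


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    verdict = "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, AGENT, URL) else "disallowed"
    print(f"{AGENT} is {verdict} for {URL} under the sample rules")
```

To check against the live file instead of pasted text, `RobotFileParser` also offers `set_url()` and `read()`, which download the robots.txt over HTTP. If the URL comes back as disallowed, that would be consistent with a crawl that logs a `fetching` line but never parses or indexes the page.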