[
https://issues.apache.org/jira/browse/NUTCH-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405639#comment-13405639
]
Arijit Mukherjee commented on NUTCH-1418:
-----------------------------------------
Hi,
I have seen that the links on the page
http://en.wikipedia.org/wiki/Districts_of_India are not picked up as outlinks
by the fetch/parse process. However, parsechecker is able to pick up all the
links as outlinks. Looking through hadoop.log, I found that this WARN is the
only problem reported during fetch, after which fetch bails out. So I believe
that fetch bails out on seeing this WARN.
I have copy-pasted the contents of my hadoop.log below. It contains the log
from fetch (where the WARN occurs) as well as the log from parsechecker.
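For reference, the paths named in the WARN contain "%3M", which is not a valid percent-escape because "M" is not a hex digit. The sketch below is not Nutch code; it only uses the standard java.net.URLDecoder to show that such a path cannot be decoded. The class name RobotsPathDecodeCheck and the "%3A" variant (my guess at what the robots.txt entry was meant to be) are assumptions for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper, not part of Nutch.
public class RobotsPathDecodeCheck {

    // Returns true if the path can be percent-decoded as UTF-8.
    static boolean canDecode(String path) {
        try {
            URLDecoder.decode(path, "UTF-8");
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown for a malformed escape such as "%3M".
            return false;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] paths = {
            // Path exactly as it appears in the WARN line.
            "/wiki/Wikipedia%3Mediation_Committee/",
            // Assumed intended form: "%3A" is the escape for ':'.
            "/wiki/Wikipedia%3AMediation_Committee/"
        };
        for (String p : paths) {
            System.out.println(canDecode(p) + "  " + p);
        }
    }
}
```

A path that fails this check would only affect the single robots.txt rule it belongs to, so on its own it does not show whether the remaining rules were applied.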
=============================hadoop.log=========================================================
2012-07-02 16:40:35,300 INFO crawl.Injector - Injector: starting at 2012-07-02
16:40:35
2012-07-02 16:40:35,301 INFO crawl.Injector - Injector: crawlDb:
/root/arijit/crawler/crawl/crawldb
2012-07-02 16:40:35,301 INFO crawl.Injector - Injector: urlDir:
/root/arijit/crawler/urls
2012-07-02 16:40:35,301 INFO crawl.Injector - Injector: Converting injected
urls to crawl db entries.
2012-07-02 16:40:35,863 INFO plugin.PluginRepository - Plugins: looking in:
/root/arijit/apache-nutch-1.4-bin/runtime/local/plugins
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Plugin Auto-activation
mode: [true]
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Registered Plugins:
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Html Parse
Plug-in (parse-html)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Pass-through
URL Normalizer (urlnormalizer-pass)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Regex URL
Filter (urlfilter-regex)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Tika Parser
Plug-in (parse-tika)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Anchor Indexing
Filter (index-anchor)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - JavaScript
Parser (parse-js)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - Regex URL
Filter Framework (lib-regex-filter)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - Registered
Extension-Points:
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - Nutch Segment
Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2012-07-02 16:40:36,070 INFO regex.RegexURLNormalizer - can't find rules for
scope 'inject', using default
2012-07-02 16:40:36,696 INFO crawl.Injector - Injector: Merging injected urls
into crawl db.
2012-07-02 16:40:36,999 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2012-07-02 16:40:37,880 INFO crawl.Injector - Injector: finished at 2012-07-02
16:40:37, elapsed: 00:00:02
2012-07-02 16:40:41,619 INFO crawl.Generator - Generator: starting at
2012-07-02 16:40:41
2012-07-02 16:40:41,619 INFO crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2012-07-02 16:40:41,619 INFO crawl.Generator - Generator: filtering: true
2012-07-02 16:40:41,620 INFO crawl.Generator - Generator: normalizing: true
2012-07-02 16:40:41,621 INFO crawl.Generator - Generator: jobtracker is
'local', generating exactly one partition.
2012-07-02 16:40:42,099 INFO plugin.PluginRepository - Plugins: looking in:
/root/arijit/apache-nutch-1.4-bin/runtime/local/plugins
2012-07-02 16:40:42,235 INFO plugin.PluginRepository - Plugin Auto-activation
mode: [true]
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - Registered Plugins:
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - Html Parse
Plug-in (parse-html)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - Pass-through
URL Normalizer (urlnormalizer-pass)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - Regex URL
Filter (urlfilter-regex)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - Tika Parser
Plug-in (parse-tika)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - Anchor Indexing
Filter (index-anchor)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - JavaScript
Parser (parse-js)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - Regex URL
Filter Framework (lib-regex-filter)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - Registered
Extension-Points:
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - Nutch Segment
Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2012-07-02 16:40:42,236 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2012-07-02 16:40:42,290 INFO crawl.FetchScheduleFactory - Using FetchSchedule
impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-07-02 16:40:42,290 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000
2012-07-02 16:40:42,290 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2012-07-02 16:40:42,311 INFO regex.RegexURLNormalizer - can't find rules for
scope 'partition', using default
2012-07-02 16:40:42,392 INFO crawl.FetchScheduleFactory - Using FetchSchedule
impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-07-02 16:40:42,392 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000
2012-07-02 16:40:42,392 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2012-07-02 16:40:42,394 INFO regex.RegexURLNormalizer - can't find rules for
scope 'generate_host_count', using default
2012-07-02 16:40:42,857 INFO crawl.Generator - Generator: Partitioning
selected urls for politeness.
2012-07-02 16:40:43,858 INFO crawl.Generator - Generator: segment:
/root/arijit/crawler/crawl/segments/20120702164043
2012-07-02 16:40:44,036 INFO regex.RegexURLNormalizer - can't find rules for
scope 'partition', using default
2012-07-02 16:40:44,969 INFO crawl.Generator - Generator: finished at
2012-07-02 16:40:44, elapsed: 00:00:03
2012-07-02 16:41:05,715 INFO fetcher.Fetcher - Fetcher: starting at 2012-07-02
16:41:05
2012-07-02 16:41:05,715 INFO fetcher.Fetcher - Fetcher: segment:
/root/arijit/crawler/crawl/segments/20120702164043
2012-07-02 16:41:06,292 INFO fetcher.Fetcher - Using queue mode : byHost
2012-07-02 16:41:06,292 INFO fetcher.Fetcher - Fetcher: threads: 10
2012-07-02 16:41:06,292 INFO fetcher.Fetcher - Fetcher: time-out divisor: 2
2012-07-02 16:41:06,305 INFO plugin.PluginRepository - Plugins: looking in:
/root/arijit/apache-nutch-1.4-bin/runtime/local/plugins
2012-07-02 16:41:06,305 INFO fetcher.Fetcher - QueueFeeder finished: total 1
records + hit by time limit :0
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Plugin Auto-activation
mode: [true]
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Registered Plugins:
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Html Parse
Plug-in (parse-html)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Pass-through
URL Normalizer (urlnormalizer-pass)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Regex URL
Filter (urlfilter-regex)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Tika Parser
Plug-in (parse-tika)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Anchor Indexing
Filter (index-anchor)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - JavaScript
Parser (parse-js)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Regex URL
Filter Framework (lib-regex-filter)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Registered
Extension-Points:
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Nutch Segment
Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2012-07-02 16:41:06,467 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2012-07-02 16:41:06,523 INFO fetcher.Fetcher - Using queue mode : byHost
2012-07-02 16:41:06,530 INFO fetcher.Fetcher - Using queue mode : byHost
2012-07-02 16:41:06,531 INFO fetcher.Fetcher - fetching
http://en.wikipedia.org/wiki/Districts_of_India/
2012-07-02 16:41:06,535 INFO fetcher.Fetcher - Using queue mode : byHost
2012-07-02 16:41:06,535 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2012-07-02 16:41:06,536 INFO fetcher.Fetcher - Using queue mode : byHost
2012-07-02 16:41:06,536 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2012-07-02 16:41:06,540 INFO http.Http - http.proxy.host = null
2012-07-02 16:41:06,540 INFO http.Http - http.proxy.port = 8080
2012-07-02 16:41:06,540 INFO http.Http - http.timeout = 10000
2012-07-02 16:41:06,540 INFO http.Http - http.content.limit = 65536
2012-07-02 16:41:06,540 INFO http.Http - http.agent = My Nutch Spider/Nutch-1.4
2012-07-02 16:41:06,540 INFO http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2012-07-02 16:41:06,545 INFO fetcher.Fetcher - Using queue mode : byHost
2012-07-02 16:41:06,545 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2012-07-02 16:41:06,546 INFO fetcher.Fetcher - Using queue mode : byHost
2012-07-02 16:41:06,547 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2012-07-02 16:41:06,547 INFO fetcher.Fetcher - Using queue mode : byHost
2012-07-02 16:41:06,547 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2012-07-02 16:41:06,547 INFO fetcher.Fetcher - Using queue mode : byHost
2012-07-02 16:41:06,547 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2012-07-02 16:41:06,547 INFO fetcher.Fetcher - Using queue mode : byHost
2012-07-02 16:41:06,547 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2012-07-02 16:41:06,547 INFO fetcher.Fetcher - Using queue mode : byHost
2012-07-02 16:41:06,555 INFO fetcher.Fetcher - Fetcher: throughput threshold:
-1
2012-07-02 16:41:06,555 INFO fetcher.Fetcher - Fetcher: throughput threshold
retries: 5
2012-07-02 16:41:06,555 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2012-07-02 16:41:06,561 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots
rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots
rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots
rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
2012-07-02 16:41:07,555 INFO fetcher.Fetcher - -activeThreads=1,
spinWaiting=0, fetchQueues.totalSize=0
2012-07-02 16:41:08,556 INFO fetcher.Fetcher - -activeThreads=1,
spinWaiting=0, fetchQueues.totalSize=0
2012-07-02 16:41:08,613 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=0
2012-07-02 16:41:09,556 INFO fetcher.Fetcher - -activeThreads=0,
spinWaiting=0, fetchQueues.totalSize=0
2012-07-02 16:41:09,556 INFO fetcher.Fetcher - -activeThreads=0
2012-07-02 16:41:09,653 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2012-07-02 16:41:10,092 INFO fetcher.Fetcher - Fetcher: finished at 2012-07-02
16:41:10, elapsed: 00:00:04
2012-07-02 16:42:39,351 INFO parse.ParseSegment - ParseSegment: starting at
2012-07-02 16:42:39
2012-07-02 16:42:39,352 INFO parse.ParseSegment - ParseSegment: segment:
/root/arijit/crawler/crawl/segments/20120702164043
2012-07-02 16:42:39,707 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2012-07-02 16:42:39,870 INFO plugin.PluginRepository - Plugins: looking in:
/root/arijit/apache-nutch-1.4-bin/runtime/local/plugins
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - Plugin Auto-activation
mode: [true]
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - Registered Plugins:
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - Html Parse
Plug-in (parse-html)
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - Pass-through
URL Normalizer (urlnormalizer-pass)
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - Regex URL
Filter (urlfilter-regex)
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - Tika Parser
Plug-in (parse-tika)
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - Anchor Indexing
Filter (index-anchor)
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - JavaScript
Parser (parse-js)
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - Regex URL
Filter Framework (lib-regex-filter)
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - Registered
Extension-Points:
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - Nutch Segment
Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2012-07-02 16:42:40,095 INFO plugin.PluginRepository - Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2012-07-02 16:42:40,096 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2012-07-02 16:42:40,096 INFO plugin.PluginRepository - HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2012-07-02 16:42:40,096 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2012-07-02 16:42:40,096 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2012-07-02 16:42:40,663 INFO parse.ParseSegment - ParseSegment: finished at
2012-07-02 16:42:40, elapsed: 00:00:01
2012-07-02 16:42:45,884 INFO crawl.CrawlDb - CrawlDb update: starting at
2012-07-02 16:42:45
2012-07-02 16:42:45,884 INFO crawl.CrawlDb - CrawlDb update: db:
/root/arijit/crawler/crawl/crawldb
2012-07-02 16:42:45,885 INFO crawl.CrawlDb - CrawlDb update: segments:
[/root/arijit/crawler/crawl/segments/20120702164043]
2012-07-02 16:42:45,885 INFO crawl.CrawlDb - CrawlDb update: additions
allowed: true
2012-07-02 16:42:45,885 INFO crawl.CrawlDb - CrawlDb update: URL normalizing:
false
2012-07-02 16:42:45,885 INFO crawl.CrawlDb - CrawlDb update: URL filtering:
false
2012-07-02 16:42:45,885 INFO crawl.CrawlDb - CrawlDb update: 404 purging: false
2012-07-02 16:42:45,887 INFO crawl.CrawlDb - CrawlDb update: Merging segment
data into db.
2012-07-02 16:42:46,403 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2012-07-02 16:42:46,582 INFO plugin.PluginRepository - Plugins: looking in:
/root/arijit/apache-nutch-1.4-bin/runtime/local/plugins
2012-07-02 16:42:46,686 INFO plugin.PluginRepository - Plugin Auto-activation
mode: [true]
2012-07-02 16:42:46,686 INFO plugin.PluginRepository - Registered Plugins:
2012-07-02 16:42:46,686 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2012-07-02 16:42:46,686 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2012-07-02 16:42:46,686 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2012-07-02 16:42:46,686 INFO plugin.PluginRepository - Html Parse
Plug-in (parse-html)
2012-07-02 16:42:46,686 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2012-07-02 16:42:46,686 INFO plugin.PluginRepository - Pass-through
URL Normalizer (urlnormalizer-pass)
2012-07-02 16:42:46,686 INFO plugin.PluginRepository - Regex URL
Filter (urlfilter-regex)
2012-07-02 16:42:46,686 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2012-07-02 16:42:46,686 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2012-07-02 16:42:46,686 INFO plugin.PluginRepository - Tika Parser
Plug-in (parse-tika)
2012-07-02 16:42:46,686 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2012-07-02 16:42:46,686 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2012-07-02 16:42:46,686 INFO plugin.PluginRepository - Anchor Indexing
Filter (index-anchor)
2012-07-02 16:42:46,687 INFO plugin.PluginRepository - JavaScript
Parser (parse-js)
2012-07-02 16:42:46,687 INFO plugin.PluginRepository - Regex URL
Filter Framework (lib-regex-filter)
2012-07-02 16:42:46,687 INFO plugin.PluginRepository - Registered
Extension-Points:
2012-07-02 16:42:46,687 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2012-07-02 16:42:46,687 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2012-07-02 16:42:46,687 INFO plugin.PluginRepository - Nutch Segment
Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2012-07-02 16:42:46,687 INFO plugin.PluginRepository - Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2012-07-02 16:42:46,687 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2012-07-02 16:42:46,687 INFO plugin.PluginRepository - HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2012-07-02 16:42:46,687 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2012-07-02 16:42:46,687 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2012-07-02 16:42:46,691 INFO crawl.FetchScheduleFactory - Using FetchSchedule
impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-07-02 16:42:46,692 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000
2012-07-02 16:42:46,692 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2012-07-02 16:42:47,161 INFO crawl.CrawlDb - CrawlDb update: finished at
2012-07-02 16:42:47, elapsed: 00:00:01
2012-07-02 16:43:05,676 INFO segment.SegmentReader - SegmentReader: dump
segment: /root/arijit/crawler/crawl/segments/20120702164043
2012-07-02 16:43:05,716 WARN mapred.JobClient - Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
2012-07-02 16:43:06,210 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2012-07-02 16:43:07,009 INFO segment.SegmentReader - SegmentReader: done
2012-07-02 17:04:22,551 INFO parse.ParserChecker - fetching:
http://en.wikipedia.org/wiki/Districts_of_India
2012-07-02 17:04:22,559 INFO plugin.PluginRepository - Plugins: looking in:
/root/arijit/apache-nutch-1.4-bin/runtime/local/plugins
2012-07-02 17:04:22,676 INFO plugin.PluginRepository - Plugin Auto-activation
mode: [true]
2012-07-02 17:04:22,676 INFO plugin.PluginRepository - Registered Plugins:
2012-07-02 17:04:22,676 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2012-07-02 17:04:22,676 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2012-07-02 17:04:22,676 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2012-07-02 17:04:22,676 INFO plugin.PluginRepository - Html Parse
Plug-in (parse-html)
2012-07-02 17:04:22,676 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2012-07-02 17:04:22,676 INFO plugin.PluginRepository - Pass-through
URL Normalizer (urlnormalizer-pass)
2012-07-02 17:04:22,676 INFO plugin.PluginRepository - Regex URL
Filter (urlfilter-regex)
2012-07-02 17:04:22,676 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2012-07-02 17:04:22,676 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2012-07-02 17:04:22,676 INFO plugin.PluginRepository - Tika Parser
Plug-in (parse-tika)
2012-07-02 17:04:22,676 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2012-07-02 17:04:22,676 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2012-07-02 17:04:22,676 INFO plugin.PluginRepository - Anchor Indexing
Filter (index-anchor)
2012-07-02 17:04:22,676 INFO plugin.PluginRepository - JavaScript
Parser (parse-js)
2012-07-02 17:04:22,676 INFO plugin.PluginRepository - Regex URL
Filter Framework (lib-regex-filter)
2012-07-02 17:04:22,676 INFO plugin.PluginRepository - Registered
Extension-Points:
2012-07-02 17:04:22,676 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2012-07-02 17:04:22,677 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2012-07-02 17:04:22,677 INFO plugin.PluginRepository - Nutch Segment
Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2012-07-02 17:04:22,677 INFO plugin.PluginRepository - Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2012-07-02 17:04:22,677 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2012-07-02 17:04:22,677 INFO plugin.PluginRepository - HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2012-07-02 17:04:22,677 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2012-07-02 17:04:22,677 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2012-07-02 17:04:22,684 INFO http.Http - http.proxy.host = null
2012-07-02 17:04:22,684 INFO http.Http - http.proxy.port = 8080
2012-07-02 17:04:22,684 INFO http.Http - http.timeout = 10000
2012-07-02 17:04:22,684 INFO http.Http - http.content.limit = 65536
2012-07-02 17:04:22,684 INFO http.Http - http.agent = My Nutch Spider/Nutch-1.4
2012-07-02 17:04:22,684 INFO http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2012-07-02 17:04:24,741 INFO parse.ParserChecker - parsing:
http://en.wikipedia.org/wiki/Districts_of_India
2012-07-02 17:04:24,741 INFO parse.ParserChecker - contentType:
application/xhtml+xml
=============================hadoop.log=========================================================
I do not have any exotic rules in regex-urlfilter.txt. It contains just +^.*
so I do not believe that it is filtering the outlinks out.
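To back this up, here is a minimal sketch of how I understand a '+'/'-' prefixed regex rule list to be evaluated. It is a simplification and not the urlfilter-regex plugin itself: the class name UrlFilterSketch, the first-match-wins order and the reject-when-nothing-matches default are my assumptions about the behaviour.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the Nutch urlfilter-regex implementation.
public class UrlFilterSketch {

    // Each rule is '+' (accept) or '-' (reject) followed by a regex.
    // Assumption: the first rule whose regex is found in the URL decides.
    static boolean accepts(String[] rules, String url) {
        for (String rule : rules) {
            boolean allow = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return allow;
            }
        }
        // Assumption: a URL that matches no rule is rejected.
        return false;
    }

    public static void main(String[] args) {
        String[] rules = { "+^.*" };
        String url = "http://en.wikipedia.org/wiki/Districts_of_India";
        System.out.println(accepts(rules, url));
    }
}
```

Since ^.* is found in every string, a rule list holding only +^.* accepts any URL under these assumptions.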
-Arijit
> error parsing robots rules- can't decode path:
> /wiki/Wikipedia%3Mediation_Committee/
> ------------------------------------------------------------------------------------
>
> Key: NUTCH-1418
> URL: https://issues.apache.org/jira/browse/NUTCH-1418
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.4
> Reporter: Arijit Mukherjee
>
> Since learning that nutch will be unable to crawl the javascript function
> calls in href, I started looking for other alternatives. I decided to crawl
> http://en.wikipedia.org/wiki/Districts_of_India.
> I first tried injecting this URL and following the step-by-step approach
> up to the fetcher, when I realized that nutch did not fetch anything from this
> website. I looked into logs/hadoop.log and found the following 3 lines,
> which I believe could be saying that nutch is unable to parse the
> robots.txt of the website and that the fetcher therefore stopped.
>
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots
> rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots
> rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots
> rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
> I tried checking the URL using parsechecker and there were no issues there. I
> think this means that the robots.txt is malformed for this website, which is
> preventing the fetcher from fetching anything. Is there a way to get around this
> problem, given that parsechecker seems to go on its merry way parsing?
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira