[jira] Created: (NUTCH-852) parser not found for contentType=application/xhtml+xml

Pham Tuan Minh (JIRA) Tue, 13 Jul 2010 12:30:19 -0700

parser not found for contentType=application/xhtml+xml
------------------------------------------------------


                 Key: NUTCH-852
                 URL: https://issues.apache.org/jira/browse/NUTCH-852
             Project: Nutch
          Issue Type: Bug
         Environment: Windows XP SP3, Cygwin
            Reporter: Pham Tuan Minh
             Fix For: 2.0


I configured Nutch trunk to crawl a sample site 
(http://www.lucidimagination.com/) and post the results to a Solr server for 
indexing, but I got the following error. It seems the Tika parser is not 
working properly, or the Tika libraries are not being recognized.
----------------------
$ bin/nutch-local crawl urls -solr http://127.0.0.1:8983/solr/ -dir crawl -depth 3 -topN 50
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=http://127.0.0.1:8983/solr/
topN = 50
Injector: starting at 2010-07-14 02:08:20
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2010-07-14 02:08:31, elapsed: 00:00:11
Generator: starting at 2010-07-14 02:08:32
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20100714020838
Generator: finished at 2010-07-14 02:08:42, elapsed: 00:00:10
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2010-07-14 02:08:42
Fetcher: segment: crawl/segments/20100714020838
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
fetching http://www.lucidimagination.com/
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=5

-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=9
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
Error parsing: http://www.lucidimagination.com/: org.apache.nutch.parse.ParseException: parser not found for contentType=application/xhtml+xml url=http://www.lucidimagination.com/
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:879)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:647)

-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2010-07-14 02:08:54, elapsed: 00:00:12
CrawlDb update: starting at 2010-07-14 02:08:54
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20100714020838]
CrawlDb update: additions allowed: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2010-07-14 02:09:01, elapsed: 00:00:07
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2010-07-14 02:09:06
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714014136
LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714015544
LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714020206
LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714020232
LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714020838
LinkDb: merging with existing linkdb: crawl/linkdb
LinkDb: finished at 2010-07-14 02:09:19, elapsed: 00:00:12
SolrIndexer: starting at 2010-07-14 02:09:19
SolrIndexer: finished at 2010-07-14 02:09:36, elapsed: 00:00:17
SolrDeleteDuplicates: starting at 2010-07-14 02:09:41
SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/
SolrDeleteDuplicates: finished at 2010-07-14 02:09:45, elapsed: 00:00:04
crawl finished: crawl
----------------------

Thanks

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
