I have tried the NUTCH-444 "feed" plugin to enable spidering of RSS feeds:
/nutch-2007-06-27_06-52-44/plugins/feed
(that path is from a recent nightly build of Nutch).

When I attempt a crawl I get an IOException:

$ nutch crawl /usr/tmp/lee_urls.txt -dir /usr/tmp/lee_apollo -depth 2
crawl started in: /usr/tmp/lee_apollo
rootUrlDir = /usr/tmp/lee_urls.txt
threads = 10
depth = 2
Injector: starting
Injector: crawlDb: /usr/tmp/lee_apollo/crawldb
Injector: urlDir: /usr/tmp/lee_urls.txt
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
        3.14 real         1.92 user         0.30 sys
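
(The console trace only says the job failed; the real exception normally
lands in Nutch's log. Assuming the default log4j setup, which writes to
logs/hadoop.log under the Nutch install directory, something like this
should surface the root cause:

$ tail -n 100 logs/hadoop.log

A plugin that failed to load, or a missing class, would show up there.)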

The seed URL is:
http://www.mt-olympus.com/apollo/feed/
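
(Assuming curl is available, a quick check that the seed really serves a
feed, since Nutch picks its parser from the returned Content-Type:

$ curl -sI http://www.mt-olympus.com/apollo/feed/ | grep -i content-type
$ curl -s http://www.mt-olympus.com/apollo/feed/ | head -5

It should report an RSS/XML content type and begin with an <rss> or
<feed> element.)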

I enabled the feed plugin via this property in nutch-site.xml:
<property>
  <name>plugin.includes</name>
  
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|feed</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
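
As a naming sanity check (paths assume the nightly-build layout shown at
the top), the "feed" alternative in that regex must exactly match the
plugin's directory name, which can be confirmed with:

$ ls plugins | grep feed
$ grep 'id=' plugins/feed/plugin.xml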


As a sanity check, when I take out "feed" from the <value> above, it no
longer throws an exception (though, as the log shows, it fetches only the
seed URL and extracts nothing to crawl further):

$ nutch crawl /usr/tmp/lee_urls.txt -dir /usr/tmp/lee_apollo -depth 2 2>&1 | tee crawl.log
crawl started in: /usr/tmp/lee_apollo
rootUrlDir = /usr/tmp/lee_urls.txt
threads = 10
depth = 2
Injector: starting
Injector: crawlDb: /usr/tmp/lee_apollo/crawldb
Injector: urlDir: /usr/tmp/lee_urls.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: /usr/tmp/lee_apollo/segments/20070628155854
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: /usr/tmp/lee_apollo/segments/20070628155854
Fetcher: threads: 10
fetching http://www.mt-olympus.com/apollo/feed/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: /usr/tmp/lee_apollo/crawldb
CrawlDb update: segments: [/usr/tmp/lee_apollo/segments/20070628155854]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: /usr/tmp/lee_apollo/segments/20070628155907
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: /usr/tmp/lee_apollo/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: /usr/tmp/lee_apollo/segments/20070628155854
LinkDb: done
Indexer: starting
Indexer: linkdb: /usr/tmp/lee_apollo/linkdb
Indexer: adding segment: /usr/tmp/lee_apollo/segments/20070628155854
Indexing [http://www.mt-olympus.com/apollo/feed/] with analyzer [EMAIL PROTECTED] (null)
Optimizing index.
merging segments _ram_0 (1 docs) into _0 (1 docs)
[EMAIL PROTECTED] Thread-36: now checkpoint "segments_2" [isCommit = true]
[EMAIL PROTECTED] Thread-36:   IncRef "_0.fnm": pre-incr count is 0
[EMAIL PROTECTED] Thread-36:   IncRef "_0.fdx": pre-incr count is 0
[EMAIL PROTECTED] Thread-36:   IncRef "_0.fdt": pre-incr count is 0
[EMAIL PROTECTED] Thread-36:   IncRef "_0.tii": pre-incr count is 0
[EMAIL PROTECTED] Thread-36:   IncRef "_0.tis": pre-incr count is 0
[EMAIL PROTECTED] Thread-36:   IncRef "_0.frq": pre-incr count is 0
[EMAIL PROTECTED] Thread-36:   IncRef "_0.prx": pre-incr count is 0
[EMAIL PROTECTED] Thread-36:   IncRef "_0.nrm": pre-incr count is 0
[EMAIL PROTECTED] Thread-36: deleteCommits: now remove commit "segments_1"
[EMAIL PROTECTED] Thread-36:   DecRef "segments_1": pre-decr count is 1
[EMAIL PROTECTED] Thread-36: delete "segments_1"
Indexer: done
Dedup: starting
Dedup: adding indexes in: /usr/tmp/lee_apollo/indexes
Dedup: done
merging indexes to: /usr/tmp/lee_apollo/index
Adding /usr/tmp/lee_apollo/indexes/part-00000
done merging
crawl finished: /usr/tmp/lee_apollo
       30.45 real         8.40 user         2.26 sys
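
(For reference, the crawldb and segment can be inspected with the
standard readdb/readseg tools, which should confirm that only the seed
URL ever made it in:

$ bin/nutch readdb /usr/tmp/lee_apollo/crawldb -stats
$ bin/nutch readseg -list /usr/tmp/lee_apollo/segments/20070628155854

Presumably without the feed plugin the RSS page yields no outlinks,
which would explain the "0 records selected" message at depth 2.)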


----- Original Message ----
From: Doğacan Güney <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Wednesday, June 27, 2007 10:59:52 PM
Subject: Re: Possibly use a different library to parse RSS feed for improved 
performance and compatibility

On 6/28/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
> I am choosing to use NUTCH-444 for my RSS functionality.  Doğacan commented 
> on how to do this; he wrote:
>     ...if you need the functionality of NUTCH-444, I would suggest
>     trying a nightly version of Nutch, because NUTCH-444 by itself is not
>     enough. You also need two patches from NUTCH-443 and probably
>     NUTCH-504.
>
> I have a couple of newbie questions about the mechanics of installing this.
>
> Prefatory comments: I have already installed another patch (for NUTCH-505) so 
> I think I already have a nightly build (I'm guessing trunk==nightly?).  These 
> were the steps I took:
> $ svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
> $ cd nutch
> $ wget https://issues.apache.org/jira/secure/attachment/12360411/NUTCH-505_draft_v2.patch
> $ patch -p0 < NUTCH-505_draft_v2.patch
> $ ant clean && ant
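>
> (It might be safer to dry-run each patch first to see whether it still
> applies cleanly to the checkout, e.g.:
>
> $ patch -p0 --dry-run < NUTCH-505_draft_v2.patch
>
> and only apply it for real when that reports no failed hunks.)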
>
> ---
>
> Now I need NUTCH-443, NUTCH-504, and NUTCH-444.  Here's my guess:
>
> $ cd nutch
>
> $ wget http://issues.apache.org/jira/secure/attachment/12359953/NUTCH_443_reopened_v3.patch
> $ patch -p0 < NUTCH_443_reopened_v3.patch
> $ wget http://issues.apache.org/jira/secure/attachment/12350644/parse-map-core-draft-v1.patch
> $ patch -p0 < parse-map-core-draft-v1.patch
> $ wget http://issues.apache.org/jira/secure/attachment/12350634/parse-map-core-untested.patch
> $ patch -p0 < parse-map-core-untested.patch
> $ wget http://issues.apache.org/jira/secure/attachment/12357183/redirect_and_index.patch
> $ patch -p0 < redirect_and_index.patch
> $ wget http://issues.apache.org/jira/secure/attachment/12357300/redirect_and_index_v2.patch
> $ patch -p0 < redirect_and_index_v2.patch
>
> I'm really guessing on the above ... continuing:
>
> $ wget http://issues.apache.org/jira/secure/attachment/12360361/NUTCH-504_v2.patch
> $ patch -p0 < NUTCH-504_v2.patch
> $ wget http://issues.apache.org/jira/secure/attachment/12360348/parse_in_fetchers.patch
> $ patch -p0 < parse_in_fetchers.patch
>
> ... that felt like less of a guess, but now:
>
> $ wget http://issues.apache.org/jira/secure/attachment/12357192/NUTCH-444.patch
> $ patch -p0 < NUTCH-444.patch
> $ wget http://issues.apache.org/jira/secure/attachment/12350820/parse-feed.tar.bz2
> $ tar xjvf parse-feed.tar.bz2
>
> what do I do with this newly created parse-feed directory?
>
> so then I would do:
>
> $ ant clean && ant
>
>
> Wait a minute:  do I have this whole thing wrong?  Maybe Doğacan means that 
> the nightly builds ALREADY contain NUTCH-443 and NUTCH-504 so that I would do 
> this:
>
>
> $ wget http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/lastStableBuild/artifact/trunk/build/nutch-2007-06-27_06-52-44.tar.gz
> $ tar xvzf nutch-2007-06-27_06-52-44.tar.gz
> $ cd nutch-2007-06-27_06-52-44
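>
> To confirm the feed plugin really is in that build before going any
> further, presumably:
>
> $ ls plugins | grep feed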
>
> then this business:
>
> $ wget http://issues.apache.org/jira/secure/attachment/12357192/NUTCH-444.patch
> $ patch -p0 < NUTCH-444.patch
> $ wget http://issues.apache.org/jira/secure/attachment/12350820/parse-feed.tar.bz2
> $ tar xjvf parse-feed.tar.bz2
>
> what do I do with this newly created parse-feed directory?
>
> so then I would do:
>
> $ ant clean && ant
>
> I guess this is why "release engineer" is a job in and of itself!
> Please advise.

If you downloaded the nightly build of 27th June, it already contains
the feed plugin (the plugin is called "feed", not "parse-feed";
parse-feed was an older plugin that was never committed. In my earlier
comment I meant to write parse-rss but wrote parse-feed). So you don't
have to apply any patches or anything: just download a recent nightly
build and you are good to go :).

You can also check out trunk from svn; that will work too.
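
For example (assuming the standard src/plugin layout in trunk), you can
confirm the plugin is present with:

$ svn ls http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/plugin | grep feed

or just look for a plugins/feed directory in the unpacked nightly.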

>
> --Kai Middleton
>
> ----- Original Message ----
> From: Doğacan Güney <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Friday, June 22, 2007 1:39:12 AM
> Subject: Re: Possibly use a different library to parse RSS feed for improved 
> performance and compatibility
>
> On 6/21/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
> > I am a new Nutch user and the ability to crawl RSS feeds is critical to my
> > mission.  Do I understand from this (lengthy) discussion that, in order to
> > get the new RSS support, I need to either a) download one of the nightly
> > builds and run ant or b) download and apply a patch (NUTCH-444.patch, I gather)?
>
> Nutch 0.9 can already parse RSS feeds (via the parse-feed plugin).
> However, if you need the functionality of NUTCH-444, I would suggest
> trying a nightly version of Nutch, because NUTCH-444 by itself is not
> enough. You also need two patches from NUTCH-443 and probably
> NUTCH-504. If you are worried about stability, nightlies of Nutch are
> generally pretty stable.
>
> --
> Doğacan Güney
>


-- 
Doğacan Güney