kauu wrote:
> I've changed the code as you said, but I get an exception like this.
> Why, why is the exception from the MD5Signature class?
Actually, I think it's a NullPointerException in
ParseOutputFormat.java:121...
I would suggest you try the approach that Doug and Doğacan are
discussing; it seems much faster and cleaner.
My problem was that many blogs do not publish full-text feeds and I need
to actually fetch the blog-post page, and match its DOM against the feed
text.
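If it helps, here is a minimal, self-contained sketch of what I suspect is happening at ParseOutputFormat.java:121. The name "entryContents" and the guard are assumptions based on the patch quoted below: outlinks created through the old two-argument constructor carry a null entryContents, and calling length() on it throws the NPE.

```java
// Self-contained sketch; the guarded check is an assumption about what
// the patched ParseOutputFormat.java:121 should look like.
public class EntryContentsGuard {

    // Outlinks created via the old Outlink(toUrl, anchor, conf)
    // constructor never set entryContents, so it arrives here as null.
    static boolean isFeedEntry(String entryContents) {
        // Buggy form: entryContents.length() > 0  -> NullPointerException
        return entryContents != null && entryContents.length() > 0;
    }

    public static void main(String[] args) {
        System.out.println(isFeedEntry(null));               // false instead of an NPE
        System.out.println(isFeedEntry("<item>...</item>")); // true: treat as feed entry
    }
}
```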
HTH,
Renaud
>
> 2007-02-05 11:28:38,453 WARN feedparser.FeedFilter (FeedFilter.java:doDecodeEntities(223)) - Filter encountered unknown entities
> 2007-02-05 11:28:39,390 INFO crawl.SignatureFactory (SignatureFactory.java:getSignature(45)) - Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 2007-02-05 11:28:40,078 WARN mapred.LocalJobRunner (LocalJobRunner.java:run(120)) - job_f6j55m
> java.lang.NullPointerException
>     at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:121)
>     at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:87)
>     at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:235)
>     at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:247)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112)
>
>
> On 2/3/07, *Renaud Richardet* < [EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>> wrote:
>
> Gal, Chris, Kauu,
>
> So, if I understand correctly, you need a way to pass information along
> with the fetches, so that when Nutch fetches a feed entry, the <item>
> value from the previously fetched feed is available.
>
> This is how I tackled the issue:
> - extend Outlink.java to allow creating outlinks with additional
> metadata, and use that constructor in your feed parser
> - pass the metadata on through ParseOutputFormat.java and Fetcher.java
> - retrieve the metadata in HtmlParser.java and use it
>
> This is quite tedious: it bloats your outlinks db, requires changes to
> Nutch's core code, etc. But it is the only way I came up with...
> If someone sees a better way, please let me know :-)
>
> Sample code, for Nutch 0.8.x :
>
> Outlink.java
> + public Outlink(String toUrl, String anchor, String entryContents,
> +     Configuration conf) throws MalformedURLException {
> +   this.toUrl = new UrlNormalizerFactory(conf).getNormalizer().normalize(toUrl);
> +   this.anchor = anchor;
> +   this.entryContents = entryContents;
> + }
> and update the other methods
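> For "the other methods", the critical pair is the Writable read/write. A sketch of the idea, with plain java.io streams standing in for Hadoop's Writable/UTF8 types (an assumption made to keep it self-contained; the point is that readFields must mirror write field-for-field):

```java
import java.io.*;

// Stand-in for Outlink showing how the new field rides along in
// serialization; DataOutput/DataInput replace the Hadoop types here.
public class OutlinkSketch {
    String toUrl, anchor, entryContents;

    public void write(DataOutput out) throws IOException {
        out.writeUTF(toUrl);
        out.writeUTF(anchor);
        // new field last, never null on the wire
        out.writeUTF(entryContents == null ? "" : entryContents);
    }

    public void readFields(DataInput in) throws IOException {
        toUrl = in.readUTF();
        anchor = in.readUTF();
        entryContents = in.readUTF(); // must mirror write() exactly
    }
}
```

> If the read and write orders drift apart, every outlink in the db deserializes garbage, so it is worth a round-trip test.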
>
> ParseOutputFormat.java, around line 140
> + // set outlink info in metadata
> + String entryContents = links[i].getEntryContents();
> + if (entryContents.length() > 0) { // it's a feed entry
> +   MapWritable meta = new MapWritable();
> +   meta.put(new UTF8("entryContents"), new UTF8(entryContents)); // key/value
> +   target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
> +   target.setMetaData(meta);
> + } else {
> +   target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval); // no meta
> + }
>
> Fetcher.java, around line 266
> + // add feed info to metadata
> + try {
> +   String entryContents = datum.getMetaData().get(new UTF8("entryContents")).toString();
> +   metadata.set("entryContents", entryContents);
> + } catch (Exception e) { } // not found
>
> HtmlParser.java
> // get entry metadata
> String entryContents = content.getMetadata().get("entryContents");
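> End to end, the hand-off is just a keyed map travelling with the outlink. A sketch with HashMap standing in for Nutch's MapWritable/Metadata (the key name "entryContents" matches the patches above; everything else is illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative round trip of the feed-entry metadata hand-off.
public class FeedMetadataHandoff {

    // ParseOutputFormat side: attach the <item> text to the outlink's datum.
    static Map<String, String> toDatum(String entryContents) {
        Map<String, String> meta = new HashMap<>();
        if (entryContents != null && entryContents.length() > 0) {
            meta.put("entryContents", entryContents);
        }
        return meta;
    }

    // HtmlParser side: read it back; plain (non-feed) outlinks yield null.
    static String fromDatum(Map<String, String> meta) {
        return meta.get("entryContents");
    }
}
```

> The consumer must tolerate a missing key, since most outlinks are not feed entries.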
>
> HTH,
> Renaud
>
>
>
> Gal Nitzan wrote:
> > Hi Chris,
> >
> > I'm sorry I wasn't clear enough. What I mean is that in the current
> > implementation:
> >
> > 1. The RSS (channels, items) page ends up as one Lucene document in
> > the index.
> > 2. Indeed the links are extracted, and each <item> link will be
> > fetched in the next fetch as a separate page and will end up as one
> > Lucene document.
> >
> > IMHO the data that is needed, i.e. the data that will be fetched in
> > the next fetch process, is already available in the <item> element.
> > Each <item> element represents one web resource, so there is no
> > reason to go to the server and re-fetch that resource.
> >
> > Another issue that arises with RSS feeds is that once the feed page
> > is fetched, you cannot re-fetch it until its "time to fetch" has
> > expired. A feed's TTL is usually very short. Since, for now, all
> > pages in Nutch are created equal :), it is one more thing to think
> > about.
> >
> > HTH,
> >
> > Gal.
> >
> > -----Original Message-----
> > From: Chris Mattmann [mailto:[EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>]
> > Sent: Thursday, February 01, 2007 7:01 PM
> > To: [email protected] <mailto:[email protected]>
> > Subject: Re: RSS-fecter and index individul-how can i realize
> this function
> >
> > Hi Gal, et al.,
> >
> > I'd like to be explicit when we talk about what the issue with
> the RSS
> > parsing plugin is here; I think we have had conversations
> similar to this
> > before and it seems that we keep talking around each other. I'd
> like to get
> > to the heart of this matter so that the issue (if there is an
> actual one)
> > gets addressed ;)
> >
> > Okay, so you mention below that the thing that you see missing
> from the
> > current RSS parsing plugin is the ability to store data in the
> CrawlDatum,
> > and parse "it" in the next fetch phase. Well, there are 2
> options here for
> > what you refer to as "it":
> >
> > 1. If you're talking about the RSS file, then in fact, it is
> parsed, and
> > its data is stored in the CrawlDatum, akin to any other form of
> content that
> > is fetched, parsed and indexed.
> >
> > 2. If you're talking about the item links within the RSS file,
> in fact,
> > they are parsed (eventually), and their data stored in the
> CrawlDatum, akin
> > to any other form of content that is fetched, parsed, and
> indexed. This is
> > accomplished by adding the RSS items as Outlinks when the RSS
> file is
> > parsed: in this fashion, we go after all of the links in the RSS
> file, and
> > make sure that we index their content as well.
> >
> > Thus, if you had an RSS file R that contained links in it to a
> PDF file A,
> > and another HTML page P, then not only would R get fetched,
> parsed, and
> > indexed, but so would A and P, because they are item links
> within R. Then
> > queries that would match R (the physical RSS file), would
> additionally match
> > things such as P and A, and all 3 would be capable of being
> returned in a
> > Nutch query. Does this make sense? Is this the issue that you're
> talking
> > about? Am I nuts? ;)
> >
> > Cheers,
> > Chris
> >
> >
> >
> >
> > On 1/31/07 10:40 PM, "Gal Nitzan" < [EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>> wrote:
> >
> >
> >> Hi,
> >>
> >> Many sites provide RSS feeds for several reasons, usually to
> save bandwidth,
> >> to give the users concentrated data and so forth.
> >>
> >> Some of the RSS files supplied by sites are created especially for
> >> search engines, where each RSS "item" represents a web page in the
> >> site.
> >>
> >> IMHO the only thing "missing" in the parse-rss plugin is storing
> >> the data in the CrawlDatum and "parsing" it in the next fetch
> >> phase. Maybe add a new flag to CrawlDatum that would mark the URL
> >> as "parsable", not "fetchable"?
> >>
> >> Just my two cents...
> >>
> >> Gal.
> >>
> >> -----Original Message-----
> >> From: Chris Mattmann [mailto:[EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>]
> >> Sent: Wednesday, January 31, 2007 8:44 AM
> >> To: [email protected]
> <mailto:[email protected]>
> >> Subject: Re: RSS-fecter and index individul-how can i realize
> this function
> >>
> >> Hi there,
> >>
> >> With the explanation that you give below, it seems like
> parse-rss as it
> >> exists would address what you are trying to do. parse-rss
> parses an RSS
> >> channel as a set of items, and indexes overall metadata about
> the RSS file,
> >> including parse text, and index data, but it also adds each
> item (in the
> >> channel)'s URL as an Outlink, so that Nutch will process those
> pieces of
> >> content as well. The only thing that you suggest below that
> >> parse-rss currently doesn't do is allow you to associate the
> >> metadata fields category: and author: with the item Outlink...
> >>
> >> Cheers,
> >> Chris
> >>
> >>
> >>
> >> On 1/30/07 7:30 PM, "kauu" <[EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>> wrote:
> >>
> >>
> >>> Thanks for your reply. Maybe I didn't explain clearly. I want to
> >>> index each item as an individual page. Then when I search for
> >>> something, for example "nutch-open source", Nutch returns a hit
> >>> which contains:
> >>>
> >>> title: nutch-open source
> >>> description: nutch nutch nutch ... nutch nutch
> >>> url: http://lucene.apache.org/nutch
> >>> category: news
> >>> author: kauu
> >>>
> >>> So, can the plugin parse-rss satisfy what I need?
> >>>
> >>> <item>
> >>>   <title>nutch--open source</title>
> >>>   <description>nutch nutch nutch ... nutch nutch</description>
> >>>   <link>http://lucene.apache.org/nutch</link>
> >>>   <category>news</category>
> >>>   <author>kauu</author>
> >>> </item>
> >>
> >> On 1/31/07, Chris Mattmann <[EMAIL PROTECTED]
> >> <mailto:[EMAIL PROTECTED]>> wrote:
> >>>
> >>> Hi there,
> >>>
> >>> I could most likely be of assistance, if you gave me some more
> >>> information. For instance: I'm wondering if the use case you
> >>> describe below is already supported by the current RSS parse
> >>> plugin?
> >>>
> >>> The current RSS parser, parse-rss, does in fact index individual
> >>> items that are pointed to by an RSS document. The items are added
> >>> as Nutch Outlinks, and added to the overall queue of URLs to
> >>> fetch. Doesn't this satisfy what you mention below? Or am I
> >>> missing something?
> >>>
> >>> Cheers,
> >>> Chris
> >>>
> >>> On 1/30/07 6:01 PM, "kauu" <[EMAIL PROTECTED]
> >>> <mailto:[EMAIL PROTECTED]>> wrote:
> >>>
> >>>
> >>>> Hi folks:
> >>>>
> >>>> What I want to do is separate an RSS file into several pages,
> >>>> just as has been discussed before. I want to fetch an RSS page
> >>>> and index it as different documents in the index, so the
> >>>> searcher can search each item's info as an individual hit.
> >>>>
> >>>> My idea is to create a protocol that fetches the RSS page and
> >>>> stores it as several pages, each containing just one ITEM tag.
> >>>> But the unique key is the url, so how can I store them with the
> >>>> ITEM's link tag as the unique key for each document?
> >>>>
> >>>> So my question is how to realize this function in Nutch 0.8.x.
> >>>> I've checked the code of the protocol-http plug-in, but I can't
> >>>> find the code where a page is stored as a document. I want to
> >>>> separate the RSS page into several pages before storing it, so
> >>>> that it becomes several documents rather than one.
> >>>>
> >>>> Can anyone give me some hints? Any reply will be appreciated!
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> ITEM's structure:
> >>>>
> >>>> <item>
> >>>>   <title>Snowstorm strikes Europe late, causing flight delays
> >>>>   and traffic chaos (photos)</title>
> >>>>   <description>A snowstorm swept across Europe, delaying many
> >>>>   flights. On January 24, several airliners waited at Stuttgart
> >>>>   airport in Germany to have ice and snow removed from their
> >>>>   fuselages, and workers cleared snow from the runways at Munich
> >>>>   airport in southern Germany. The late-arriving snowstorm
> >>>>   reportedly swept for two consecutive days across
> >>>>   ...</description>
> >>>>   <link>http://news.sohu.com/20070125/n247833568.shtml</link>
> >>>>   <category>Sohu focus photo news</category>
> >>>>   <author>[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]></author>
> >>>>   <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
> >>>>   <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
> >>>> </item>
> >>>
> >>>
> >
> > ______________________________________________
> > Chris A. Mattmann
> > [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>
> > Staff Member
> > Modeling and Data Management Systems Section (387)
> > Data Management Systems and Technologies Group
> >
> > _________________________________________________
> > Jet Propulsion Laboratory Pasadena, CA
> > Office: 171-266B Mailstop: 171-246
> > _______________________________________________________
> >
> > Disclaimer: The opinions presented within are my own and do not
> reflect
> > those of either NASA, JPL, or the California Institute of
> Technology.
> >
> >
> >
> >
> >
> >
>
>
> --
> renaud richardet +1 617 230 9112
> renaud <at> oslutions.com <http://oslutions.com>
> http://www.oslutions.com
>
>
>
>
> --
> www.babatu.com <http://www.babatu.com>
--
renaud richardet +1 617 230 9112
renaud <at> oslutions.com http://www.oslutions.com
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general