kauu wrote:
> I've changed the code as you said, but I get an exception like this.
> Why, why is the exception from the MD5Signature class?
Actually, I think it's a NullPointerException in
ParseOutputFormat.java:121...
I would suggest you try the approach that Doug and Doğacan are
discussing; it seems much faster and cleaner.
My problem was that many blogs do not publish full-text feeds and I need
to actually fetch the blog-post page, and match its DOM against the feed
text.
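If it helps, here is a minimal, self-contained sketch of what I suspect is happening at ParseOutputFormat.java:121. The name "entryContents" and the guard are assumptions based on the patch quoted below: outlinks created through the old two-argument constructor carry a null entryContents, and calling length() on it throws the NPE.

```java
// Self-contained sketch; the guarded check is an assumption about what
// the patched ParseOutputFormat.java:121 should look like.
public class EntryContentsGuard {

    // Outlinks created via the old Outlink(toUrl, anchor, conf)
    // constructor never set entryContents, so it arrives here as null.
    static boolean isFeedEntry(String entryContents) {
        // Buggy form: entryContents.length() > 0  -> NullPointerException
        return entryContents != null && entryContents.length() > 0;
    }

    public static void main(String[] args) {
        System.out.println(isFeedEntry(null));               // false instead of an NPE
        System.out.println(isFeedEntry("<item>...</item>")); // true: treat as feed entry
    }
}
```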
HTH,
Renaud
>
> 2007-02-05 11:28:38,453 WARN feedparser.FeedFilter (FeedFilter.java:doDecodeEntities(223)) - Filter encountered unknown entities
> 2007-02-05 11:28:39,390 INFO crawl.SignatureFactory (SignatureFactory.java:getSignature(45)) - Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 2007-02-05 11:28:40,078 WARN mapred.LocalJobRunner (LocalJobRunner.java:run(120)) - job_f6j55m
> java.lang.NullPointerException
>     at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:121)
>     at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:87)
>     at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:235)
>     at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:247)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112)
>
>
> On 2/3/07, *Renaud Richardet* < [EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>> wrote:
>
> Gal, Chris, Kauu,
>
> So, if I understand correctly, you need a way to pass information along
> with the fetches, so that when Nutch fetches a feed entry, the <item>
> value from the previously fetched feed is available.
>
> This is how I tackled the issue:
> - extend Outlink.java to allow creating outlinks with additional
> metadata, and use that constructor in your feed parser
> - pass the metadata on through ParseOutputFormat.java and Fetcher.java
> - retrieve the metadata in HtmlParser.java and use it
>
> This is quite tedious: it bloats your outlinks db, requires changes to
> Nutch's core code, etc. But it is the only way I came up with...
> If someone sees a better way, please let me know :-)
>
> Sample code, for Nutch 0.8.x :
>
> Outlink.java
> + public Outlink(String toUrl, String anchor, String entryContents,
> +     Configuration conf) throws MalformedURLException {
> +   this.toUrl = new UrlNormalizerFactory(conf).getNormalizer().normalize(toUrl);
> +   this.anchor = anchor;
> +   this.entryContents = entryContents;
> + }
> and update the other methods
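> For "the other methods", the critical pair is the Writable read/write. A sketch of the idea, with plain java.io streams standing in for Hadoop's Writable/UTF8 types (an assumption made to keep it self-contained; the point is that readFields must mirror write field-for-field):

```java
import java.io.*;

// Stand-in for Outlink showing how the new field rides along in
// serialization; DataOutput/DataInput replace the Hadoop types here.
public class OutlinkSketch {
    String toUrl, anchor, entryContents;

    public void write(DataOutput out) throws IOException {
        out.writeUTF(toUrl);
        out.writeUTF(anchor);
        // new field last, never null on the wire
        out.writeUTF(entryContents == null ? "" : entryContents);
    }

    public void readFields(DataInput in) throws IOException {
        toUrl = in.readUTF();
        anchor = in.readUTF();
        entryContents = in.readUTF(); // must mirror write() exactly
    }
}
```

> If the read and write orders drift apart, every outlink in the db deserializes garbage, so it is worth a round-trip test.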
>
> ParseOutputFormat.java, around line 140
> + // set outlink info in metadata
> + String entryContents = links[i].getEntryContents();
> + if (entryContents.length() > 0) { // it's a feed entry
> +   MapWritable meta = new MapWritable();
> +   meta.put(new UTF8("entryContents"), new UTF8(entryContents)); // key/value
> +   target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
> +   target.setMetaData(meta);
> + } else {
> +   target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval); // no meta
> + }
>
> Fetcher.java, around line 266
> + // add feed info to metadata
> + try {
> +   String entryContents = datum.getMetaData().get(new UTF8("entryContents")).toString();
> +   metadata.set("entryContents", entryContents);
> + } catch (Exception e) { } // not found
>
> HtmlParser.java
> // get entry metadata
> String entryContents = content.getMetadata().get("entryContents");
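> End to end, the hand-off is just a keyed map travelling with the outlink. A sketch with HashMap standing in for Nutch's MapWritable/Metadata (the key name "entryContents" matches the patches above; everything else is illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative round trip of the feed-entry metadata hand-off.
public class FeedMetadataHandoff {

    // ParseOutputFormat side: attach the <item> text to the outlink's datum.
    static Map<String, String> toDatum(String entryContents) {
        Map<String, String> meta = new HashMap<>();
        if (entryContents != null && entryContents.length() > 0) {
            meta.put("entryContents", entryContents);
        }
        return meta;
    }

    // HtmlParser side: read it back; plain (non-feed) outlinks yield null.
    static String fromDatum(Map<String, String> meta) {
        return meta.get("entryContents");
    }
}
```

> The consumer must tolerate a missing key, since most outlinks are not feed entries.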
>
> HTH,
> Renaud
>
>
>
> Gal Nitzan wrote:
> > Hi Chris,
> >
> > I'm sorry I wasn't clear enough. What I mean is that in the current
> > implementation:
> >
> > 1. The RSS (channels, items) page ends up as one Lucene document in
> > the index.
> > 2. Indeed the links are extracted, and each <item> link will be
> > fetched in the next fetch as a separate page and will end up as one
> > Lucene document.
> >
> > IMHO the data that is needed, i.e. the data that will be fetched in
> > the next fetch process, is already available in the <item> element.
> > Each <item> element represents one web resource, so there is no
> > reason to go to the server and re-fetch that resource.
> >
> > Another issue that arises with RSS feeds is that once the feed page
> > is fetched, you cannot re-fetch it until its "time to fetch" has
> > expired. A feed's TTL is usually very short. Since, for now, all
> > pages in Nutch are created equal :), it is one more thing to think
> > about.
> >
> > HTH,
> >
> > Gal.
> >
> > -----Original Message-----
> > From: Chris Mattmann [mailto:[EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>]
> > Sent: Thursday, February 01, 2007 7:01 PM
> > To: [email protected] <mailto:[email protected]>
> > Subject: Re: RSS-fecter and index individul-how can i realize
> this function
> >
> > Hi Gal, et al.,
> >
> > I'd like to be explicit when we talk about what the issue with
> the RSS
> > parsing plugin is here; I think we have had conversations
> similar to this
> > before and it seems that we keep talking around each other. I'd
> like to get
> > to the heart of this matter so that the issue (if there is an
> actual one)
> > gets addressed ;)
> >
> > Okay, so you mention below that the thing that you see missing
> from the
> > current RSS parsing plugin is the ability to store data in the
> CrawlDatum,
> > and parse "it" in the next fetch phase. Well, there are 2
> options here for
> > what you refer to as "it":
> >
> > 1. If you're talking about the RSS file, then in fact, it is
> parsed, and
> > its data is stored in the CrawlDatum, akin to any other form of
> content that
> > is fetched, parsed and indexed.
> >
> > 2. If you're talking about the item links within the RSS file,
> in fact,
> > they are parsed (eventually), and their data stored in the
> CrawlDatum, akin
> > to any other form of content that is fetched, parsed, and
> indexed. This is
> > accomplished by adding the RSS items as Outlinks when the RSS
> file is
> > parsed: in this fashion, we go after all of the links in the RSS
> file, and
> > make sure that we index their content as well.
> >
> > Thus, if you had an RSS file R that contained links in it to a
> PDF file A,
> > and another HTML page P, then not only would R get fetched,
> parsed, and
> > indexed, but so would A and P, because they are item links
> within R. Then
> > queries that would match R (the physical RSS file), would
> additionally match
> > things such as P and A, and all 3 would be capable of being
> returned in a
> > Nutch query. Does this make sense? Is this the issue that you're
> talking
> > about? Am I nuts? ;)
> >
> > Cheers,
> > Chris
> >
> >
> >
> >
> > On 1/31/07 10:40 PM, "Gal Nitzan" < [EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>> wrote:
> >
> >
> >> Hi,
> >>
> >> Many sites provide RSS feeds for several reasons, usually to
> save bandwidth,
> >> to give the users concentrated data and so forth.
> >>
> >> Some of the RSS files supplied by sites are created especially for
> >> search engines, where each RSS "item" represents a web page in the
> >> site.
> >>
> >> IMHO the only thing "missing" in the parse-rss plugin is storing
> >> the data in the CrawlDatum and "parsing" it in the next fetch
> >> phase. Maybe add a new flag to CrawlDatum that would mark the URL
> >> as "parsable", not "fetchable"?
> >>
> >> Just my two cents...
> >>
> >> Gal.
> >>
> >> -----Original Message-----
> >> From: Chris Mattmann [mailto:[EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>]
> >> Sent: Wednesday, January 31, 2007 8:44 AM
> >> To: [email protected]
> <mailto:[email protected]>
> >> Subject: Re: RSS-fecter and index individul-how can i realize
> this function
> >>
> >> Hi there,
> >>
> >> With the explanation that you give below, it seems like
> parse-rss as it
> >> exists would address what you are trying to do. parse-rss
> parses an RSS
> >> channel as a set of items, and indexes overall metadata about
> the RSS file,
> >> including parse text, and index data, but it also adds each
> item (in the
> >> channel)'s URL as an Outlink, so that Nutch will process those
> pieces of
> >> content as well. The only thing that you suggest below that
> >> parse-rss currently doesn't do is allow you to associate the
> >> metadata fields category: and author: with the item Outlink...
> >>
> >> Cheers,
> >> Chris
> >>
> >>
> >>
> >> On 1/30/07 7:30 PM, "kauu" <[EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>> wrote:
> >>
> >>
> >>> Thanks for your reply. Maybe I didn't explain clearly. I want to
> >>> index each item as an individual page. Then when I search for
> >>> something, for example "nutch-open source", Nutch returns a hit
> >>> which contains:
> >>>
> >>> title: nutch-open source
> >>> description: nutch nutch nutch ... nutch nutch
> >>> url: http://lucene.apache.org/nutch
> >>> category: news
> >>> author: kauu
> >>>
> >>> So, can the plugin parse-rss satisfy what I need?
> >>>
> >>> <item>
> >>>   <title>nutch--open source</title>
> >>>   <description>nutch nutch nutch ... nutch nutch</description>
> >>>   <link>http://lucene.apache.org/nutch</link>
> >>>   <category>news</category>
> >>>   <author>kauu</author>
> >>> </item>
> >>
> >> On 1/31/07, Chris Mattmann <[EMAIL PROTECTED]
> >> <mailto:[EMAIL PROTECTED]>> wrote:
> >>>
> >>> Hi there,
> >>>
> >>> I could most likely be of assistance, if you gave me some more
> >>> information. For instance: I'm wondering if the use case you
> >>> describe below is already supported by the current RSS parse
> >>> plugin?
> >>>
> >>> The current RSS parser, parse-rss, does in fact index individual
> >>> items that are pointed to by an RSS document. The items are added
> >>> as Nutch Outlinks, and added to the overall queue of URLs to
> >>> fetch. Doesn't this satisfy what you mention below? Or am I
> >>> missing something?
> >>>
> >>> Cheers,
> >>> Chris
> >>>
> >>> On 1/30/07 6:01 PM, "kauu" <[EMAIL PROTECTED]
> >>> <mailto:[EMAIL PROTECTED]>> wrote:
> >>>
> >>>
> >>>> Hi folks:
> >>>>
> >>>> What I want to do is separate an RSS file into several pages,
> >>>> just as has been discussed before. I want to fetch an RSS page
> >>>> and index it as different documents in the index, so the
> >>>> searcher can search each item's info as an individual hit.
> >>>>
> >>>> My idea is to create a protocol that fetches the RSS page and
> >>>> stores it as several pages, each containing just one ITEM tag.
> >>>> But the unique key is the url, so how can I store them with the
> >>>> ITEM's link tag as the unique key for each document?
> >>>>
> >>>> So my question is how to realize this function in Nutch 0.8.x.
> >>>> I've checked the code of the protocol-http plug-in, but I can't
> >>>> find the code where a page is stored as a document. I want to
> >>>> separate the RSS page into several pages before storing it, so
> >>>> that it becomes several documents rather than one.
> >>>>
> >>>> Can anyone give me some hints? Any reply will be appreciated!
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> ITEM's structure:
> >>>>
> >>>> <item>
> >>>>   <title>Snowstorm strikes Europe late, causing flight delays
> >>>>   and traffic chaos (photos)</title>
> >>>>   <description>A snowstorm swept across Europe, delaying many
> >>>>   flights. On January 24, several airliners waited at Stuttgart
> >>>>   airport in Germany to have ice and snow removed from their
> >>>>   fuselages, and workers cleared snow from the runways at Munich
> >>>>   airport in southern Germany. The late-arriving snowstorm
> >>>>   reportedly swept for two consecutive days across
> >>>>   ...</description>
> >>>>   <link>http://news.sohu.com/20070125/n247833568.shtml</link>
> >>>>   <category>Sohu focus photo news</category>
> >>>>   <author>[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]></author>
> >>>>   <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
> >>>>   <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
> >>>> </item>
> >>>
> >>>
> >
> > ______________________________________________
> > Chris A. Mattmann
> > [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>
> > Staff Member
> > Modeling and Data Management Systems Section (387)
> > Data Management Systems and Technologies Group
> >
> > _________________________________________________
> > Jet Propulsion Laboratory Pasadena, CA
> > Office: 171-266B Mailstop: 171-246
> > _______________________________________________________
> >
> > Disclaimer: The opinions presented within are my own and do not
> reflect
> > those of either NASA, JPL, or the California Institute of
> Technology.
> >
> >
> >
> >
> >
> >
>
>
> --
> renaud richardet +1 617 230 9112
> renaud <at> oslutions.com <http://oslutions.com>
> http://www.oslutions.com
>
>
>
>
> --
> www.babatu.com <http://www.babatu.com>
--
renaud richardet +1 617 230 9112
renaud <at> oslutions.com http://www.oslutions.com
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general