kauu wrote:
> I've changed the code as you suggested, but I get an exception like
> this. Why is the exception coming from the MD5Signature class?

Actually, I think it's a NullPointerException at ParseOutputFormat.java:121...
I would suggest you try the approach that Doug and Doğacan are
discussing; it seems much faster and cleaner. My problem was that many
blogs do not publish full-text feeds, so I need to actually fetch the
blog-post page and match its DOM against the feed text.

HTH,
Renaud

> 2007-02-05 11:28:38,453 WARN  feedparser.FeedFilter
>   (FeedFilter.java:doDecodeEntities(223)) - Filter encountered unknown entities
> 2007-02-05 11:28:39,390 INFO  crawl.SignatureFactory
>   (SignatureFactory.java:getSignature(45)) - Using Signature impl:
>   org.apache.nutch.crawl.MD5Signature
> 2007-02-05 11:28:40,078 WARN  mapred.LocalJobRunner
>   (LocalJobRunner.java:run(120)) - job_f6j55m
> java.lang.NullPointerException
>   at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:121)
>   at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:87)
>   at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:235)
>   at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:247)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112)
>
> On 2/3/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:
>
> > Gal, Chris, Kauu,
> >
> > So, if I understand correctly, you need a way to pass information
> > along the fetches, so that when Nutch fetches a feed entry, the
> > <item> value fetched previously is available.
> >
> > This is how I tackled the issue:
> > - extend Outlink.java to allow creating outlinks with more metadata,
> >   and use this in your feed parser to create the outlinks
> > - pass the metadata on through ParseOutputFormat.java and Fetcher.java
> > - retrieve the metadata in HtmlParser.java and use it
> >
> > This is very tedious, it will blow up the size of your outlinks db,
> > it requires changes in the core Nutch code, etc. But it is the only
> > way I came up with...
> > If someone sees a better way, please let me know :-)
> >
> > Sample code, for Nutch 0.8.x:
> >
> > Outlink.java
> > +  public Outlink(String toUrl, String anchor, String entryContents,
> > +      Configuration conf) throws MalformedURLException {
> > +    this.toUrl = new UrlNormalizerFactory(conf).getNormalizer().normalize(toUrl);
> > +    this.anchor = anchor;
> > +    this.entryContents = entryContents;
> > +  }
> > and update the other methods
> >
> > ParseOutputFormat.java, around line 140
> > +  // set outlink info in metadata
> > +  String entryContents = links[i].getEntryContents();
> > +  if (entryContents.length() > 0) { // it's a feed entry
> > +    MapWritable meta = new MapWritable();
> > +    meta.put(new UTF8("entryContents"), new UTF8(entryContents)); // key/value
> > +    target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
> > +    target.setMetaData(meta);
> > +  } else {
> > +    target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval); // no meta
> > +  }
> >
> > Fetcher.java, around line 266
> > +  // add feed info to metadata
> > +  try {
> > +    String entryContents =
> > +        datum.getMetaData().get(new UTF8("entryContents")).toString();
> > +    metadata.set("entryContents", entryContents);
> > +  } catch (Exception e) { } // not found
> >
> > HtmlParser.java
> >   // get entry metadata
> >   String entryContents = content.getMetadata().get("entryContents");
> >
> > HTH,
> > Renaud
> >
> > Gal Nitzan wrote:
> > > Hi Chris,
> > >
> > > I'm sorry I wasn't clear enough. What I mean is that in the current
> > > implementation:
> > >
> > > 1. The RSS (channels, items) page ends up as one Lucene document in
> > >    the index.
> > > 2. Indeed, the links are extracted, and each <item> link will be
> > >    fetched in the next fetch as a separate page and will end up as
> > >    one Lucene document.
> > >
> > > IMHO the data that is needed, i.e. the data that will be fetched in
> > > the next fetch process, is already available in the <item> element.
> > > Each <item> element represents one web resource, and there is no
> > > reason to go to the server and re-fetch that resource.
> > > Another issue that arises from RSS feeds is that once the feed
> > > page is fetched, you cannot re-fetch it until its "time to fetch"
> > > has expired, and a feed's TTL is usually very short. Since, for
> > > now, all pages in Nutch are created equal :) it is one more thing
> > > to think about.
> > >
> > > HTH,
> > >
> > > Gal.
> > >
> > > -----Original Message-----
> > > From: Chris Mattmann [mailto:[EMAIL PROTECTED]]
> > > Sent: Thursday, February 01, 2007 7:01 PM
> > > To: [email protected]
> > > Subject: Re: RSS-fecter and index individul-how can i realize this function
> > >
> > > Hi Gal, et al.,
> > >
> > > I'd like to be explicit when we talk about what the issue with the
> > > RSS parsing plugin is here; I think we have had conversations
> > > similar to this before, and it seems that we keep talking around
> > > each other. I'd like to get to the heart of this matter so that the
> > > issue (if there is an actual one) gets addressed ;)
> > >
> > > Okay, so you mention below that the thing you see missing from the
> > > current RSS parsing plugin is the ability to store data in the
> > > CrawlDatum and parse "it" in the next fetch phase. Well, there are
> > > two options here for what you refer to as "it":
> > >
> > > 1. If you're talking about the RSS file, then in fact it is parsed,
> > >    and its data is stored in the CrawlDatum, akin to any other form
> > >    of content that is fetched, parsed, and indexed.
> > >
> > > 2. If you're talking about the item links within the RSS file, in
> > >    fact they are parsed (eventually), and their data stored in the
> > >    CrawlDatum, akin to any other form of content that is fetched,
> > >    parsed, and indexed. This is accomplished by adding the RSS
> > >    items as Outlinks when the RSS file is parsed: in this fashion,
> > >    we go after all of the links in the RSS file and make sure that
> > >    we index their content as well.
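[The Outlink mechanism described above can be sketched in a
self-contained way. The snippet below is illustrative only, not the
actual parse-rss code: it pulls each <item>'s <link> out of a feed
with the JDK's built-in DOM parser, which is roughly what the plugin
does before handing the links to Nutch as Outlinks. The URLs are made
up for the example.]

    import java.io.ByteArrayInputStream;
    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class RssOutlinkSketch {

        /** Collect the <link> of every <item> in an RSS channel. */
        static List<String> extractItemLinks(String rssXml) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rssXml.getBytes("UTF-8")));
            List<String> links = new ArrayList<String>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                NodeList link = item.getElementsByTagName("link");
                if (link.getLength() > 0) {
                    links.add(link.item(0).getTextContent().trim());
                }
            }
            return links;
        }

        public static void main(String[] args) throws Exception {
            String rss = "<rss><channel>"
                    + "<item><title>A</title><link>http://example.com/a.pdf</link></item>"
                    + "<item><title>P</title><link>http://example.com/p.html</link></item>"
                    + "</channel></rss>";
            // Each extracted link would become an Outlink, queued for a later fetch.
            System.out.println(extractItemLinks(rss));
            // prints: [http://example.com/a.pdf, http://example.com/p.html]
        }
    }

In Nutch each of those URLs then goes through the normal
fetch/parse/index cycle, which is why queries can match A and P as
well as the feed R itself.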
> > > Thus, if you had an RSS file R that contained links in it to a
> > > PDF file A and another HTML page P, then not only would R get
> > > fetched, parsed, and indexed, but so would A and P, because they
> > > are item links within R. Then queries that would match R (the
> > > physical RSS file) would additionally match things such as P and
> > > A, and all three would be capable of being returned in a Nutch
> > > query. Does this make sense? Is this the issue that you're talking
> > > about? Am I nuts? ;)
> > >
> > > Cheers,
> > > Chris
> > >
> > > On 1/31/07 10:40 PM, "Gal Nitzan" <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi,
> > > >
> > > > Many sites provide RSS feeds for several reasons: usually to
> > > > save bandwidth, to give the users concentrated data, and so
> > > > forth.
> > > >
> > > > Some of the RSS files supplied by sites are created specially
> > > > for search engines, where each RSS "item" represents a web page
> > > > in the site.
> > > >
> > > > IMHO the only thing "missing" in the parse-rss plugin is storing
> > > > the data in the CrawlDatum and "parsing" it in the next fetch
> > > > phase. Maybe add a new flag to CrawlDatum that would mark the
> > > > URL as "parsable", not "fetchable"?
> > > >
> > > > Just my two cents...
> > > >
> > > > Gal.
> > > >
> > > > -----Original Message-----
> > > > From: Chris Mattmann [mailto:[EMAIL PROTECTED]]
> > > > Sent: Wednesday, January 31, 2007 8:44 AM
> > > > To: [email protected]
> > > > Subject: Re: RSS-fecter and index individul-how can i realize this function
> > > >
> > > > Hi there,
> > > >
> > > > With the explanation that you give below, it seems like
> > > > parse-rss as it exists would address what you are trying to do.
> > > > parse-rss parses an RSS channel as a set of items and indexes
> > > > overall metadata about the RSS file, including parse text and
> > > > index data, but it also adds each item (in the channel)'s URL as
> > > > an Outlink, so that Nutch will process those pieces of content
> > > > as well. The only thing that you suggest below that parse-rss
> > > > currently doesn't do is to allow you to associate the metadata
> > > > fields category: and author: with the item Outlink...
> > > >
> > > > Cheers,
> > > > Chris
> > > >
> > > > On 1/30/07 7:30 PM, "kauu" <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Thanks for your reply. Maybe I didn't explain it clearly: I
> > > > > want to index each item as an individual page, so that when I
> > > > > search for something, for example "nutch open source", Nutch
> > > > > returns a hit which contains
> > > > >
> > > > >   title:       nutch--open source
> > > > >   description: nutch nutch nutch ... nutch nutch
> > > > >   url:         http://lucene.apache.org/nutch
> > > > >   category:    news
> > > > >   author:      kauu
> > > > >
> > > > > So, can the plugin parse-rss satisfy what I need?
> > > > >
> > > > >   <item>
> > > > >     <title>nutch--open source</title>
> > > > >     <description>nutch nutch nutch ... nutch nutch</description>
> > > > >     <link>http://lucene.apache.org/nutch</link>
> > > > >     <category>news</category>
> > > > >     <author>kauu</author>
> > > > >   </item>
> > > > >
> > > > > On 1/31/07, Chris Mattmann <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > Hi there,
> > > > > >
> > > > > > I could most likely be of assistance if you gave me some
> > > > > > more information. For instance: I'm wondering if the use
> > > > > > case you describe below is already supported by the current
> > > > > > RSS parse plugin?
> > > > > > The current RSS parser, parse-rss, does in fact index
> > > > > > individual items that are pointed to by an RSS document. The
> > > > > > items are added as Nutch Outlinks, and added to the overall
> > > > > > queue of URLs to fetch. Doesn't this satisfy what you
> > > > > > mention below? Or am I missing something?
> > > > > >
> > > > > > Cheers,
> > > > > > Chris
> > > > > >
> > > > > > On 1/30/07 6:01 PM, "kauu" <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > > Hi folks:
> > > > > > >
> > > > > > > What I want to do is to separate an RSS file into several
> > > > > > > pages, just as has been discussed before. I want to fetch
> > > > > > > an RSS page and index it as different documents in the
> > > > > > > index, so the searcher can find each item's info as an
> > > > > > > individual hit.
> > > > > > >
> > > > > > > My idea is to create a protocol for fetching the RSS page
> > > > > > > and storing it as several pages, each containing just one
> > > > > > > ITEM tag. But the unique key is the URL, so how can I
> > > > > > > store them with the ITEM's link tag as the unique key for
> > > > > > > each document?
> > > > > > >
> > > > > > > So my question is how to realize this function in Nutch
> > > > > > > 0.8.x. I've checked the code of the protocol-http plug-in,
> > > > > > > but I can't find the code where a page is stored to a
> > > > > > > document. I want to separate the RSS page into several
> > > > > > > pages before it is stored as one document.
> > > > > > >
> > > > > > > Can anyone give me some hints? Any reply will be
> > > > > > > appreciated!
> > > > > > > ITEM's structure:
> > > > > > >
> > > > > > >   <item>
> > > > > > >     <title>European snowstorm strikes late, causing flight
> > > > > > >       delays and traffic chaos (photos)</title>
> > > > > > >     <description>A snowstorm swept across Europe, delaying
> > > > > > >       many flights. On January 24, several airliners waited
> > > > > > >       at Stuttgart airport in Germany to have ice and snow
> > > > > > >       removed from their fuselages. The same day, workers
> > > > > > >       cleared snow from the runway at Munich airport in
> > > > > > >       southern Germany. According to reports, the late
> > > > > > >       snowstorm swept on for two consecutive days...
> > > > > > >     </description>
> > > > > > >     <link>http://news.sohu.com/20070125/n247833568.shtml</link>
> > > > > > >     <category>Sohu focus photo news</category>
> > > > > > >     <author>[EMAIL PROTECTED]</author>
> > > > > > >     <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
> > > > > > >     <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
> > > > > > >   </item>
> > >
> > > ______________________________________________
> > > Chris A. Mattmann
> > > [EMAIL PROTECTED]
> > > Staff Member
> > > Modeling and Data Management Systems Section (387)
> > > Data Management Systems and Technologies Group
> > >
> > > _________________________________________________
> > > Jet Propulsion Laboratory            Pasadena, CA
> > > Office: 171-266B                 Mailstop: 171-246
> > > _______________________________________________________
> > >
> > > Disclaimer: The opinions presented within are my own and do not
> > > reflect those of either NASA, JPL, or the California Institute of
> > > Technology.
> >
> > --
> > renaud richardet  +1 617 230 9112
> > renaud <at> oslutions.com
> > http://www.oslutions.com
>
> --
> www.babatu.com

--
renaud richardet  +1 617 230 9112
renaud <at> oslutions.com
http://www.oslutions.com
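[The metadata hand-off Renaud describes earlier in the thread
(carrying a feed entry's contents along with its outlink so a later
parse step can use it) can be illustrated with self-contained
stand-ins. The classes below are simplified sketches, not the real
Nutch Outlink/CrawlDatum/MapWritable API; only the round-trip pattern
is the point.]

    import java.util.HashMap;
    import java.util.Map;

    public class MetadataHandoffSketch {

        /** Stand-in for a Nutch Outlink extended with feed-entry contents. */
        static class FeedOutlink {
            final String toUrl;
            final String anchor;
            final String entryContents; // the <item> text carried along

            FeedOutlink(String toUrl, String anchor, String entryContents) {
                this.toUrl = toUrl;
                this.anchor = anchor;
                this.entryContents = entryContents;
            }
        }

        /** Stand-in for the CrawlDatum metadata map (MapWritable in Nutch). */
        static class Datum {
            final Map<String, String> meta = new HashMap<String, String>();
        }

        /** ParseOutputFormat side: stash the entry contents, if any, on the datum. */
        static Datum toDatum(FeedOutlink link) {
            Datum d = new Datum();
            if (link.entryContents != null && link.entryContents.length() > 0) {
                d.meta.put("entryContents", link.entryContents); // it's a feed entry
            }
            return d;
        }

        /** Fetcher/parser side: retrieve the stashed contents at parse time. */
        static String entryContents(Datum d) {
            return d.meta.get("entryContents"); // null if this wasn't a feed entry
        }

        public static void main(String[] args) {
            FeedOutlink link = new FeedOutlink(
                    "http://lucene.apache.org/nutch", "nutch--open source",
                    "nutch nutch nutch ... nutch nutch");
            Datum d = toDatum(link);
            System.out.println(entryContents(d));
            // prints: nutch nutch nutch ... nutch nutch
        }
    }

The cost Renaud warns about is visible here: every outlink now drags
its entry text through the crawl db, which is why the approach bloats
the outlinks db.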
