Re: RSS-fecter and index individul-how can i realize this function
On Mon, Jan 5, 2009 at 7:00 AM, Vlad Cananau vlad...@gmail.com wrote:
> Hello,
> I'm trying to make RSSParser do something similar to FeedParser (which
> doesn't work quite right) - that is, instead of indexing the whole
> contents of the feed,

Why doesn't FeedParser work? Let's fix whatever is broken in it :D

> I want it to show individual items, with their respective title and
> proper link to the article. I realize that I could crawl one depth more,
> but I'd like to index just the feed, not the articles that go with it
> (keep the index small and the crawl fast). For each item in each RSS
> channel I do something like this (the code does not differ much from
> getParse() of RSSParser.java):
>
>   Outlink[] outlinks = new Outlink[1];
>   try {
>     outlinks[0] = new Outlink(whichLink, theRSSItem.getTitle());
>   } catch (Exception e) {
>     continue;
>   }
>   parseResult.put(
>     whichLink,
>     new ParseText(theRSSItem.getTitle() + theRSSItem.getDescription()),
>     new ParseData(
>       ParseStatus.STATUS_SUCCESS,
>       theRSSItem.getTitle(),
>       outlinks,
>       new Metadata() // was content.getMetadata()
>     )
>   );
>
> The problem is, however, that only one item from the whole RSS feed gets
> into the index, although in the log I can see them all (I've tried it
> with feeds from CNN and Reuters). What happens? Why do they get
> overwritten in a seemingly random order? The item that makes it into the
> index is neither the first nor the last, but appears to be the same until
> new items appear in the feed.
>
> Thank you,
> Vlad

--
Doğacan Güney
Re: RSS-fecter and index individul-how can i realize this function
On Mon, Jan 5, 2009 at 12:32 PM, Doğacan Güney doga...@gmail.com wrote:
> On Mon, Jan 5, 2009 at 7:00 AM, Vlad Cananau vlad...@gmail.com wrote:
> > [...]
>
> Why doesn't FeedParser work? Let's fix whatever is broken in it :D

When using FeedParser, not all of the feeds make it into the index. For example, I crawl both Entertainment and Politics, but I get results only for some of the articles. Is there any way to check whether or not entries make it into the index? In the log I see something like "Indexing http://rss.cnn.com/... with analyzer org.apache.nutch.analyzer.NutchDocumentAnalyzer" (I'm not able to check exactly right now, since I don't have access to the machine). But when I look for keywords specific to some of the documents, I don't get any results :-(
Re: RSS-fecter and index individul-how can i realize this function
Doğacan Güney wrote:
> On Mon, Jan 5, 2009 at 7:00 AM, Vlad Cananau vlad...@gmail.com wrote:
> > [...]
>
> Why doesn't FeedParser work? Let's fix whatever is broken in it :D

To show you what I mean by "only one item gets into the index", check out these results: http://tinyurl.com/7hkkoo [the link points at http://vladk2k.homeip.net:8080 - my own server]
Re: RSS-fecter and index individul-how can i realize this function
Hello,

I'm trying to make RSSParser do something similar to FeedParser (which doesn't work quite right) - that is, instead of indexing the whole contents of the feed, I want it to show individual items, with their respective title and proper link to the article. I realize that I could crawl one depth more, but I'd like to index just the feed, not the articles that go with it (keep the index small and the crawl fast).

For each item in each RSS channel I do something like this (the code does not differ much from getParse() of RSSParser.java):

  Outlink[] outlinks = new Outlink[1];
  try {
    outlinks[0] = new Outlink(whichLink, theRSSItem.getTitle());
  } catch (Exception e) {
    continue;
  }
  parseResult.put(
    whichLink,
    new ParseText(theRSSItem.getTitle() + theRSSItem.getDescription()),
    new ParseData(
      ParseStatus.STATUS_SUCCESS,
      theRSSItem.getTitle(),
      outlinks,
      new Metadata() // was content.getMetadata()
    )
  );

The problem is, however, that only one item from the whole RSS feed gets into the index, although in the log I can see them all (I've tried it with feeds from CNN and Reuters). What happens? Why do they get overwritten in a seemingly random order? The item that makes it into the index is neither the first nor the last, but appears to be the same until new items appear in the feed.

Thank you,
Vlad
Re: RSS-fecter and index individul-how can i realize this function
Where can I find Scott's solution? I am trying to do it exactly like Scott, but I cannot imagine how to index items separately. Please, can anybody help me?

Many thanks,
Miro

sdeck wrote:
> So, here is what I do for RSS feeds. I parse the RSS, and for each outlink I create the outlink object and set inside the anchor text of each outlink a well-formed XML string. It contains the pub date, description, etc. Now, this is only because I was hacking the outlink to just use its anchor text, but you could always just create a new Metadata object for use with an outlink. So, the next time that URL is called up and you get an HTML parser, you could look at the outlink's metadata and say: hey, look, you came from an RSS feed. So I can either just use your stored metadata and not parse the HTML, or I could combine your metadata with what comes from the HTML, etc. I have found that to be the best solution. Also, when I parse the RSS feed, I set a meta tag called noindex, so in my basic indexer, if that is in there, I do not include the RSS feed page in the Lucene index.
> Scott
>
> Doug Cutting wrote:
> > Chris Mattmann wrote:
> > > Got it. So, the logic behind this is: why bother waiting until the following fetch to parse (and create ParseData objects from) the RSS items out of the feed. Okay, I get it, assuming that the RSS feed has *all* of the RSS metadata in it. However, it's perfectly acceptable to have feeds that simply have a title, description, and link in it.
> >
> > Almost. The feed may have less than the referenced page, but it's also a lot easier to parse, since the link could be an anchor within a large page, or could be a page that has lots of navigation links, spam comments, etc. So feed entries are generally much more precise than the pages they reference, and may make for a higher-quality search experience.
> >
> > > I guess this is still valuable metadata information to have; however, the only caveat is the implication of the proposed change: 1. We won't have cached copies, or fetched copies, of the Content represented by the item links. Therefore, in this model, we won't be able to pull up a Nutch cache of the page corresponding to the RSS item, because we are circumventing the fetch step.
> >
> > Good point. We indeed wouldn't have these URLs in the cache.
> >
> > > 2. It sounds like a pretty fundamental API shift in Nutch, to support a single type of content, RSS. Even if there are more content types that follow this model, as Doug and Renaud both pointed out, there aren't a multitude of them (perhaps archive files, but can you think of any others)?
> >
> > Also true. On the other hand, Nutch provides 98% of an RSS search engine. It'd be a shame to have to re-invent everything else, and it would be great if Nutch could evolve to support RSS well. Image search might also benefit from this: one could generate a Parse for each image on a page, whose text was drawn from the page. Product search too, perhaps.
> >
> > > The other main thing that comes to mind for me is that it prevents the fetched Content for the RSS items from being able to provide useful metadata, in the sense that it doesn't explicitly fetch the content. What if we wanted to apply some super cool metadata extractor X that used word-stemming, HTML design analysis, and other techniques to extract metadata from the content pointed to by an RSS item link? In the proposed model, we assume that the RSS item tag already contains all necessary metadata for indexing, which in my mind limits the model. Does what I am saying make sense? I'm not shooting down the issue, I'm just trying to brainstorm a bit here.
> >
> > Sure, the RSS feed may contain less than the page it references, but that might be all that one wishes to index. Otherwise, if, e.g., a blog includes titles from other recent posts, you're going to get lots of false positives. Ideally Nutch should support various options: searching the feed only, searching the referenced page only, or perhaps searching both.
> > Doug

-- View this message in context: http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tp8722009p20815016.html Sent from the Nutch - Dev mailing list archive at Nabble.com.
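For Miro and anyone else looking for Scott's approach: here is a toy, self-contained illustration of the anchor-text trick he describes above. The XML element names and the class are made up for this sketch; real code would attach a Nutch Metadata object to the Outlink rather than abusing the anchor text.

```java
// Sketch of Scott's trick: carry item metadata from the feed parse to the
// later HTML parse by encoding it as a small well-formed XML string stored
// in the outlink's anchor text. Element names here are illustrative only.
public class AnchorXml {
  // Escape the three characters that would break well-formedness.
  static String escape(String s) {
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
  }

  static String itemAsAnchorXml(String title, String pubDate, String description) {
    return "<rssitem>"
        + "<title>" + escape(title) + "</title>"
        + "<pubDate>" + escape(pubDate) + "</pubDate>"
        + "<description>" + escape(description) + "</description>"
        + "</rssitem>";
  }

  public static void main(String[] args) {
    String anchor = itemAsAnchorXml("Stocks < yesterday", "Mon, 05 Jan 2009", "Markets & more");
    System.out.println(anchor);
    // Later, when the linked page is fetched, the HTML parser can detect this
    // marker in the inlink anchor and reuse the feed's metadata instead of
    // (or in addition to) re-extracting metadata from the HTML.
  }
}
```

The `noindex` part of Scott's scheme is then just a check in the indexer: if the parse metadata carries that flag, skip adding the feed page itself to the Lucene index.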
RE: RSS-fecter and index individul-how can i realize this function
> > 2. It sounds like a pretty fundamental API shift in Nutch, to support a single type of content, RSS. Even if there are more content types that follow this model, as Doug and Renaud both pointed out, there aren't a multitude of them (perhaps archive files, but can you think of any others)?
>
> Also true. On the other hand, Nutch provides 98% of an RSS search engine. It'd be a shame to have to re-invent everything else, and it would be great if Nutch could evolve to support RSS well. Image search might also benefit from this: one could generate a Parse for each image on a page, whose text was drawn from the page. Product search too, perhaps.

Another application could be splitting certain enterprise documents up, either based on passage-retrieval algorithms or simply based on the table-of-contents entries. For example, a long contract or user guide could be split up into separate searchable documents.

Best regards,
Alan
_
Alan Tanaman
iDNA Solutions
http://blog.idna-solutions.com
Re: RSS-fecter and index individul-how can i realize this function
Hi Doug,

Okay, I see your points. It seems like this would be really useful for some current folks, and for Nutch going forward. I see that there has been some initial work today on preparing patches. I'd be happy to shepherd this into the sources. I will begin reviewing what's required, and contacting the folks who've begun work on this issue. Thanks!

Cheers,
Chris

On 2/7/07 1:31 PM, Doug Cutting [EMAIL PROTECTED] wrote:
> [...]
FW: RSS-fecter and index individul-how can i realize this function
I send this message again as it apparently didn't go through. (I am messing up my email addresses on the mailing list...)

-Original Message-
Sent: Friday, February 02, 2007 10:29 AM

Using Nutch 0.8, we modified the code starting at the fetching/parsing steps and onward. We have a different implementation of the Parse object and OutputFormat, including an additional list of ParseData objects saved in an additional subfolder in the DFS. We changed the indexing step a lot too, so we don't use the Nutch code there.

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Friday, February 02, 2007 10:19 AM
To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this function

[Automated notice: your correspondent is still writing to your orange-ft.com address, which will be disabled at the beginning of April. Please ask him/her to update his/her address book to orange-ftgroup.com.]

Gal Nitzan wrote:
> IMHO the data that is needed, i.e. the data that will be fetched in the next fetch process, is already available in the item element. Each item element represents one web resource, and there is no reason to go to the server and re-fetch that resource. Perhaps ProtocolOutput should change. The method Content getContent() could be deprecated and replaced with Content[] getContents().

This would require changes to the indexing pipeline. I can't think of any severe complications, but I haven't looked closely. Could something like that work?

Doug
Re: FW: RSS-fecter and index individul-how can i realize this function
HUYLEBROECK Jeremy RD-ILAB-SSF wrote:
> I send this message again as it apparently didn't go through. (I am messing up my email addresses on the mailing list...)
>
> Using Nutch 0.8, we modified the code starting at the fetching/parsing steps and onward. We have a different implementation of the Parse object and OutputFormat, including an additional list of ParseData objects saved in an additional subfolder in the DFS. We changed the indexing step a lot too, so we don't use the Nutch code there.

Is your implementation similar to what we started at https://issues.apache.org/jira/browse/NUTCH-443? If you think some of your changes could be integrated, please post a patch there.

Thanks for sharing,
Renaud

--
Renaud Richardet
+1 617 230 9112
my email is my first name at apache.org
http://www.oslutions.com
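Gal's ProtocolOutput suggestion quoted above could be sketched roughly like this. The classes here are simplified stand-ins, not the real Nutch types (which also carry fetch status and metadata); the point is only the single-vs-multiple Content shape of the API.

```java
// Minimal stand-in for Nutch's Content: one fetched web resource.
class Content {
  final String url;
  final byte[] bytes;
  Content(String url, byte[] bytes) { this.url = url; this.bytes = bytes; }
}

// Sketch of the proposed ProtocolOutput change: a fetch of one URL (e.g. an
// RSS feed) may yield several Content records, one per item.
class ProtocolOutput {
  private final Content[] contents;
  ProtocolOutput(Content[] contents) { this.contents = contents; }

  /** @deprecated single-document accessor; returns the first record. */
  @Deprecated
  Content getContent() { return contents[0]; }

  // Proposed replacement: the whole batch.
  Content[] getContents() { return contents; }
}

public class ProtocolOutputDemo {
  public static void main(String[] args) {
    Content a = new Content("http://ex.com/feed#item1", new byte[0]);
    Content b = new Content("http://ex.com/feed#item2", new byte[0]);
    ProtocolOutput out = new ProtocolOutput(new Content[] { a, b });
    System.out.println(out.getContents().length); // 2
    System.out.println(out.getContent().url);     // http://ex.com/feed#item1
  }
}
```

As Doug notes, the downstream cost of this shape is that every consumer of ProtocolOutput in the fetch/index pipeline must learn to loop over the array.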
Re: RSS-fecter and index individul-how can i realize this function
So, here is what I do for RSS feeds. I parse the RSS, and for each outlink I create the outlink object and set inside the anchor text of each outlink a well-formed XML string. It contains the pub date, description, etc. Now, this is only because I was hacking the outlink to just use its anchor text, but you could always just create a new Metadata object for use with an outlink. So, the next time that URL is called up and you get an HTML parser, you could look at the outlink's metadata and say: hey, look, you came from an RSS feed. So I can either just use your stored metadata and not parse the HTML, or I could combine your metadata with what comes from the HTML, etc. I have found that to be the best solution.

Also, when I parse the RSS feed, I set a meta tag called noindex, so in my basic indexer, if that is in there, I do not include the RSS feed page in the Lucene index.

Scott

Doug Cutting wrote:
> [...]

-- View this message in context: http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html#a8876127 Sent from the Nutch - Dev mailing list archive at Nabble.com.
Re: RSS-fecter and index individul-how can i realize this function
Renaud Richardet wrote:
> I see. I was thinking that I could index the feed items without having to fetch them individually.

Okay, so if Parser#parse returned a Map<String, Parse>, then the URL for each parse should be that of its link, since you don't want to fetch that separately. Right? So now the question is, how much impact would this change to the Parser API have on the rest of Nutch? It would require changes to all Parser implementations, to ParseSegment, to ParseUtil, and to Fetcher. But, as far as I can tell, most of these changes look straightforward.

Doug
Re: RSS-fecter and index individul-how can i realize this function
Guys,

Sorry to be so thick-headed, but could someone explain to me in really simple language what this change is requesting that is different from the current Nutch API? I still don't get it, sorry...

Cheers,
Chris

On 2/7/07 9:58 AM, Doug Cutting [EMAIL PROTECTED] wrote:
> [...]

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
_
Jet Propulsion Laboratory, Pasadena, CA
Office: 171-266B  Mailstop: 171-246
___
Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: RSS-fecter and index individul-how can i realize this function
Doug Cutting wrote:
> Renaud Richardet wrote:
> > I see. I was thinking that I could index the feed items without having to fetch them individually.
>
> Okay, so if Parser#parse returned a Map<String, Parse>, then the URL for each parse should be that of its link, since you don't want to fetch that separately. Right?

Exactly.

> So now the question is, how much impact would this change to the Parser API have on the rest of Nutch? It would require changes to all Parser implementations, to ParseSegment, to ParseUtil, and to Fetcher. But, as far as I can tell, most of these changes look straightforward.

I think so, too. I have opened an issue in JIRA (https://issues.apache.org/jira/browse/NUTCH-443) and will give it a try. Doğacan, have you started working on it yet?

Thanks,
Renaud
Re: RSS-fecter and index individul-how can i realize this function
Chris Mattmann wrote:
> Sorry to be so thick-headed, but could someone explain to me in really simple language what this change is requesting that is different from the current Nutch API? I still don't get it, sorry...

A Content would no longer generate a single Parse. Instead, a Content could potentially generate many Parses. For most types of content, e.g., HTML, each Content would still generate a single Parse. But for RSS, a Content might generate multiple Parses, each indexed separately and each with a distinct URL. Another potential application could be processing archives: the parser could unpack the archive, and each item in it would be indexed separately rather than indexing the archive as a whole. This only makes sense if each item has a distinct URL, which it does in RSS, but it might not in an archive. However, some archive file formats do contain URLs, like that used by the Internet Archive: http://www.archive.org/web/researcher/ArcFileFormat.php

Does that help?

Doug
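Doug's one-Content-to-many-Parses idea can be sketched with a self-contained toy. The Content and Parse classes below are minimal stand-ins for the real Nutch types, and a real plugin would use a proper XML parser rather than regexes; the point is only the shape of the return value: one map entry per feed item, keyed by the item's own link.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal stand-ins for Nutch's Content and Parse, so the sketch compiles alone.
class Content {
  final String url, raw;
  Content(String url, String raw) { this.url = url; this.raw = raw; }
}

class Parse {
  final String title, text;
  Parse(String title, String text) { this.title = title; this.text = text; }
}

public class FeedSplitter {
  // One entry per logical document: for RSS, one Parse per <item>, keyed by
  // the item's link rather than the feed URL. (Toy regex extraction only.)
  static Map<String, Parse> getParses(Content content) {
    Map<String, Parse> parses = new LinkedHashMap<>();
    Pattern item = Pattern.compile(
        "<item>.*?<title>(.*?)</title>.*?<link>(.*?)</link>"
        + ".*?<description>(.*?)</description>.*?</item>", Pattern.DOTALL);
    Matcher m = item.matcher(content.raw);
    while (m.find()) {
      parses.put(m.group(2), new Parse(m.group(1), m.group(1) + " " + m.group(3)));
    }
    return parses;
  }

  public static void main(String[] args) {
    String feed = "<rss><channel>"
        + "<item><title>A</title><link>http://ex.com/a</link><description>first</description></item>"
        + "<item><title>B</title><link>http://ex.com/b</link><description>second</description></item>"
        + "</channel></rss>";
    Map<String, Parse> parses = FeedSplitter.getParses(new Content("http://ex.com/feed.xml", feed));
    System.out.println(parses.size()); // 2
  }
}
```

An HTML parser under the same API would simply return a single-entry map keyed by the page URL, which is why most plugins need only a mechanical change.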
Re: RSS-fecter and index individul-how can i realize this function
> Also true. On the other hand, Nutch provides 98% of an RSS search engine. It'd be a shame to have to re-invent everything else, and it would be great if Nutch could evolve to support RSS well. Image search might also benefit from this: one could generate a Parse for each image on a page, whose text was drawn from the page. Product search too, perhaps.

These are excellent points. I am totally +1 for the API change; it opens doors for a lot of new possible applications.

--
Sami Siren
Re: RSS-fecter and index individul-how can i realize this function
Renaud Richardet wrote:
> I think so, too. I have opened an issue in JIRA (https://issues.apache.org/jira/browse/NUTCH-443) and will give it a try. Doğacan, have you started working on it yet?

I have just started working on it. I hope I will have something (at least a patch for everything but plugins) within the day.

--
Doğacan Güney
Re: RSS-fecter and index individul-how can i realize this function
Hi,

Doug Cutting wrote:
> Doğacan Güney wrote:
> > I think it would make much more sense to change parse plugins to take Content and return Parse[] instead of Parse.
>
> You're right. That does make more sense.
> Doug

OK, then should I go forward with this and implement something? This should be pretty easy, though I am not sure what to give as keys for a Parse[]. I mean, when getParse returned a single Parse, ParseSegment output them as <url, Parse>. But if getParse returns an array, what will be the key for each element? Something like <url#i, Parse[i]> may work, but this may cause problems in dedup (for example, assume we fetched the same RSS feed twice, and indexed them in different indexes: the two versions' url#0 may be different items, but since they have the same key, dedup will delete the older one).

--
Doğacan Güney
Re: RSS-fecter and index individul-how can i realize this function
Hi,

IMO it should stay the same: URL as the key, and in the filter each item's link element becomes the key. I will be happy to convert the current parse-rss filter to the suggested implementation.

Gal.

-- Original Message --
Received: Tue, 06 Feb 2007 10:36:03 AM IST
From: Doğacan Güney [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this function

> [...]
Re: RSS-fecter and index individul-how can i realize this function
Doğacan Güney wrote:
> OK, then should I go forward with this and implement something? This should be pretty easy, though I am not sure what to give as keys for a Parse[]. I mean, when getParse returned a single Parse, ParseSegment output them as <url, Parse>. But if getParse returns an array, what will be the key for each element?

Perhaps Parser#parse could return a Map<String, Parse>, where the keys are URLs?

> Something like <url#i, Parse[i]> may work, but this may cause problems in dedup (for example, assume we fetched the same RSS feed twice, and indexed them in different indexes: the two versions' url#0 may be different items, but since they have the same key, dedup will delete the older one).

If the feed contains unique ids for items, then those can be used to qualify the URL. Otherwise one could use the hash of the link of the item. Since the target of the link must still be indexed separately from the item itself, how much use is all this? If the RSS document is considered a single page that changes frequently, and items' links are considered ordinary outlinks, isn't much the same effect achieved?

Doug
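Doug's key-qualification suggestion can be sketched like this (a hypothetical helper, hashing the item's link with MD5; a real implementation might prefer the feed item's guid when present). Unlike positional keys such as url#0, url#1, ..., these keys stay stable across fetches of the same feed, which avoids the dedup problem Doğacan describes.

```java
import java.math.BigInteger;
import java.security.MessageDigest;

// Hypothetical helper: qualify the feed URL with a hash of the item's link,
// so the same item always maps to the same key regardless of its position
// in the feed on any given fetch.
public class ItemKeys {
  static String itemKey(String feedUrl, String itemLink) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      byte[] digest = md5.digest(itemLink.getBytes("UTF-8"));
      // Render the 16-byte digest as 32 hex chars and append to the feed URL.
      return feedUrl + "#" + String.format("%032x", new BigInteger(1, digest));
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    String feed = "http://rss.cnn.com/rss/cnn_topstories.rss";
    // Same item link always yields the same key; different links differ.
    System.out.println(itemKey(feed, "http://ex.com/a").equals(itemKey(feed, "http://ex.com/a"))); // true
    System.out.println(itemKey(feed, "http://ex.com/a").equals(itemKey(feed, "http://ex.com/b"))); // false
  }
}
```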
Re: RSS-fecter and index individul-how can i realize this function
Hi Doug, Since the target of the link must still be indexed separately from the item itself, how much use is all this? If the RSS document is considered a single page that changes frequently, and items' links are considered ordinary outlinks, isn't much the same effect achieved? IMHO, yes. That's why it's been hard for me to understand the real use case for what Gal et al. are talking about. I've been trying to wrap my head around it, but it seems to me the capability they require is sort of already provided... Cheers, Chris Doug __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: RSS-fecter and index individul-how can i realize this function
Hi Chris, Doug, Chris Mattmann wrote: Hi Doug, Since the target of the link must still be indexed separately from the item itself, how much use is all this? If the RSS document is considered a single page that changes frequently, and items' links are considered ordinary outlinks, isn't much the same effect achieved? IMHO, yes. That's why it's been hard for me to understand the real use case for what Gal et al. are talking about. I've been trying to wrap my head around it, but it seems to me the capability they require is sort of already provided... Not sure I understand: an RSS feed is a collection of feed entries, and each feed entry would be indexed as a separate document (each feed entry has a URL or UUID as a unique identifier). What happens with the RSS feed itself? Is it indexed, or considered as a container that just needs to be fetched, and fetched again for new entries? The use case is that you index RSS feeds, but your users can search each feed entry as a single document. Does it make sense? Thanks, Renaud Cheers, Chris Doug -- Renaud Richardet +1 617 230 9112 my email is my first name at apache.org http://www.oslutions.com
Re: RSS-fecter and index individul-how can i realize this function
Renaud Richardet wrote: The use case is that you index RSS feeds, but your users can search each feed entry as a single document. Does it make sense? But each feed item also contains a link whose content will be indexed, and that's generally a superset of the item. So should there be two URLs indexed per item? In many cases, the best thing to do is to index only the linked page, not the feed item at all. In some (rare?) cases, there might be items without a link, whose only content is directly in the feed, or where the content in the feed is complementary to that in the linked page. In these cases it might be useful to combine the two (the feed item and the linked content), indexing both. The proposed change might permit that. Is that the case you're concerned about? Doug
Re: RSS-fecter and index individul-how can i realize this function
Doug Cutting wrote: Renaud Richardet wrote: The use case is that you index RSS feeds, but your users can search each feed entry as a single document. Does it make sense? But each feed item also contains a link whose content will be indexed, and that's generally a superset of the item. Agreed. So should there be two URLs indexed per item? I don't think so. In many cases, the best thing to do is to index only the linked page, not the feed item at all. In some (rare?) cases, there might be items without a link, whose only content is directly in the feed, or where the content in the feed is complementary to that in the linked page. In these cases it might be useful to combine the two (the feed item and the linked content), indexing both. The proposed change might permit that. Is that the case you're concerned about? I see. I was thinking that I could index the feed items without having to fetch them individually. More fundamentally, I want to index only the blog-entry text, and not the elements around it (header, menus, ads, ...), so as to improve the search results. Here's my case; the proposed changes would allow me to do (*):

1) parse feeds:
   for each (feedentry : feed) do
     if (full-text entries) then
       index each feed entry as a single document; blog header, menus are not indexed. *
     else
       create a special outlink for each feed entry, which includes metadata (content, time, etc.)
     endif
   done

2) on a next fetch loop:
   for each (link) do
     if (this is a normal link) then
       fetch it and index it normally
     else if (this link comes from an already indexed feed entry) then
       end, do not fetch it *
     else if (this is a special outlink) then
       guess which DOM nodes hold the post content
       index it; blog header, menus are not indexed.
     endif
   done

Thanks, Renaud
Re: RSS-fecter and index individul-how can i realize this function
Renaud Richardet wrote: Doug Cutting wrote: Renaud Richardet wrote: The use case is that you index RSS feeds, but your users can search each feed entry as a single document. Does it make sense? But each feed item also contains a link whose content will be indexed, and that's generally a superset of the item. Agreed. So should there be two URLs indexed per item? I don't think so. In many cases, the best thing to do is to index only the linked page, not the feed item at all. In some (rare?) cases, there might be items without a link, whose only content is directly in the feed, or where the content in the feed is complementary to that in the linked page. In these cases it might be useful to combine the two (the feed item and the linked content), indexing both. The proposed change might permit that. Is that the case you're concerned about? I see. I was thinking that I could index the feed items without having to fetch them individually. More fundamentally, I want to index only the blog-entry text, and not the elements around it (header, menus, ads, ...), so as to improve the search results. Here's my case; the proposed changes would allow me to do (*):

1) parse feeds:
   for each (feedentry : feed) do
     if (full-text entries) then
       index each feed entry as a single document; blog header, menus are not indexed. *
     else
       create a special outlink for each feed entry, which includes metadata (content, time, etc.)
     endif
   done

2) on a next fetch loop:
   for each (link) do
     if (this is a normal link) then
       fetch it and index it normally
     else if (this link comes from an already indexed feed entry) then
       end, do not fetch it *
     else if (this is a special outlink) then
       guess which DOM nodes hold the post content
       index it; blog header, menus are not indexed.
     endif
   done

I agree with Renaud Richardet. Also, I think it all boils down to speed. If you are building a blog search engine, you want it to update feeds as fast as it can.
Doing 2 depths (one for the RSS feed, one for outlinks) will slow it down. Besides that, many blog crawlers (like http://help.yahoo.com/help/us/ysearch/crawling/crawling-02.html) set crawl-delay to 1, and so I guess most web servers are OK with that for RSS feeds, but not necessarily OK with it for HTML pages. (So you would do depth 1 (RSS feeds) very fast (with a 1-second delay), and then get the items with a 5-second delay.) (I hope it is not stupid to point out Yahoo's crawler to someone who works at Yahoo :) -- Doğacan Güney Thanks, Renaud
Re: RSS-fecter and index individul-how can i realize this function
Doug Cutting wrote: Gal Nitzan wrote: IMHO the data that is needed, i.e. the data that will be fetched in the next fetch process, is already available in the item element. Each item element represents one web resource. And there is no reason to go to the server and re-fetch that resource. Perhaps ProtocolOutput should change. The method: Content getContent(); could be deprecated and replaced with: Content[] getContents(); This would require changes to the indexing pipeline. I can't think of any severe complications, but I haven't looked closely. Since getProtocolOutput is called by Fetcher, the fetcher (actually, the underlying protocol plugin) needs to be aware that we are actually fetching an RSS feed and partially parse it to return an array of Contents. I think it would make much more sense to change parse plugins to take content and return Parse[] instead of Parse. -- Doğacan Güney Could something like that work? Doug
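The two competing proposals in this exchange can be sketched side by side. These are hypothetical interfaces with stand-in types so the file compiles on its own; neither is Nutch's actual API (later Nutch versions introduced a ParseResult abstraction, as the first message in this thread shows with `parseResult.put(...)`).

```java
import java.util.Map;

// Stand-in types so the sketch is self-contained.
class Content {}
class Parse {}

// Doug's first idea: let the protocol layer split one fetched feed
// into many Content objects, one per item.
interface ProtocolOutputProposal {
    @Deprecated
    Content getContent();     // old single-content accessor, kept for migration
    Content[] getContents();  // proposed multi-content accessor
}

// Doğacan's counter-proposal: keep fetching a single Content, but let the
// parser fan it out into several parses, keyed by URL as Doug then suggested.
interface ParserProposal {
    Map<String, Parse> getParse(Content content);
}

class ProposalDemo {
    public static void main(String[] args) {
        // A feed parser under the second proposal would return one entry
        // per feed item, each under its own (qualified) URL key.
        ParserProposal feedParser =
                content -> Map.of("http://example.com/item#abc", new Parse());
        System.out.println(feedParser.getParse(new Content()).keySet());
    }
}
```

The second proposal keeps the fetcher format-agnostic, which is why it won out in the discussion: only the parse plugin needs to know the content is a feed.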
Re: RSS-fecter and index individul-how can i realize this function
Doğacan Güney wrote: I think it would make much more sense to change parse plugins to take content and return Parse[] instead of Parse. You're right. That does make more sense. Doug
Re: RSS-fecter and index individul-how can i realize this function
I've changed the code as you said, but I get an exception like this. Why is the exception coming from the MD5Signature class?

2007-02-05 11:28:38,453 WARN feedparser.FeedFilter (FeedFilter.java:doDecodeEntities(223)) - Filter encountered unknown entities
2007-02-05 11:28:39,390 INFO crawl.SignatureFactory (SignatureFactory.java:getSignature(45)) - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2007-02-05 11:28:40,078 WARN mapred.LocalJobRunner (LocalJobRunner.java:run(120)) - job_f6j55m
java.lang.NullPointerException
    at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:121)
    at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:87)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:235)
    at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:247)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112)

On 2/3/07, Renaud Richardet [EMAIL PROTECTED] wrote: Gal, Chris, Kauu, So, if I understand correctly, you need a way to pass information along the fetches, so that when Nutch fetches a feed entry, its item value previously fetched is available. This is how I tackled the issue: - extend Outlink.java and allow creating outlinks with more metadata; in your feed parser, use this way to create outlinks - pass on the metadata through ParseOutputFormat.java and Fetcher.java - retrieve the metadata in HtmlParser.java and use it. This is very tedious, will blow up the size of your outlinks db, makes changes in the core code of Nutch, etc... But this is the only way I came up with...
If someone sees a better way, please let me know :-) Sample code, for Nutch 0.8.x:

Outlink.java:
+ public Outlink(String toUrl, String anchor, String entryContents, Configuration conf) throws MalformedURLException {
+   this.toUrl = new UrlNormalizerFactory(conf).getNormalizer().normalize(toUrl);
+   this.anchor = anchor;
+   this.entryContents = entryContents;
+ }
and update the other methods.

ParseOutputFormat.java, around line 140:
+ // set outlink info in metadata ME
+ String entryContents = links[i].getEntryContents();
+ if (entryContents.length() > 0) { // it's a feed entry
+   MapWritable meta = new MapWritable();
+   meta.put(new UTF8("entryContents"), new UTF8(entryContents)); // key/value
+   target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
+   target.setMetaData(meta);
+ } else {
+   target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval); // no meta
+ }

Fetcher.java, around l. 266:
+ // add feed info to metadata
+ try {
+   String entryContents = datum.getMetaData().get(new UTF8("entryContents")).toString();
+   metadata.set("entryContents", entryContents);
+ } catch (Exception e) { } // not found

HtmlParser.java:
// get entry metadata
String entryContents = content.getMetadata().get("entryContents");

HTH, Renaud Gal Nitzan wrote: Hi Chris, I'm sorry I wasn't clear enough. What I mean is that in the current implementation: 1. The RSS (channels, items) page ends up as one Lucene document in the index. 2. Indeed the links are extracted and each item link will be fetched in the next fetch as a separate page and will end up as one Lucene document. IMHO the data that is needed, i.e. the data that will be fetched in the next fetch process, is already available in the item element. Each item element represents one web resource. And there is no reason to go to the server and re-fetch that resource. Another issue that arises from RSS feeds is that once the feed page is fetched you cannot re-fetch it until its time-to-fetch has expired. The feeds' TTL is usually very short.
Since, for now, in Nutch all pages are created equal :), it is one more thing to think about. HTH, Gal. -Original Message- From: Chris Mattmann [mailto:[EMAIL PROTECTED] Sent: Thursday, February 01, 2007 7:01 PM To: nutch-dev@lucene.apache.org Subject: Re: RSS-fecter and index individul-how can i realize this function Hi Gal, et al., I'd like to be explicit when we talk about what the issue with the RSS parsing plugin is here; I think we have had conversations similar to this before and it seems that we keep talking around each other. I'd like to get to the heart of this matter so that the issue (if there is an actual one) gets addressed ;) Okay, so you mention below that the thing that you see missing from the current RSS parsing plugin is the ability to store data in the CrawlDatum, and parse it in the next fetch phase. Well, there are 2 options here for what you refer to as "it": 1. If you're talking about the RSS file, then in fact, it is parsed, and its data is stored in the CrawlDatum, akin to any other form of content that is fetched, parsed
Re: RSS-fecter and index individul-how can i realize this function
Gal, Chris, Kauu, So, if I understand correctly, you need a way to pass information along the fetches, so that when Nutch fetches a feed entry, its item value previously fetched is available. This is how I tackled the issue: - extend Outlink.java and allow creating outlinks with more metadata; in your feed parser, use this way to create outlinks - pass on the metadata through ParseOutputFormat.java and Fetcher.java - retrieve the metadata in HtmlParser.java and use it. This is very tedious, will blow up the size of your outlinks db, makes changes in the core code of Nutch, etc... But this is the only way I came up with... If someone sees a better way, please let me know :-) Sample code, for Nutch 0.8.x:

Outlink.java:
+ public Outlink(String toUrl, String anchor, String entryContents, Configuration conf) throws MalformedURLException {
+   this.toUrl = new UrlNormalizerFactory(conf).getNormalizer().normalize(toUrl);
+   this.anchor = anchor;
+   this.entryContents = entryContents;
+ }
and update the other methods.

ParseOutputFormat.java, around line 140:
+ // set outlink info in metadata ME
+ String entryContents = links[i].getEntryContents();
+ if (entryContents.length() > 0) { // it's a feed entry
+   MapWritable meta = new MapWritable();
+   meta.put(new UTF8("entryContents"), new UTF8(entryContents)); // key/value
+   target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
+   target.setMetaData(meta);
+ } else {
+   target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval); // no meta
+ }

Fetcher.java, around l. 266:
+ // add feed info to metadata
+ try {
+   String entryContents = datum.getMetaData().get(new UTF8("entryContents")).toString();
+   metadata.set("entryContents", entryContents);
+ } catch (Exception e) { } // not found

HtmlParser.java:
// get entry metadata
String entryContents = content.getMetadata().get("entryContents");

HTH, Renaud Gal Nitzan wrote: Hi Chris, I'm sorry I wasn't clear enough. What I mean is that in the current implementation: 1.
The RSS (channels, items) page ends up as one Lucene document in the index. 2. Indeed the links are extracted and each item link will be fetched in the next fetch as a separate page and will end up as one Lucene document. IMHO the data that is needed i.e. the data that will be fetched in the next fetch process is already available in the item element. Each item element represents one web resource. And there is no reason to go to the server and re-fetch that resource. Another issue that arises from rss feeds is that once the feed page is fetched you can not re-fetch it until its time to fetch expired. The feeds TTL is usually very short. Since for now in Nutch, all pages created equal :) it is one more thing to think about. HTH, Gal. -Original Message- From: Chris Mattmann [mailto:[EMAIL PROTECTED] Sent: Thursday, February 01, 2007 7:01 PM To: nutch-dev@lucene.apache.org Subject: Re: RSS-fecter and index individul-how can i realize this function Hi Gal, et al., I'd like to be explicit when we talk about what the issue with the RSS parsing plugin is here; I think we have had conversations similar to this before and it seems that we keep talking around each other. I'd like to get to the heart of this matter so that the issue (if there is an actual one) gets addressed ;) Okay, so you mention below that the thing that you see missing from the current RSS parsing plugin is the ability to store data in the CrawlDatum, and parse it in the next fetch phase. Well, there are 2 options here for what you refer to as it: 1. If you're talking about the RSS file, then in fact, it is parsed, and its data is stored in the CrawlDatum, akin to any other form of content that is fetched, parsed and indexed. 2. If you're talking about the item links within the RSS file, in fact, they are parsed (eventually), and their data stored in the CrawlDatum, akin to any other form of content that is fetched, parsed, and indexed. 
This is accomplished by adding the RSS items as Outlinks when the RSS file is parsed: in this fashion, we go after all of the links in the RSS file, and make sure that we index their content as well. Thus, if you had an RSS file R that contained links in it to a PDF file A, and another HTML page P, then not only would R get fetched, parsed, and indexed, but so would A and P, because they are item links within R. Then queries that would match R (the physical RSS file), would additionally match things such as P and A, and all 3 would be capable of being returned in a Nutch query. Does this make sense? Is this the issue that you're talking about? Am I nuts? ;) Cheers, Chris On 1/31/07 10:40 PM, Gal Nitzan [EMAIL PROTECTED] wrote: Hi, Many sites provide RSS feeds for several reasons, usually to save bandwidth, to give
Re: RSS-fecter and index individul-how can i realize this function
Gal Nitzan wrote: IMHO the data that is needed i.e. the data that will be fetched in the next fetch process is already available in the item element. Each item element represents one web resource. And there is no reason to go to the server and re-fetch that resource. Perhaps ProtocolOutput should change. The method: Content getContent(); could be deprecated and replaced with: Content[] getContents(); This would require changes to the indexing pipeline. I can't think of any severe complications, but I haven't looked closely. Could something like that work? Doug
Re: RSS-fecter and index individul-how can i realize this function
Hi Gal, et al., I'd like to be explicit when we talk about what the issue with the RSS parsing plugin is here; I think we have had conversations similar to this before and it seems that we keep talking around each other. I'd like to get to the heart of this matter so that the issue (if there is an actual one) gets addressed ;) Okay, so you mention below that the thing that you see missing from the current RSS parsing plugin is the ability to store data in the CrawlDatum, and parse it in the next fetch phase. Well, there are 2 options here for what you refer to as it: 1. If you're talking about the RSS file, then in fact, it is parsed, and its data is stored in the CrawlDatum, akin to any other form of content that is fetched, parsed and indexed. 2. If you're talking about the item links within the RSS file, in fact, they are parsed (eventually), and their data stored in the CrawlDatum, akin to any other form of content that is fetched, parsed, and indexed. This is accomplished by adding the RSS items as Outlinks when the RSS file is parsed: in this fashion, we go after all of the links in the RSS file, and make sure that we index their content as well. Thus, if you had an RSS file R that contained links in it to a PDF file A, and another HTML page P, then not only would R get fetched, parsed, and indexed, but so would A and P, because they are item links within R. Then queries that would match R (the physical RSS file), would additionally match things such as P and A, and all 3 would be capable of being returned in a Nutch query. Does this make sense? Is this the issue that you're talking about? Am I nuts? ;) Cheers, Chris On 1/31/07 10:40 PM, Gal Nitzan [EMAIL PROTECTED] wrote: Hi, Many sites provide RSS feeds for several reasons, usually to save bandwidth, to give the users concentrated data and so forth. Some of the RSS files supplied by sites are created specially for search engines where each RSS item represent a web page in the site. 
IMHO the only thing missing in the parse-rss plugin is storing the data in the CrawlDatum and parsing it in the next fetch phase. Maybe adding a new flag to CrawlDatum, that would flag the URL as parsable, not fetchable? Just my two cents... Gal. -Original Message- From: Chris Mattmann [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 31, 2007 8:44 AM To: nutch-dev@lucene.apache.org Subject: Re: RSS-fecter and index individul-how can i realize this function Hi there, With the explanation that you give below, it seems like parse-rss as it exists would address what you are trying to do. parse-rss parses an RSS channel as a set of items, and indexes overall metadata about the RSS file, including parse text, and index data, but it also adds each item (in the channel)'s URL as an Outlink, so that Nutch will process those pieces of content as well. The only thing that you suggest below that parse-rss currently doesn't do is to allow you to associate the metadata fields category: and author: with the item Outlink... Cheers, Chris On 1/30/07 7:30 PM, kauu [EMAIL PROTECTED] wrote: thx for ur reply. maybe i didn't tell it clearly. I want to index the item as an individual page, so that when i search for something, for example nutch-open source, nutch returns a hit which contains title: nutch-open source description: nutch nutch nutch nutch nutch url: http://lucene.apache.org/nutch category: news author: kauu so, can the plugin parse-rss satisfy what i need?

<item>
  <title>nutch--open source</title>
  <description>nutch nutch nutch nutch nutch</description>
  <link>http://lucene.apache.org/nutch</link>
  <category>news</category>
  <author>kauu</author>
</item>

On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there, I could most likely be of assistance, if you gave me some more information. For instance: I'm wondering if the use case you describe below is already supported by the current RSS parse plugin?
The current RSS parser, parse-rss, does in fact index individual items that are pointed to by an RSS document. The items are added as Nutch Outlinks, and added to the overall queue of URLs to fetch. Doesn't this satisfy what you mention below? Or am I missing something? Cheers, Chris On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote: Hi folks: What I want to do is to separate an RSS file into several pages, just as what has been discussed before. I want to fetch an RSS page and index it as different documents in the index, so the searcher can search each item's info as an individual hit. My idea is to create a protocol that fetches the RSS page and stores it as several pieces which each contain just one ITEM tag. But the unique key is the URL, so how can I store them with the ITEM's link tag as the unique key for a document? So my question is how
RE: RSS-fecter and index individul-how can i realize this function
Hi Chris, I'm sorry I wasn't clear enough. What I mean is that in the current implementation: 1. The RSS (channels, items) page ends up as one Lucene document in the index. 2. Indeed the links are extracted and each item link will be fetched in the next fetch as a separate page and will end up as one Lucene document. IMHO the data that is needed i.e. the data that will be fetched in the next fetch process is already available in the item element. Each item element represents one web resource. And there is no reason to go to the server and re-fetch that resource. Another issue that arises from rss feeds is that once the feed page is fetched you can not re-fetch it until its time to fetch expired. The feeds TTL is usually very short. Since for now in Nutch, all pages created equal :) it is one more thing to think about. HTH, Gal. -Original Message- From: Chris Mattmann [mailto:[EMAIL PROTECTED] Sent: Thursday, February 01, 2007 7:01 PM To: nutch-dev@lucene.apache.org Subject: Re: RSS-fecter and index individul-how can i realize this function Hi Gal, et al., I'd like to be explicit when we talk about what the issue with the RSS parsing plugin is here; I think we have had conversations similar to this before and it seems that we keep talking around each other. I'd like to get to the heart of this matter so that the issue (if there is an actual one) gets addressed ;) Okay, so you mention below that the thing that you see missing from the current RSS parsing plugin is the ability to store data in the CrawlDatum, and parse it in the next fetch phase. Well, there are 2 options here for what you refer to as it: 1. If you're talking about the RSS file, then in fact, it is parsed, and its data is stored in the CrawlDatum, akin to any other form of content that is fetched, parsed and indexed. 2. 
If you're talking about the item links within the RSS file, in fact, they are parsed (eventually), and their data stored in the CrawlDatum, akin to any other form of content that is fetched, parsed, and indexed. This is accomplished by adding the RSS items as Outlinks when the RSS file is parsed: in this fashion, we go after all of the links in the RSS file, and make sure that we index their content as well. Thus, if you had an RSS file R that contained links in it to a PDF file A, and another HTML page P, then not only would R get fetched, parsed, and indexed, but so would A and P, because they are item links within R. Then queries that would match R (the physical RSS file), would additionally match things such as P and A, and all 3 would be capable of being returned in a Nutch query. Does this make sense? Is this the issue that you're talking about? Am I nuts? ;) Cheers, Chris On 1/31/07 10:40 PM, Gal Nitzan [EMAIL PROTECTED] wrote: Hi, Many sites provide RSS feeds for several reasons, usually to save bandwidth, to give the users concentrated data and so forth. Some of the RSS files supplied by sites are created specially for search engines where each RSS item represent a web page in the site. IMHO the only thing missing in the parse-rss plugin is storing the data in the CrawlDatum and parsing it in the next fetch phase. Maybe adding a new flag to CrawlDatum, that would flag the URL as parsable not fetchable? Just my two cents... Gal. -Original Message- From: Chris Mattmann [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 31, 2007 8:44 AM To: nutch-dev@lucene.apache.org Subject: Re: RSS-fecter and index individul-how can i realize this function Hi there, With the explanation that you give below, it seems like parse-rss as it exists would address what you are trying to do. 
parse-rss parses an RSS channel as a set of items, and indexes overall metadata about the RSS file, including parse text, and index data, but it also adds each item (in the channel)'s URL as an Outlink, so that Nutch will process those pieces of content as well. The only thing that you suggest below that parse-rss currently doesn't do, is to allow you to associate the metadata fields category:, and author: with the item Outlink... Cheers, Chris On 1/30/07 7:30 PM, kauu [EMAIL PROTECTED] wrote: thx for ur reply . mybe i didn't tell clearly . I want to index the item as a individual page .then when i search the some thing for example nutch-open source, the nutch return a hit which contain title : nutch-open source description : nutch nutch nutch nutch nutch url : http://lucene.apache.org/nutch category : news author : kauu so , is the plugin parse-rss can satisfy what i need? item titlenutch--open source/title description nutch nutch nutch nutch nutch /description linkhttp://lucene.apache.org/nutch/link categorynews /category authorkauu/author On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there, I could most likely
Re: RSS-fecter and index individul-how can i realize this function
hi all, what Gal said is just my meaning on the rss-parse need. i just want to fetch rss seeds once, On 2/2/07, Gal Nitzan [EMAIL PROTECTED] wrote: Hi Chris, I'm sorry I wasn't clear enough. What I mean is that in the current implementation: 1. The RSS (channels, items) page ends up as one Lucene document in the index. 2. Indeed the links are extracted and each item link will be fetched in the next fetch as a separate page and will end up as one Lucene document. IMHO the data that is needed i.e. the data that will be fetched in the next fetch process is already available in the item element. Each item element represents one web resource. And there is no reason to go to the server and re-fetch that resource. Another issue that arises from rss feeds is that once the feed page is fetched you can not re-fetch it until its time to fetch expired. The feeds TTL is usually very short. Since for now in Nutch, all pages created equal :) it is one more thing to think about. HTH, Gal. -Original Message- From: Chris Mattmann [mailto:[EMAIL PROTECTED] Sent: Thursday, February 01, 2007 7:01 PM To: nutch-dev@lucene.apache.org Subject: Re: RSS-fecter and index individul-how can i realize this function Hi Gal, et al., I'd like to be explicit when we talk about what the issue with the RSS parsing plugin is here; I think we have had conversations similar to this before and it seems that we keep talking around each other. I'd like to get to the heart of this matter so that the issue (if there is an actual one) gets addressed ;) Okay, so you mention below that the thing that you see missing from the current RSS parsing plugin is the ability to store data in the CrawlDatum, and parse it in the next fetch phase. Well, there are 2 options here for what you refer to as it: 1. If you're talking about the RSS file, then in fact, it is parsed, and its data is stored in the CrawlDatum, akin to any other form of content that is fetched, parsed and indexed. 2. 
If you're talking about the item links within the RSS file, in fact, they are parsed (eventually), and their data stored in the CrawlDatum, akin to any other form of content that is fetched, parsed, and indexed. This is accomplished by adding the RSS items as Outlinks when the RSS file is parsed: in this fashion, we go after all of the links in the RSS file, and make sure that we index their content as well. Thus, if you had an RSS file R that contained links in it to a PDF file A, and another HTML page P, then not only would R get fetched, parsed, and indexed, but so would A and P, because they are item links within R. Then queries that would match R (the physical RSS file), would additionally match things such as P and A, and all 3 would be capable of being returned in a Nutch query. Does this make sense? Is this the issue that you're talking about? Am I nuts? ;) Cheers, Chris On 1/31/07 10:40 PM, Gal Nitzan [EMAIL PROTECTED] wrote: Hi, Many sites provide RSS feeds for several reasons, usually to save bandwidth, to give the users concentrated data and so forth. Some of the RSS files supplied by sites are created specially for search engines where each RSS item represent a web page in the site. IMHO the only thing missing in the parse-rss plugin is storing the data in the CrawlDatum and parsing it in the next fetch phase. Maybe adding a new flag to CrawlDatum, that would flag the URL as parsable not fetchable? Just my two cents... Gal. -Original Message- From: Chris Mattmann [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 31, 2007 8:44 AM To: nutch-dev@lucene.apache.org Subject: Re: RSS-fecter and index individul-how can i realize this function Hi there, With the explanation that you give below, it seems like parse-rss as it exists would address what you are trying to do. 
parse-rss parses an RSS channel as a set of items and indexes overall metadata about the RSS file, including parse text and index data, but it also adds each item's URL (in the channel) as an Outlink, so that Nutch will process those pieces of content as well. The only thing that you suggest below that parse-rss currently doesn't do is allow you to associate the metadata fields category: and author: with the item Outlink... Cheers, Chris On 1/30/07 7:30 PM, kauu [EMAIL PROTECTED] wrote: Thanks for your reply; maybe I didn't explain clearly. I want to index each item as an individual page, so that when I search for something, for example "nutch-open source", Nutch returns a hit which contains title: nutch-open source; description: nutch nutch nutch nutch nutch; url: http://lucene.apache.org/nutch; category: news; author: kauu. So, can the plugin parse-rss satisfy what I need?
<item>
  <title>nutch--open source</title>
  <description>nutch nutch nutch nutch nutch</description>
  <link>http://lucene.apache.org/nutch</link>
  <category>news</category>
  <author>kauu</author>
</item>
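The Outlink mechanism Chris describes (each item's URL queued for the next fetch round, with the item title as anchor text) can be sketched without Nutch itself. This is a minimal standalone illustration using only the JDK's DOM parser; the Link class here is a stand-in for Nutch's Outlink, not the real API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Standalone sketch of the outlink extraction that parse-rss performs. */
public class RssOutlinkSketch {

    /** An (url, anchor) pair, standing in for Nutch's Outlink class. */
    public static final class Link {
        public final String url;
        public final String anchor;
        Link(String url, String anchor) { this.url = url; this.anchor = anchor; }
    }

    /** Collects each item's link as an outlink, with the item title as anchor text. */
    public static List<Link> extractItemLinks(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
        List<Link> outlinks = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String url = text(item, "link");
            String title = text(item, "title");
            if (url != null) outlinks.add(new Link(url, title == null ? "" : title));
        }
        return outlinks;
    }

    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }
}
```

In real Nutch the returned links would become the Outlink[] handed to ParseData, which is what causes the item pages to be fetched in the next round.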
Re: RSS-fetcher and index individual items - how can I realize this function
Hi, thanks anyway, but I don't think I explained clearly enough. What I want is for Nutch to fetch the RSS seeds at depth 1 only, so Nutch should fetch just some XML pages. I don't want to fetch the pages behind the items' outlinks, because there is too much spam in those pages; I just need to parse the RSS file itself. Then, when I search for some words that appear in a description tag of one XML file's item, the returned hit would look like this: title == one item's title; summary == one item's description; link == one item's outlink. So, I don't know whether the parse-rss plugin provides this function? On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there, With the explanation that you give below, it seems like parse-rss as it exists would address what you are trying to do. parse-rss parses an RSS channel as a set of items and indexes overall metadata about the RSS file, including parse text and index data, but it also adds each item's URL (in the channel) as an Outlink, so that Nutch will process those pieces of content as well. The only thing that you suggest below that parse-rss currently doesn't do is allow you to associate the metadata fields category: and author: with the item Outlink... Cheers, Chris On 1/30/07 7:30 PM, kauu [EMAIL PROTECTED] wrote: Thanks for your reply; maybe I didn't explain clearly. I want to index each item as an individual page, so that when I search for something, for example "nutch-open source", Nutch returns a hit which contains title: nutch-open source; description: nutch nutch nutch nutch nutch; url: http://lucene.apache.org/nutch; category: news; author: kauu. So, can the plugin parse-rss satisfy what I need?
<item>
  <title>nutch--open source</title>
  <description>nutch nutch nutch nutch nutch</description>
  <link>http://lucene.apache.org/nutch</link>
  <category>news</category>
  <author>kauu</author>
</item>
On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there, I could most likely be of assistance if you gave me some more information.
For instance, I'm wondering whether the use case you describe below is already supported by the current RSS parse plugin? The current RSS parser, parse-rss, does in fact index the individual items that are pointed to by an RSS document. The items are added as Nutch Outlinks and added to the overall queue of URLs to fetch. Doesn't this satisfy what you mention below? Or am I missing something? Cheers, Chris On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote: Hi folks: What I want to do is to separate an RSS file into several pages, just as has been discussed before. I want to fetch an RSS page and index it as several different documents in the index, so the searcher can find an item's info as an individual hit. My idea is to create a protocol that fetches the RSS page and stores it as several pages, each containing just one ITEM tag. But the unique key is the URL, so how can I store them with the ITEM's link tag as the unique key for each document? So my question is how to realize this function in nutch-0.8.x. I've checked the code of the protocol-http plugin, but I can't find the code where a page is stored as a document. I want to separate the RSS page into several pages before it is stored, so that it becomes not one document but several. Can anyone give me some hints? Any reply will be appreciated! The ITEM's structure:
<item>
  <title>Late-striking snowstorm in Europe delays flights and disrupts traffic (photos)</title>
  <description>A snowstorm swept across Europe, delaying many flights. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages, and workers cleared snow from the runways at Munich airport in southern Germany. The late-arriving snowstorm reportedly swept across central...</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>Sohu focus photo news</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>
-- www.babatu.com
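kauu's depth-1 goal above (one indexed document per item, keyed by the item's link, built from the feed alone without fetching the item pages) can be sketched outside Nutch. This is a minimal standalone illustration using the JDK's DOM parser, not parse-rss's actual code; in Nutch itself each map entry would become one ParseResult entry, as in the parseResult.put snippet discussed in this thread.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Standalone sketch: turn one fetched feed into one record per item, keyed by the item link. */
public class FeedSplitSketch {

    /** Maps each item's link to its {title, description}. In Nutch, each entry would
        correspond to one ParseResult entry, and hence one document in the index. */
    public static Map<String, String[]> splitFeed(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String[]> docs = new LinkedHashMap<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String link = text(item, "link");
            if (link == null) continue; // no unique key for this item, skip it
            docs.put(link, new String[] { text(item, "title"), text(item, "description") });
        }
        return docs;
    }

    private static String text(Element parent, String tag) {
        NodeList n = parent.getElementsByTagName(tag);
        return n.getLength() == 0 ? null : n.item(0).getTextContent().trim();
    }
}
```

Using the item link as the map key is exactly the "ITEM's link tag as the unique key" idea: two items with different links stay separate documents even though they came from one fetched page.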
RE: RSS-fetcher and index individual items - how can I realize this function
Hi, Many sites provide RSS feeds for several reasons: usually to save bandwidth, to give users concentrated data, and so forth. Some of the RSS files supplied by sites are created specially for search engines, where each RSS item represents a web page on the site. IMHO the only thing missing in the parse-rss plugin is storing the data in the CrawlDatum and parsing it in the next fetch phase. Maybe adding a new flag to CrawlDatum that would mark the URL as parsable, not fetchable? Just my two cents... Gal. -Original Message- From: Chris Mattmann [mailto:[EMAIL PROTECTED]] Sent: Wednesday, January 31, 2007 8:44 AM To: nutch-dev@lucene.apache.org Subject: Re: RSS-fetcher and index individual items - how can I realize this function Hi there, With the explanation that you give below, it seems like parse-rss as it exists would address what you are trying to do. parse-rss parses an RSS channel as a set of items and indexes overall metadata about the RSS file, including parse text and index data, but it also adds each item's URL (in the channel) as an Outlink, so that Nutch will process those pieces of content as well. The only thing that you suggest below that parse-rss currently doesn't do is allow you to associate the metadata fields category: and author: with the item Outlink... Cheers, Chris On 1/30/07 7:30 PM, kauu [EMAIL PROTECTED] wrote: Thanks for your reply; maybe I didn't explain clearly. I want to index each item as an individual page, so that when I search for something, for example "nutch-open source", Nutch returns a hit which contains title: nutch-open source; description: nutch nutch nutch nutch nutch; url: http://lucene.apache.org/nutch; category: news; author: kauu. So, can the plugin parse-rss satisfy what I need?
<item>
  <title>nutch--open source</title>
  <description>nutch nutch nutch nutch nutch</description>
  <link>http://lucene.apache.org/nutch</link>
  <category>news</category>
  <author>kauu</author>
</item>
On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there, I could most likely be of assistance if you gave me some more information. For instance, I'm wondering whether the use case you describe below is already supported by the current RSS parse plugin? The current RSS parser, parse-rss, does in fact index the individual items that are pointed to by an RSS document. The items are added as Nutch Outlinks and added to the overall queue of URLs to fetch. Doesn't this satisfy what you mention below? Or am I missing something? Cheers, Chris On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote: Hi folks: What I want to do is to separate an RSS file into several pages, just as has been discussed before. I want to fetch an RSS page and index it as several different documents in the index, so the searcher can find an item's info as an individual hit. My idea is to create a protocol that fetches the RSS page and stores it as several pages, each containing just one ITEM tag. But the unique key is the URL, so how can I store them with the ITEM's link tag as the unique key for each document? So my question is how to realize this function in nutch-0.8.x. I've checked the code of the protocol-http plugin, but I can't find the code where a page is stored as a document. I want to separate the RSS page into several pages before it is stored, so that it becomes not one document but several. Can anyone give me some hints? Any reply will be appreciated! The ITEM's structure:
<item>
  <title>Late-striking snowstorm in Europe delays flights and disrupts traffic (photos)</title>
  <description>A snowstorm swept across Europe, delaying many flights. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages, and workers cleared snow from the runways at Munich airport in southern Germany. The late-arriving snowstorm reportedly swept across central...</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>Sohu focus photo news</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>
-- www.babatu.com
RSS-fetcher and index individual items - how can I realize this function
Hi folks: What I want to do is to separate an RSS file into several pages, just as has been discussed before. I want to fetch an RSS page and index it as several different documents in the index, so the searcher can find an item's info as an individual hit. My idea is to create a protocol that fetches the RSS page and stores it as several pages, each containing just one ITEM tag. But the unique key is the URL, so how can I store them with the ITEM's link tag as the unique key for each document? So my question is how to realize this function in nutch-0.8.x. I've checked the code of the protocol-http plugin, but I can't find the code where a page is stored as a document. I want to separate the RSS page into several pages before it is stored, so that it becomes not one document but several. Can anyone give me some hints? Any reply will be appreciated! The ITEM's structure:
<item>
  <title>Late-striking snowstorm in Europe delays flights and disrupts traffic (photos)</title>
  <description>A snowstorm swept across Europe, delaying many flights. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages, and workers cleared snow from the runways at Munich airport in southern Germany. The late-arriving snowstorm reportedly swept across central...</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>Sohu focus photo news</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>
Re: RSS-fetcher and index individual items - how can I realize this function
Hi there, I could most likely be of assistance if you gave me some more information. For instance, I'm wondering whether the use case you describe below is already supported by the current RSS parse plugin? The current RSS parser, parse-rss, does in fact index the individual items that are pointed to by an RSS document. The items are added as Nutch Outlinks and added to the overall queue of URLs to fetch. Doesn't this satisfy what you mention below? Or am I missing something? Cheers, Chris On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote: Hi folks: What I want to do is to separate an RSS file into several pages, just as has been discussed before. I want to fetch an RSS page and index it as several different documents in the index, so the searcher can find an item's info as an individual hit. My idea is to create a protocol that fetches the RSS page and stores it as several pages, each containing just one ITEM tag. But the unique key is the URL, so how can I store them with the ITEM's link tag as the unique key for each document? So my question is how to realize this function in nutch-0.8.x. I've checked the code of the protocol-http plugin, but I can't find the code where a page is stored as a document. I want to separate the RSS page into several pages before it is stored, so that it becomes not one document but several. Can anyone give me some hints? Any reply will be appreciated! The ITEM's structure:
<item>
  <title>Late-striking snowstorm in Europe delays flights and disrupts traffic (photos)</title>
  <description>A snowstorm swept across Europe, delaying many flights. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages, and workers cleared snow from the runways at Munich airport in southern Germany. The late-arriving snowstorm reportedly swept across central...</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>Sohu focus photo news</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>
Re: RSS-fetcher and index individual items - how can I realize this function
Thanks for your reply; maybe I didn't explain clearly. I want to index each item as an individual page, so that when I search for something, for example "nutch-open source", Nutch returns a hit which contains title: nutch-open source; description: nutch nutch nutch nutch nutch; url: http://lucene.apache.org/nutch; category: news; author: kauu. So, can the plugin parse-rss satisfy what I need?
<item>
  <title>nutch--open source</title>
  <description>nutch nutch nutch nutch nutch</description>
  <link>http://lucene.apache.org/nutch</link>
  <category>news</category>
  <author>kauu</author>
</item>
On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there, I could most likely be of assistance if you gave me some more information. For instance, I'm wondering whether the use case you describe below is already supported by the current RSS parse plugin? The current RSS parser, parse-rss, does in fact index the individual items that are pointed to by an RSS document. The items are added as Nutch Outlinks and added to the overall queue of URLs to fetch. Doesn't this satisfy what you mention below? Or am I missing something? Cheers, Chris On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote: Hi folks: What I want to do is to separate an RSS file into several pages, just as has been discussed before. I want to fetch an RSS page and index it as several different documents in the index, so the searcher can find an item's info as an individual hit. My idea is to create a protocol that fetches the RSS page and stores it as several pages, each containing just one ITEM tag. But the unique key is the URL, so how can I store them with the ITEM's link tag as the unique key for each document? So my question is how to realize this function in nutch-0.8.x. I've checked the code of the protocol-http plugin, but I can't find the code where a page is stored as a document. I want to separate the RSS page into several pages before it is stored, so that it becomes not one document but several. Can anyone give me some hints? Any reply will be appreciated!
The ITEM's structure:
<item>
  <title>Late-striking snowstorm in Europe delays flights and disrupts traffic (photos)</title>
  <description>A snowstorm swept across Europe, delaying many flights. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages, and workers cleared snow from the runways at Munich airport in southern Germany. The late-arriving snowstorm reportedly swept across central...</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>Sohu focus photo news</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>
-- www.babatu.com
Re: RSS-fetcher and index individual items - how can I realize this function
Hi there, On 1/30/07 7:00 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Chris, I saw your name associated with the RSS parser in Nutch. My understanding is that Nutch is using feedparser. I had two questions: 1. Have you looked at vtd as an RSS parser? I haven't, in fact; what are its benefits over those of commons-feedparser? 2. Any view on asynchronous communication as the underlying protocol? I do not believe that feedparser uses that at this point. I'm not sure exactly what asynchronous communication when parsing RSS feeds affords you: what type of communication are you talking about above? Nutch handles the communications layer for fetching content using a pluggable, Protocol-based model. The only feature that Nutch's RSS parser uses from the underlying feedparser library is its object model and callback framework for parsing RSS/Atom feed XML documents. When you mention asynchronous above, are you talking about the protocol for fetching the different RSS documents? Thanks! Cheers, Chris Thanks -Original Message- From: Chris Mattmann [EMAIL PROTECTED] Date: Tue, 30 Jan 2007 18:16:44 To: nutch-dev@lucene.apache.org Subject: Re: RSS-fetcher and index individual items - how can I realize this function Hi there, I could most likely be of assistance if you gave me some more information. For instance, I'm wondering whether the use case you describe below is already supported by the current RSS parse plugin? The current RSS parser, parse-rss, does in fact index the individual items that are pointed to by an RSS document. The items are added as Nutch Outlinks and added to the overall queue of URLs to fetch. Doesn't this satisfy what you mention below? Or am I missing something? Cheers, Chris On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote: Hi folks: What I want to do is to separate an RSS file into several pages, just as has been discussed before. I want to fetch an RSS page and index it as several different documents in the index.
So the searcher can find an item's info as an individual hit. My idea is to create a protocol that fetches the RSS page and stores it as several pages, each containing just one ITEM tag. But the unique key is the URL, so how can I store them with the ITEM's link tag as the unique key for each document? So my question is how to realize this function in nutch-0.8.x. I've checked the code of the protocol-http plugin, but I can't find the code where a page is stored as a document. I want to separate the RSS page into several pages before it is stored, so that it becomes not one document but several. Can anyone give me some hints? Any reply will be appreciated! The ITEM's structure:
<item>
  <title>Late-striking snowstorm in Europe delays flights and disrupts traffic (photos)</title>
  <description>A snowstorm swept across Europe, delaying many flights. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages, and workers cleared snow from the runways at Munich airport in southern Germany. The late-arriving snowstorm reportedly swept across central...</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>Sohu focus photo news</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>
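Chris mentions above that Nutch's RSS parser uses feedparser's object model and callback framework. As a rough illustration of the callback style only: the ItemListener interface and parse method below are hypothetical stand-ins, not commons-feedparser's real API. The idea is that the parser fires one callback per item, so the caller never has to hold the whole feed's object model.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of callback-style feed parsing; the interface is hypothetical. */
public class CallbackSketch {

    /** Hypothetical listener, invoked once per feed item. */
    public interface ItemListener {
        void onItem(String title, String link);
    }

    /** Stand-in "parser": items arrive as (title, link) pairs and are pushed
        to the listener one at a time, callback-style. */
    public static void parse(List<String[]> items, ItemListener listener) {
        for (String[] item : items) {
            listener.onItem(item[0], item[1]);
        }
    }

    /** Example caller: accumulate titles as the callbacks fire. */
    public static List<String> collectTitles(List<String[]> items) {
        List<String> titles = new ArrayList<>();
        parse(items, (title, link) -> titles.add(title));
        return titles;
    }
}
```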
Re: RSS-fetcher and index individual items - how can I realize this function
1. It claims to be faster. 2. Asynchronous communication should take care of sitting and waiting for one fetch to return before you do the next. P.S. I'm not sure if you have checked out tailrank.com for that branch of feedparser (I think it's at code.tailrank.com/feedparser). Thanks -Original Message- From: Chris Mattmann [EMAIL PROTECTED] Date: Tue, 30 Jan 2007 19:34:49 To: nutch-dev@lucene.apache.org Subject: Re: RSS-fetcher and index individual items - how can I realize this function Hi there, On 1/30/07 7:00 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Chris, I saw your name associated with the RSS parser in Nutch. My understanding is that Nutch is using feedparser. I had two questions: 1. Have you looked at vtd as an RSS parser? I haven't, in fact; what are its benefits over those of commons-feedparser? 2. Any view on asynchronous communication as the underlying protocol? I do not believe that feedparser uses that at this point. I'm not sure exactly what asynchronous communication when parsing RSS feeds affords you: what type of communication are you talking about above? Nutch handles the communications layer for fetching content using a pluggable, Protocol-based model. The only feature that Nutch's RSS parser uses from the underlying feedparser library is its object model and callback framework for parsing RSS/Atom feed XML documents. When you mention asynchronous above, are you talking about the protocol for fetching the different RSS documents? Thanks! Cheers, Chris Thanks -Original Message- From: Chris Mattmann [EMAIL PROTECTED] Date: Tue, 30 Jan 2007 18:16:44 To: nutch-dev@lucene.apache.org Subject: Re: RSS-fetcher and index individual items - how can I realize this function Hi there, I could most likely be of assistance if you gave me some more information. For instance, I'm wondering whether the use case you describe below is already supported by the current RSS parse plugin? The current RSS parser, parse-rss, does in fact index the individual items that are pointed to by an RSS document.
The items are added as Nutch Outlinks and added to the overall queue of URLs to fetch. Doesn't this satisfy what you mention below? Or am I missing something? Cheers, Chris On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote: Hi folks: What I want to do is to separate an RSS file into several pages, just as has been discussed before. I want to fetch an RSS page and index it as several different documents in the index, so the searcher can find an item's info as an individual hit. My idea is to create a protocol that fetches the RSS page and stores it as several pages, each containing just one ITEM tag. But the unique key is the URL, so how can I store them with the ITEM's link tag as the unique key for each document? So my question is how to realize this function in nutch-0.8.x. I've checked the code of the protocol-http plugin, but I can't find the code where a page is stored as a document. I want to separate the RSS page into several pages before it is stored, so that it becomes not one document but several. Can anyone give me some hints? Any reply will be appreciated! The ITEM's structure:
<item>
  <title>Late-striking snowstorm in Europe delays flights and disrupts traffic (photos)</title>
  <description>A snowstorm swept across Europe, delaying many flights. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages, and workers cleared snow from the runways at Munich airport in southern Germany. The late-arriving snowstorm reportedly swept across central...</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>Sohu focus photo news</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>
Re: RSS-fetcher and index individual items - how can I realize this function
Hi there, With the explanation that you give below, it seems like parse-rss as it exists would address what you are trying to do. parse-rss parses an RSS channel as a set of items and indexes overall metadata about the RSS file, including parse text and index data, but it also adds each item's URL (in the channel) as an Outlink, so that Nutch will process those pieces of content as well. The only thing that you suggest below that parse-rss currently doesn't do is allow you to associate the metadata fields category: and author: with the item Outlink... Cheers, Chris On 1/30/07 7:30 PM, kauu [EMAIL PROTECTED] wrote: Thanks for your reply; maybe I didn't explain clearly. I want to index each item as an individual page, so that when I search for something, for example "nutch-open source", Nutch returns a hit which contains title: nutch-open source; description: nutch nutch nutch nutch nutch; url: http://lucene.apache.org/nutch; category: news; author: kauu. So, can the plugin parse-rss satisfy what I need?
<item>
  <title>nutch--open source</title>
  <description>nutch nutch nutch nutch nutch</description>
  <link>http://lucene.apache.org/nutch</link>
  <category>news</category>
  <author>kauu</author>
</item>
On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there, I could most likely be of assistance if you gave me some more information. For instance, I'm wondering whether the use case you describe below is already supported by the current RSS parse plugin? The current RSS parser, parse-rss, does in fact index the individual items that are pointed to by an RSS document. The items are added as Nutch Outlinks and added to the overall queue of URLs to fetch. Doesn't this satisfy what you mention below? Or am I missing something? Cheers, Chris On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote: Hi folks: What I want to do is to separate an RSS file into several pages, just as has been discussed before. I want to fetch an RSS page and index it as several different documents in the index.
So the searcher can find an item's info as an individual hit. My idea is to create a protocol that fetches the RSS page and stores it as several pages, each containing just one ITEM tag. But the unique key is the URL, so how can I store them with the ITEM's link tag as the unique key for each document? So my question is how to realize this function in nutch-0.8.x. I've checked the code of the protocol-http plugin, but I can't find the code where a page is stored as a document. I want to separate the RSS page into several pages before it is stored, so that it becomes not one document but several. Can anyone give me some hints? Any reply will be appreciated! The ITEM's structure:
<item>
  <title>Late-striking snowstorm in Europe delays flights and disrupts traffic (photos)</title>
  <description>A snowstorm swept across Europe, delaying many flights. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages, and workers cleared snow from the runways at Munich airport in southern Germany. The late-arriving snowstorm reportedly swept across central...</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>Sohu focus photo news</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>
-- www.babatu.com