Re: RSS-fecter and index individul-how can i realize this function

2009-01-05 Thread Doğacan Güney
On Mon, Jan 5, 2009 at 7:00 AM, Vlad Cananau vlad...@gmail.com wrote:
 Hello
 I'm trying to make RSSParser do something similar to FeedParser (which
 doesn't work quite right) - that is, instead of indexing the whole contents

Why doesn't FeedParser work? Let's fix whatever is broken in it :D

 of the feed, I want it to show individual items, with their respective title
 and a proper link to the article. I realize that I could index one depth
 more, but I'd like to index just the feed, not the articles that go with it
 (to keep the index small and the crawl fast).

 For each item in each RSS channel (the code does not differ much for
 getParse() of RSSParser.java) I do something like

  Outlink[] outlinks = new Outlink[1];
  try{
   outlinks[0] = new Outlink(whichLink, theRSSItem.getTitle());
  } catch (Exception e) {
   continue;
  }

  parseResult.put(
   whichLink,
   new ParseText(theRSSItem.getTitle() + theRSSItem.getDescription()),
   new ParseData(
 ParseStatus.STATUS_SUCCESS,
 theRSSItem.getTitle(),
 outlinks,
 new Metadata() //was content.getMetadata()
   )
  );

 The problem is, however, that only one item from the whole RSS gets into the
 index, although in the log I can see them all ( I've tried it with feeds
 from cnn and reuters). What happens? Why do they get overwritten in a
 seemingly random order? The item that makes it into the index is neither the
 first nor the last, but appears to be the same until new items appear in the
 feed.

 Thank you,
 Vlad





-- 
Doğacan Güney


Re: RSS-fecter and index individul-how can i realize this function

2009-01-05 Thread Vlad Cananau
On Mon, Jan 5, 2009 at 12:32 PM, Doğacan Güney doga...@gmail.com wrote:
 On Mon, Jan 5, 2009 at 7:00 AM, Vlad Cananau vlad...@gmail.com wrote:
 Hello
 I'm trying to make RSSParser do something similar to FeedParser (which
 doesn't work quite right) - that is, instead of indexing the whole contents

 Why doesn't FeedParser work? Let's fix whatever is broken in it :D

 of the feed, I want it to show individual items, with their respective title
 and a proper link to the article. I realize that I could index one depth
 more, but I'd like to index just the feed, not the articles that go with it
 (to keep the index small and the crawl fast).

 For each item in each RSS channel (the code does not differ much for
 getParse() of RSSParser.java) I do something like

  Outlink[] outlinks = new Outlink[1];
  try{
   outlinks[0] = new Outlink(whichLink, theRSSItem.getTitle());
  } catch (Exception e) {
   continue;
  }

  parseResult.put(
   whichLink,
   new ParseText(theRSSItem.getTitle() + theRSSItem.getDescription()),
   new ParseData(
 ParseStatus.STATUS_SUCCESS,
 theRSSItem.getTitle(),
 outlinks,
 new Metadata() //was content.getMetadata()
   )
  );

 The problem is, however, that only one item from the whole RSS gets into the
 index, although in the log I can see them all ( I've tried it with feeds
 from cnn and reuters). What happens? Why do they get overwritten in a
 seemingly random order? The item that makes it into the index is neither the
 first nor the last, but appears to be the same until new items appear in the
 feed.

 Thank you,
 Vlad





 --
 Doğacan Güney


When using FeedParser, not all of the feeds make it into the index.
For example, I crawl both Entertainment and Politics, but I get
results for only some of the articles.

Is there any way to check whether or not entries make it into the index?
I see, in the log, "Indexing http://rss.cnn.com/... with analyzer
org.apache.nutch.analyzer.NutchDocumentAnalyzer" (or something like that; I'm not
able to crawl right now, since I don't have access to the machine).
But when I look for keywords specific to some of the documents, I
don't get any results :-(


Re: RSS-fecter and index individul-how can i realize this function

2009-01-05 Thread Vlad Cananau

Doğacan Güney wrote:

On Mon, Jan 5, 2009 at 7:00 AM, Vlad Cananau vlad...@gmail.com wrote:
  

Hello
I'm trying to make RSSParser do something similar to FeedParser (which
doesn't work quite right) - that is, instead of indexing the whole contents



Why doesn't FeedParser work? Let's fix whatever is broken in it :D

  

of the feed, I want it to show individual items, with their respective title
and a proper link to the article. I realize that I could index one depth
more, but I'd like to index just the feed, not the articles that go with it
(to keep the index small and the crawl fast).

For each item in each RSS channel (the code does not differ much for
getParse() of RSSParser.java) I do something like

 Outlink[] outlinks = new Outlink[1];
 try{
  outlinks[0] = new Outlink(whichLink, theRSSItem.getTitle());
 } catch (Exception e) {
  continue;
 }

 parseResult.put(
  whichLink,
  new ParseText(theRSSItem.getTitle() + theRSSItem.getDescription()),
  new ParseData(
ParseStatus.STATUS_SUCCESS,
theRSSItem.getTitle(),
outlinks,
new Metadata() //was content.getMetadata()
  )
 );

The problem is, however, that only one item from the whole RSS gets into the
index, although in the log I can see them all ( I've tried it with feeds
from cnn and reuters). What happens? Why do they get overwritten in a
seemingly random order? The item that makes it into the index is neither the
first nor the last, but appears to be the same until new items appear in the
feed.

Thank you,
Vlad







  
To show you what I mean by "only one item gets into the index",
check out these results:
http://tinyurl.com/7hkkoo [a link to
http://vladk2k.homeip.net:8080 - my own server]


Re: RSS-fecter and index individul-how can i realize this function

2009-01-04 Thread Vlad Cananau
Hello
I'm trying to make RSSParser do something similar to FeedParser (which
doesn't work quite right) - that is, instead of indexing the whole contents
of the feed, I want it to show individual items, with their respective title
and a proper link to the article. I realize that I could index one depth
more, but I'd like to index just the feed, not the articles that go with it
(to keep the index small and the crawl fast).

For each item in each RSS channel (the code does not differ much for
getParse() of RSSParser.java) I do something like

 Outlink[] outlinks = new Outlink[1];
 try{
  outlinks[0] = new Outlink(whichLink, theRSSItem.getTitle());
 } catch (Exception e) {
  continue;
 }

 parseResult.put(
  whichLink,
  new ParseText(theRSSItem.getTitle() + theRSSItem.getDescription()),
  new ParseData(
ParseStatus.STATUS_SUCCESS,
theRSSItem.getTitle(),
outlinks,
new Metadata() //was content.getMetadata()
  )
 );

The problem is, however, that only one item from the whole RSS gets into the
index, although in the log I can see them all ( I've tried it with feeds
from cnn and reuters). What happens? Why do they get overwritten in a
seemingly random order? The item that makes it into the index is neither the
first nor the last, but appears to be the same until new items appear in the
feed.

Thank you,
Vlad


Re: RSS-fecter and index individul-how can i realize this function

2008-12-03 Thread mirkes

Where can I find Scott's solution? I am trying to do it exactly like Scott,
but I cannot imagine how to index items separately.
Please, can anybody help me?

Many thanks

Miro


sdeck wrote:
 
 So, here is what I do for RSS Feeds.
 
 I parse the RSS, and for each outlink, I create the outlink object and set
 inside the anchor text for each outlink a well-formed XML string. It
 contains the pub date, description, etc. Now, this is only because I was
 hacking the outlink to just use its anchor text, but you could always
 just create a new MetaData object for use with an outlink. So, the next
 time that URL is called up and you get an HTML parser, you
 could look at the outlink's metadata and say, hey, look, you came from an
 RSS feed. So, I can either just use your stored metadata and not parse the
 HTML, or I could combine your metadata with what comes from the HTML,
 etc.
 I have found that to be the best solution.
 
 Also, when I parse the RSS feed, I set a meta tag called "noindex", so in
 my basic indexer, if that is present, I do not include the RSS feed page
 in the Lucene index.
 
 Scott
 
 
 
 
 Doug Cutting wrote:
 
 Chris Mattmann wrote:
  Got it. So, the logic behind this is, why bother waiting until the
 following fetch to parse (and create ParseData objects from) the RSS
 items
 out of the feed. Okay, I get it, assuming that the RSS feed has *all* of
 the
 RSS metadata in it. However, it's perfectly acceptable to have feeds
 that
 simply have a title, description, and link in it.
 
 Almost.  The feed may have less than the referenced page, but it's also 
 a lot easier to parse, since the link could be an anchor within a large 
 page, or could be a page that has lots of navigation links, spam 
 comments, etc.  So feed entries are generally much more precise than the 
 pages they reference, and may make for a higher-quality search
 experience.
 
 I guess this is still
 valuable metadata information to have, however, the only caveat is that
 the
 implication of the proposed change is:
 
 1. We won't have cached copies, or fetched copies of the Content
 represented
 by the item links. Therefore, in this model, we won't be able to pull up
 a
 Nutch cache of the page corresponding to the RSS item, because we are
 circumventing the fetch step
 
 Good point.  We indeed wouldn't have these URLs in the cache.
 
 2. It sounds like a pretty fundamental API shift in Nutch, to support a
 single type of content, RSS. Even if there are more content types that
 follow this model, as Doug and Renaud both pointed out, there aren't a
 multitude of them (perhaps archive files, but can you think of any
 others)?
 
 Also true.  On the other hand, Nutch provides 98% of an RSS search 
 engine.  It'd be a shame to have to re-invent everything else and it 
 would be great if Nutch could evolve to support RSS well.
 
 Could image search also benefit from this?  One could generate a
 Parse for each image on a page whose text was from the page.  Product 
 search too, perhaps.
 
 The other main thing that comes to mind about this for me is it prevents
 the
 fetched Content for the RSS items from being able to provide useful
 metadata, in the sense that it doesn't explicitly fetch the content.
 What if
 we wanted to apply some super cool metadata extractor X that used
 word-stemming, HTML design analysis, and other techniques to extract
 metadata from the content pointed to by an RSS item link? In the
 proposed
 model, we assume that the RSS xml item tag already contains all
 necessary
 metadata for indexing, which in my mind, limits the model. Does what I
 am
 saying make sense? I'm not shooting down the issue, I'm just trying to
 brainstorm a bit here about the issue.
 
 Sure, the RSS feed may contain less than the page it references, but 
 that might be all that one wishes to index.  Otherwise, if, e.g., a blog 
   includes titles from other recent posts you're going to get lots of 
 false positives.  Ideally Nutch should support various options: 
 searching the feed only, searching the referenced page only, or perhaps 
 searching both.
 
 Doug
 
 
 
 

-- 
View this message in context: 
http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tp8722009p20815016.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.



RE: RSS-fecter and index individul-how can i realize this function

2007-02-08 Thread Alan Tanaman
 2. It sounds like a pretty fundamental API shift in Nutch, to support a
 single type of content, RSS. Even if there are more content types that
 follow this model, as Doug and Renaud both pointed out, there aren't a
 multitude of them (perhaps archive files, but can you think of any
others)?

 Also true.  On the other hand, Nutch provides 98% of an RSS search 
 engine.  It'd be a shame to have to re-invent everything else and it 
 would be great if Nutch could evolve to support RSS well.

 Could image search also benefit from this?  One could generate a
 Parse for each image on a page whose text was from the page.  Product 
 search too, perhaps.

Another application could be splitting certain enterprise documents up,
either based on passage-retrieval algorithms or simply based on the
table-of-contents entries.  For example, a long contract or user guide could be
split up into separate searchable documents.

Best regards,
Alan
_
Alan Tanaman
iDNA Solutions
http://blog.idna-solutions.com




Re: RSS-fecter and index individul-how can i realize this function

2007-02-08 Thread Chris Mattmann
Hi Doug,

  Okay, I see your points. It seems like this would be really useful for
some current folks, and for Nutch going forward. I see that there has been
some initial work today on preparing patches. I'd be happy to shepherd this
into the sources. I will begin reviewing what's required, and contacting the
folks who've begun work on this issue.

Thanks!

Cheers,
  Chris



On 2/7/07 1:31 PM, Doug Cutting [EMAIL PROTECTED] wrote:

 Chris Mattmann wrote:
  Got it. So, the logic behind this is, why bother waiting until the
 following fetch to parse (and create ParseData objects from) the RSS items
 out of the feed. Okay, I get it, assuming that the RSS feed has *all* of the
 RSS metadata in it. However, it's perfectly acceptable to have feeds that
 simply have a title, description, and link in it.
 
 Almost.  The feed may have less than the referenced page, but it's also
 a lot easier to parse, since the link could be an anchor within a large
 page, or could be a page that has lots of navigation links, spam
 comments, etc.  So feed entries are generally much more precise than the
 pages they reference, and may make for a higher-quality search experience.
 
 I guess this is still
 valuable metadata information to have, however, the only caveat is that the
 implication of the proposed change is:
 
 1. We won't have cached copies, or fetched copies of the Content represented
 by the item links. Therefore, in this model, we won't be able to pull up a
 Nutch cache of the page corresponding to the RSS item, because we are
 circumventing the fetch step
 
 Good point.  We indeed wouldn't have these URLs in the cache.
 
 2. It sounds like a pretty fundamental API shift in Nutch, to support a
 single type of content, RSS. Even if there are more content types that
 follow this model, as Doug and Renaud both pointed out, there aren't a
 multitude of them (perhaps archive files, but can you think of any others)?
 
 Also true.  On the other hand, Nutch provides 98% of an RSS search
 engine.  It'd be a shame to have to re-invent everything else and it
 would be great if Nutch could evolve to support RSS well.
 
 Could image search also benefit from this?  One could generate a
 Parse for each image on a page whose text was from the page.  Product
 search too, perhaps.
 
 The other main thing that comes to mind about this for me is it prevents the
 fetched Content for the RSS items from being able to provide useful
 metadata, in the sense that it doesn't explicitly fetch the content. What if
 we wanted to apply some super cool metadata extractor X that used
 word-stemming, HTML design analysis, and other techniques to extract
 metadata from the content pointed to by an RSS item link? In the proposed
 model, we assume that the RSS xml item tag already contains all necessary
 metadata for indexing, which in my mind, limits the model. Does what I am
 saying make sense? I'm not shooting down the issue, I'm just trying to
 brainstorm a bit here about the issue.
 
 Sure, the RSS feed may contain less than the page it references, but
 that might be all that one wishes to index.  Otherwise, if, e.g., a blog
   includes titles from other recent posts you're going to get lots of
 false positives.  Ideally Nutch should support various options:
 searching the feed only, searching the referenced page only, or perhaps
 searching both.
 
 Doug




FW: RSS-fecter and index individul-how can i realize this function

2007-02-08 Thread HUYLEBROECK Jeremy RD-ILAB-SSF

I am sending this message again as it apparently didn't go through.
(I am messing up with my email addresses on the mailing list...)

-Original Message-
Sent: Friday, February 02, 2007 10:29 AM

Using Nutch 0.8, we modified the code starting at the fetching/parsing steps
and everything that follows.
We have a different implementation of the Parse object and OutputFormat,
including an additional list of ParseData objects saved in an additional
subfolder in the DFS.
We changed the indexing step a lot too, so we don't use the Nutch code there.


-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Friday, February 02, 2007 10:19 AM
To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this function


Gal Nitzan wrote:
 IMHO the data that is needed i.e. the data that will be fetched in the next 
 fetch process is already available in the item element. Each item element 
 represents one web resource. And there is no reason to go to the server and 
 re-fetch that resource.

Perhaps ProtocolOutput should change.  The method:

   Content getContent();

could be deprecated and replaced with:

   Content[] getContents();

This would require changes to the indexing pipeline.  I can't think of
any severe complications, but I haven't looked closely.

Could something like that work?

Doug



Re: FW: RSS-fecter and index individul-how can i realize this function

2007-02-08 Thread Renaud Richardet

HUYLEBROECK Jeremy RD-ILAB-SSF wrote:

I am sending this message again as it apparently didn't go through.
(I am messing up with my email addresses on the mailing list...)


-Original Message-
Sent: Friday, February 02, 2007 10:29 AM

Using Nutch 0.8, we modified the code starting at the fetching/parsing steps
and everything that follows.
We have a different implementation of the Parse object and OutputFormat,
including an additional list of ParseData objects saved in an additional
subfolder in the DFS.
We changed the indexing step a lot too, so we don't use the Nutch code there.
  
Is your implementation similar to what we started at 
https://issues.apache.org/jira/browse/NUTCH-443? If you think some of 
your changes could be integrated, please post a patch there.


Thanks for sharing,
Renaud


-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Friday, February 02, 2007 10:19 AM
To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this function


Gal Nitzan wrote:
  

IMHO the data that is needed i.e. the data that will be fetched in the next fetch process 
is already available in the item element. Each item element represents one 
web resource. And there is no reason to go to the server and re-fetch that resource.



Perhaps ProtocolOutput should change.  The method:

   Content getContent();

could be deprecated and replaced with:

   Content[] getContents();

This would require changes to the indexing pipeline.  I can't think of
any severe complications, but I haven't looked closely.

Could something like that work?

Doug


  



--
Renaud Richardet  +1 617 230 9112
my email is my first name at apache.org  http://www.oslutions.com



Re: RSS-fecter and index individul-how can i realize this function

2007-02-08 Thread sdeck

So, here is what I do for RSS Feeds.

I parse the RSS, and for each outlink, I create the outlink object and set
inside the anchor text for each outlink a well-formed XML string. It
contains the pub date, description, etc. Now, this is only because I was
hacking the outlink to just use its anchor text, but you could always just
create a new MetaData object for use with an outlink. So, the next time
that URL is called up and you get an HTML parser, you could look
at the outlink's metadata and say, hey, look, you came from an RSS feed. So, I
can either just use your stored metadata and not parse the HTML, or I could
combine your metadata with what comes from the HTML, etc.
I have found that to be the best solution.

Also, when I parse the RSS feed, I set a meta tag called "noindex", so in my
basic indexer, if that is present, I do not include the RSS feed page in
the Lucene index.

Scott
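
For illustration, a rough sketch of the "well-formed XML in the anchor text" trick described above: pack the per-item feed fields into a small XML string so they can travel with the outlink and be recovered when the item's URL is processed later. This is not stock Nutch code; the class, method, and field names are made up for the example, and the "noindex" part of the approach is just a custom metadata key that a basic indexer can check before adding the raw feed page to the index.

public class FeedItemAnchor {

  // Build a small, well-formed XML payload from the RSS item fields so it can
  // be carried in the outlink's anchor text (or in an outlink metadata object).
  public static String toXml(String title, String link, String pubDate, String description) {
    StringBuilder sb = new StringBuilder();
    sb.append("<feedItem>");
    sb.append("<title>").append(escape(title)).append("</title>");
    sb.append("<link>").append(escape(link)).append("</link>");
    sb.append("<pubDate>").append(escape(pubDate)).append("</pubDate>");
    sb.append("<description>").append(escape(description)).append("</description>");
    sb.append("</feedItem>");
    return sb.toString();
  }

  // Minimal escaping so the payload stays well-formed XML.
  private static String escape(String s) {
    if (s == null) return "";
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
  }
}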




Doug Cutting wrote:
 
 Chris Mattmann wrote:
  Got it. So, the logic behind this is, why bother waiting until the
 following fetch to parse (and create ParseData objects from) the RSS
 items
 out of the feed. Okay, I get it, assuming that the RSS feed has *all* of
 the
 RSS metadata in it. However, it's perfectly acceptable to have feeds that
 simply have a title, description, and link in it.
 
 Almost.  The feed may have less than the referenced page, but it's also 
 a lot easier to parse, since the link could be an anchor within a large 
 page, or could be a page that has lots of navigation links, spam 
 comments, etc.  So feed entries are generally much more precise than the 
 pages they reference, and may make for a higher-quality search experience.
 
 I guess this is still
 valuable metadata information to have, however, the only caveat is that
 the
 implication of the proposed change is:
 
 1. We won't have cached copies, or fetched copies of the Content
 represented
 by the item links. Therefore, in this model, we won't be able to pull up
 a
 Nutch cache of the page corresponding to the RSS item, because we are
 circumventing the fetch step
 
 Good point.  We indeed wouldn't have these URLs in the cache.
 
 2. It sounds like a pretty fundamental API shift in Nutch, to support a
 single type of content, RSS. Even if there are more content types that
 follow this model, as Doug and Renaud both pointed out, there aren't a
 multitude of them (perhaps archive files, but can you think of any
 others)?
 
 Also true.  On the other hand, Nutch provides 98% of an RSS search 
 engine.  It'd be a shame to have to re-invent everything else and it 
 would be great if Nutch could evolve to support RSS well.
 
 Could image search also benefit from this?  One could generate a
 Parse for each image on a page whose text was from the page.  Product 
 search too, perhaps.
 
 The other main thing that comes to mind about this for me is it prevents
 the
 fetched Content for the RSS items from being able to provide useful
 metadata, in the sense that it doesn't explicitly fetch the content. What
 if
 we wanted to apply some super cool metadata extractor X that used
 word-stemming, HTML design analysis, and other techniques to extract
 metadata from the content pointed to by an RSS item link? In the proposed
 model, we assume that the RSS xml item tag already contains all necessary
 metadata for indexing, which in my mind, limits the model. Does what I am
 saying make sense? I'm not shooting down the issue, I'm just trying to
 brainstorm a bit here about the issue.
 
 Sure, the RSS feed may contain less than the page it references, but 
 that might be all that one wishes to index.  Otherwise, if, e.g., a blog 
   includes titles from other recent posts you're going to get lots of 
 false positives.  Ideally Nutch should support various options: 
 searching the feed only, searching the referenced page only, or perhaps 
 searching both.
 
 Doug
 
 

-- 
View this message in context: 
http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html#a8876127
Sent from the Nutch - Dev mailing list archive at Nabble.com.



Re: RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Doug Cutting

Renaud Richardet wrote:
I see. I was thinking that I could index the feed items without having 
to fetch them individually.


Okay, so if Parser#parse returned a Map<String, Parse>, then the URL for
each parse should be that of its link, since you don't want to fetch 
that separately.  Right?


So now the question is, how much impact would this change to the Parser 
API have on the rest of Nutch?  It would require changes to all Parser 
implementations, to ParseSegment, to ParseUtil, and to Fetcher.  But,
as far as I can tell, most of these changes look straightforward.


Doug


Re: RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Chris Mattmann
Guys,

 Sorry to be so thick-headed, but could someone explain to me in really
simple language what this change is requesting that is different from the
current Nutch API? I still don't get it, sorry...

Cheers,
  Chris



On 2/7/07 9:58 AM, Doug Cutting [EMAIL PROTECTED] wrote:

 Renaud Richardet wrote:
 I see. I was thinking that I could index the feed items without having
 to fetch them individually.
 
 Okay, so if Parser#parse returned a Map<String, Parse>, then the URL for
 each parse should be that of its link, since you don't want to fetch
 that separately.  Right?
 
 So now the question is, how much impact would this change to the Parser
 API have on the rest of Nutch?  It would require changes to all Parser
 implementations, to ParseSegment, to ParseUtil, and to Fetcher.  But,
 as far as I can tell, most of these changes look straightforward.
 
 Doug

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop: 171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Renaud Richardet

Doug Cutting wrote:

Renaud Richardet wrote:
I see. I was thinking that I could index the feed items without 
having to fetch them individually.


Okay, so if Parser#parse returned a Map<String, Parse>, then the URL
for each parse should be that of its link, since you don't want to 
fetch that separately. Right?

Exactly.


So now the question is, how much impact would this change to the 
Parser API have on the rest of Nutch? It would require changes to all 
Parser implementations, to ParseSegment, to ParseUtil, and to
Fetcher. But, as far as I can tell, most of these changes look 
straightforward.
I think so, too. I have opened an issue in JIRA 
(https://issues.apache.org/jira/browse/NUTCH-443) and will give it a try.

Doğacan, have you started working on it yet?

Thanks,
Renaud



Re: RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Doug Cutting

Chris Mattmann wrote:

 Sorry to be so thick-headed, but could someone explain to me in really
simple language what this change is requesting that is different from the
current Nutch API? I still don't get it, sorry...


A Content would no longer generate a single Parse.  Instead, a Content 
could potentially generate many Parses.  For most types of content, 
e.g., HTML, each Content would still generate a single Parse.  But for 
RSS, a Content might generate multiple Parses, each indexed separately 
and each with a distinct URL.


Another potential application could be processing archives: the parser
could unpack the archive and index each item in it separately rather
than indexing the archive as a whole.  This only makes sense if each
item has a distinct URL, which it does in RSS, but it might not in an 
archive.  However some archive file formats do contain URLs, like that 
used by the Internet Archive.


http://www.archive.org/web/researcher/ArcFileFormat.php

Does that help?

Doug


Re: RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Sami Siren
 Also true.  On the other hand, Nutch provides 98% of an RSS search
 engine.  It'd be a shame to have to re-invent everything else and it
 would be great if Nutch could evolve to support RSS well.
 
 Could image search also benefit from this?  One could generate a
 Parse for each image on a page whose text was from the page.  Product
 search too, perhaps.

These are excellent points. I am totally +1 for the API change; it opens
the door to a lot of new possible applications.

--
 Sami Siren


Re: RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Doğacan Güney
Renaud Richardet wrote:
 Doug Cutting wrote:
 Renaud Richardet wrote:
 I see. I was thinking that I could index the feed items without
 having to fetch them individually.

 Okay, so if Parser#parse returned a Map<String, Parse>, then the URL
 for each parse should be that of its link, since you don't want to
 fetch that separately. Right?
 Exactly.

 So now the question is, how much impact would this change to the
 Parser API have on the rest of Nutch? It would require changes to all
 Parser implementations, to ParseSegment, to ParseUtil, and to
 Fetcher. But, as far as I can tell, most of these changes look
 straightforward.
 I think so, too. I have opened an issue in JIRA
 (https://issues.apache.org/jira/browse/NUTCH-443) and will give it a try.
 Doğacan, have you started working on it yet?

I have just started working on it. I hope I will have something (at
least a patch for
everything but plugins) within the day.

--
Doğacan Güney


 Thanks,
 Renaud







Re: RSS-fecter and index individul-how can i realize this function

2007-02-06 Thread Doğacan Güney
Hi,

Doug Cutting wrote:
 Doğacan Güney wrote:
 I think it would make much more sense to change parse plugins to take
 content and return Parse[] instead of Parse.

 You're right.  That does make more sense.

OK, then should I go forward with this and implement something?   This
should be pretty easy,
though I am not sure what to give as keys to a Parse[].

I mean, when getParse returned a single Parse, ParseSegment output them
as <url, Parse>. But, if getParse
returns an array, what will be the key for each element?

Something like <url#i, Parse[i]> may work, but this may cause problems
in dedup (for example,
assume we fetched the same RSS feed twice and indexed them in different
indexes. The two versions' url#0 may be
different items, but since they have the same key, dedup will delete the
older one).

--
Doğacan Güney


 Doug






Re: RSS-fecter and index individul-how can i realize this function

2007-02-06 Thread Gal Nitzan
Hi,

IMO it should stay the same.

URL as the key, and in the filter each item's link element becomes the key.

I will be happy to convert the current parse-rss filter to the suggested
implementation.

Gal.

-- Original Message --
Received: Tue, 06 Feb 2007 10:36:03 AM IST
From: Doğacan Güney [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this function

 Hi,
 
 Doug Cutting wrote:
  Doğacan Güney wrote:
  I think it would make much more sense to change parse plugins to take
  content and return Parse[] instead of Parse.
 
  You're right.  That does make more sense.
 
 OK, then should I go forward with this and implement something?   This
 should be pretty easy,
 though I am not sure what to give as keys to a Parse[].
 
 I mean, when getParse returned a single Parse, ParseSegment output them
 as <url, Parse>. But, if getParse
 returns an array, what will be the key for each element?
 
 Something like <url#i, Parse[i]> may work, but this may cause problems
 in dedup (for example,
 assume we fetched the same RSS feed twice and indexed them in different
 indexes. The two versions' url#0 may be
 different items, but since they have the same key, dedup will delete the
 older one).
 
 --
 Doğacan Güney
 
 
  Doug
 
 
 
 
 





Re: RSS-fecter and index individul-how can i realize this function

2007-02-06 Thread Doug Cutting

Doğacan Güney wrote:

OK, then should I go forward with this and implement something?   This
should be pretty easy,
though I am not sure what to give as keys to a Parse[].

I mean, when getParse returned a single Parse, ParseSegment output them
as <url, Parse>. But, if getParse
returns an array, what will be the key for each element?


Perhaps Parser#parse could return a Map<String, Parse>, where the keys
are URLs?
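
As a rough illustration only (this is not the actual NUTCH-443 patch; ItemParse and FeedItem below are stand-ins for whatever Parse and feed-item classes the real API would use), a feed parser filling such a map could look like this, with one entry per item keyed by the item's link:

import java.util.LinkedHashMap;
import java.util.Map;

public class FeedParseSketch {

  // Stand-in for a Parse: just the title and the text to index.
  public static class ItemParse {
    public final String title;
    public final String text;
    public ItemParse(String title, String text) { this.title = title; this.text = text; }
  }

  // Stand-in for a parsed RSS item.
  public static class FeedItem {
    public String link, title, description;
  }

  // One map entry per item; each entry would become a separately indexed document.
  public static Map<String, ItemParse> parseFeed(Iterable<FeedItem> items) {
    Map<String, ItemParse> parses = new LinkedHashMap<String, ItemParse>();
    for (FeedItem item : items) {
      parses.put(item.link, new ItemParse(item.title, item.title + " " + item.description));
    }
    return parses;
  }
}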



Something like <url#i, Parse[i]> may work, but this may cause problems
in dedup (for example,
assume we fetched the same RSS feed twice and indexed them in different
indexes. The two versions' url#0 may be
different items, but since they have the same key, dedup will delete the
older one).


If the feed contains unique ids for items, then that can be used to 
qualify the URL.  Otherwise one could use the hash of the link of the item.
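
A minimal sketch of that keying scheme in plain JDK code (illustrative only, not part of Nutch): prefer the feed-supplied guid, otherwise fall back to an MD5 of the item link, and append the qualifier to the feed URL so the key stays stable across re-fetches of the same feed.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FeedItemKeys {

  // Prefer the feed-supplied unique id; otherwise hash the item link so the key
  // does not depend on the item's position in the feed.
  public static String keyFor(String feedUrl, String itemGuid, String itemLink) {
    String qualifier = (itemGuid != null && itemGuid.length() > 0) ? itemGuid : md5Hex(itemLink);
    return feedUrl + "#" + qualifier;
  }

  private static String md5Hex(String s) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
      return String.format("%032x", new BigInteger(1, digest)); // zero-padded hex
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e); // should not happen on a standard JVM
    }
  }
}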


Since the target of the link must still be indexed separately from the 
item itself, how much use is all this?  If the RSS document is 
considered a single page that changes frequently, and item's links are 
considered ordinary outlinks, isn't much the same effect achieved?


Doug


Re: RSS-fecter and index individul-how can i realize this function

2007-02-06 Thread Chris Mattmann
Hi Doug,

 Since the target of the link must still be indexed separately from the
 item itself, how much use is all this?  If the RSS document is
 considered a single page that changes frequently, and item's links are
 considered ordinary outlinks, isn't much the same effect achieved?

IMHO, yes. That's why it's been hard for me to understand the real use case
for what Gal et al. are talking about. I've been trying to wrap my head
around it, but it seems to me the capability they require is sort of already
provided...

Cheers,
  Chris

 
 Doug

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop: 171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: RSS-fecter and index individul-how can i realize this function

2007-02-06 Thread Renaud Richardet

Hi Chris, Doug,

Chris Mattmann wrote:

Hi Doug,

  

Since the target of the link must still be indexed separately from the
item itself, how much use is all this?  If the RSS document is
considered a single page that changes frequently, and item's links are
considered ordinary outlinks, isn't much the same effect achieved?



IMHO, yes. That's why it's been hard for me to understand the real use case
for what Gal et al. are talking about. I've been trying to wrap my head
around it, but it seems to me the capability they require is sort of already
provided...
  
Not sure I understand: an RSS feed is a collection of feed entries, and
each feed entry would be indexed as a separate document (each feed entry
has a URL or UUID as a unique identifier).
What happens with the RSS feed itself? Is it indexed, or considered as a
container that just needs to be fetched and fetched again for new entries?


The use case is that you index RSS feeds, but your users can search each
feed entry as a single document. Does it make sense?


Thanks,
Renaud


Cheers,
  Chris

  

Doug



__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop: 171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



  



--
Renaud Richardet  +1 617 230 9112
my email is my first name at apache.org  http://www.oslutions.com



Re: RSS-fecter and index individul-how can i realize this function

2007-02-06 Thread Doug Cutting

Renaud Richardet wrote:
The use case is that you index RSS feeds, but your users can search each
feed entry as a single document. Does it make sense?


But each feed item also contains a link whose content will be indexed 
and that's generally a superset of the item.  So should there be two 
urls indexed per item?  In many cases, the best thing to do is to index 
only the linked page, not the feed item at all.  In some (rare?) cases, 
there might be items without a link, whose only content is directly in 
the feed, or where the content in the feed is complementary to that in 
the linked page.  In these cases it might be useful to combine the two 
(the feed item and the linked content), indexing both.  The proposed 
change might permit that.  Is that the case you're concerned about?


Doug


Re: RSS-fecter and index individul-how can i realize this function

2007-02-06 Thread Renaud Richardet

Doug Cutting wrote:

Renaud Richardet wrote:
The use case is that you index RSS feeds, but your users can search
each feed entry as a single document. Does it make sense?


But each feed item also contains a link whose content will be indexed 
and that's generally a superset of the item.  

Agreed
So should there be two urls indexed per item?  

I don't think so
In many cases, the best thing to do is to index only the linked page, 
not the feed item at all.  In some (rare?) cases, there might be items 
without a link, whose only content is directly in the feed, or where 
the content in the feed is complementary to that in the linked page.  
In these cases it might be useful to combine the two (the feed item 
and the linked content), indexing both.  The proposed change might 
permit that.  Is that the case you're concerned about?
I see. I was thinking that I could index the feed items without having 
to fetch them individually.


More fundamentally, I want to index only the blog-entry text, and not 
the elements around it (header, menus, ads, ...), so as to improve the 
search results.


Here's my case; the proposed changes would allow me to do the steps marked with (*):

1) parse feeds:

for each (feedentry : feed) do
|
|  if (full-text entries) then
|  |  index each feed entry as a single document; blog header, menus are not indexed. *
|  else
|  |  create a special outlink for each feed entry, which includes metadata (content, time, etc.)
|  endif
|
done

2) on a next fetch loop:

for each (link) do
|
|  if (this is a normal link)
|  |  fetch it and index it normally
|  else if (this link comes from an already indexed feed entry) then
|  |  end, do not fetch it *
|  else if (this is a special outlink)
|  |  guess which DOM nodes hold the post content
|  |  index it; blog header, menus are not indexed.
|  endif
|
done


Thanks,
Renaud


Re: RSS-fecter and index individul-how can i realize this function

2007-02-06 Thread Doğacan Güney
Renaud Richardet wrote:
 Doug Cutting wrote:
 Renaud Richardet wrote:
 The use case is that you index RSS feeds, but your users can search
 each feed entry as a single document. Does it make sense?

 But each feed item also contains a link whose content will be indexed
 and that's generally a superset of the item.  
 Agreed
 So should there be two urls indexed per item?  
 I don't think so
 In many cases, the best thing to do is to index only the linked page,
 not the feed item at all.  In some (rare?) cases, there might be
 items without a link, whose only content is directly in the feed, or
 where the content in the feed is complementary to that in the linked
 page.  In these cases it might be useful to combine the two (the feed
 item and the linked content), indexing both.  The proposed change
 might permit that.  Is that the case you're concerned about?
 I see. I was thinking that I could index the feed items without having
 to fetch them individually.

 More fundamentally, I want to index only the blog-entry text, and not
 the elements around it (header, menus, ads, ...), so as to improve the
 search results.

 Here's my case, the proposed changes would allow me to do (*)

 1) parse feeds:

 for each (feedentry : feed) do
 |
 |  if (full-text entries) then
 |   |  index each feed entry as a single document; blog header, menus
 are not indexed. *
 |  else
 |   |  create a special outlink for each feed entry, which include
 metadata (content, time, etc)
 |  endif
 |
 done

 2) on a next fetch loop:

 for each (link) do
 |
 |  if (this is a normal link)
 ||  fetch it and index it normally
 |  else if (this link come from an already indexed feed entry) then
 ||  end, do not fetch it *
 |  else if (this is a special outlink)
 ||  guess which DOM nodes hold the post content
 ||  index it; blog header, menus are not indexed.
 |  endif
 |
 done

I agree with Renaud Richardet.

Also, I think it all boils down to speed. If you are building a blog
search engine, you want
it to update feeds as fast as it can. Doing 2 depths (one for the RSS feed,
one for the outlinks) will slow it down.

Besides that, many blog crawlers (like
http://help.yahoo.com/help/us/ysearch/crawling/crawling-02.html) set
crawl-delay to 1, and so I guess most of the web servers are OK with
that for RSS feeds, but not necessarily
OK with it for HTML pages. (So you will do depth 1 (RSS feeds) very
fast (with a 1-second delay), and then get the
items with a 5-second delay.)

(I hope it is not stupid to point out Yahoo's crawler to someone who
works at Yahoo :)

--
Doğacan Güney


 Thanks,
 Renaud






Re: RSS-fecter and index individul-how can i realize this function

2007-02-05 Thread Doğacan Güney

Doug Cutting wrote:

Gal Nitzan wrote:
IMHO the data that is needed i.e. the data that will be fetched in 
the next fetch process is already available in the item element. 
Each item element represents one web resource. And there is no 
reason to go to the server and re-fetch that resource.


Perhaps ProtocolOutput should change.  The method:

  Content getContent();

could be deprecated and replaced with:

  Content[] getContents();

This would require changes to the indexing pipeline.  I can't think of 
any severe complications, but I haven't looked closely.


Since getProtocolOutput is called by Fetcher, the fetcher (actually, the
underlying protocol plugin) needs to be aware that we are actually
fetching an RSS feed and partially parse it to return an array of Contents.


I think it would make much more sense to change parse plugins to take 
content and return Parse[] instead of Parse.


--
Doğacan Güney


Could something like that work?

Doug







Re: RSS-fecter and index individul-how can i realize this function

2007-02-05 Thread Doug Cutting

Doğacan Güney wrote:
I think it would make much more sense to change parse plugins to take 
content and return Parse[] instead of Parse.


You're right.  That does make more sense.

Doug


Re: RSS-fecter and index individul-how can i realize this function

2007-02-04 Thread kauu

I've changed the code like you said, but I get an exception like this.
Why? Why is it the MD5Signature class's exception?


2007-02-05 11:28:38,453 WARN  feedparser.FeedFilter (
FeedFilter.java:doDecodeEntities(223)) - Filter encountered unknown entities
2007-02-05 11:28:39,390 INFO  crawl.SignatureFactory (
SignatureFactory.java:getSignature(45)) - Using Signature impl:
org.apache.nutch.crawl.MD5Signature
2007-02-05 11:28:40,078 WARN  mapred.LocalJobRunner
(LocalJobRunner.java:run(120))
- job_f6j55m
java.lang.NullPointerException
   at org.apache.nutch.parse.ParseOutputFormat$1.write(
ParseOutputFormat.java:121)
   at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(
FetcherOutputFormat.java:87)
   at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:235)
   at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(
IdentityReducer.java:39)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:247)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java
:112)


On 2/3/07, Renaud Richardet [EMAIL PROTECTED] wrote:


Gal, Chris, Kauu,

So, if I understand correctly, you need a way to pass information along
the fetches, so that when Nutch fetches a feed entry, its item value
previously fetched is available.

This is how I tackled the issue:
- extend Outlink.java and allow creating outlinks with more metadata.
So, in your feed parser, use this way to create outlinks
- pass on the metadata through ParseOutputFormat.java and Fetcher.java
- retrieve the metadata in HtmlParser.java and use it

This is very tedious, will blow the size of your outlinks db, makes
changes in the core code of Nutch, etc... But this is the only way I
came up with...
If someone sees a better way, please let me know :-)

Sample code, for Nutch 0.8.x :

Outlink.java
+  public Outlink(String toUrl, String anchor, String entryContents,
+      Configuration conf) throws MalformedURLException {
+    this.toUrl = new UrlNormalizerFactory(conf).getNormalizer().normalize(toUrl);
+    this.anchor = anchor;
+    this.entryContents = entryContents;
+  }
and update the other methods

ParseOutputFormat.java, around line 140
+    // set outlink info in metadata
+    String entryContents = links[i].getEntryContents();
+
+    if (entryContents.length() > 0) { // it's a feed entry
+        MapWritable meta = new MapWritable();
+        meta.put(new UTF8("entryContents"), new UTF8(entryContents)); // key/value
+        target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
+        target.setMetaData(meta);
+    } else {
+        target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval); // no meta
+    }

Fetcher.java, around l. 266
+    // add feed info to metadata
+    try {
+        String entryContents = datum.getMetaData().get(new UTF8("entryContents")).toString();
+        metadata.set("entryContents", entryContents);
+    } catch (Exception e) { } // not found

HtmlParser.java
// get entry metadata
String entryContents = content.getMetadata().get("entryContents");

HTH,
Renaud



Gal Nitzan wrote:
 Hi Chris,

 I'm sorry I wasn't clear enough. What I mean is that in the current
implementation:

 1. The RSS (channels, items) page ends up as one Lucene document in the
index.
 2. Indeed the links are extracted and each item link will be fetched
in the next fetch as a separate page and will end up as one Lucene document.

 IMHO the data that is needed i.e. the data that will be fetched in the
next fetch process is already available in the item element. Each item
element represents one web resource. And there is no reason to go to the
server and re-fetch that resource.

 Another issue that arises from rss feeds is that once the feed page is
fetched you can not re-fetch it until its time to fetch expired. The feeds
TTL is usually very short. Since for now in Nutch, all pages created equal
:) it is one more thing to think about.

 HTH,

 Gal.

 -Original Message-
 From: Chris Mattmann [mailto:[EMAIL PROTECTED]
 Sent: Thursday, February 01, 2007 7:01 PM
 To: nutch-dev@lucene.apache.org
 Subject: Re: RSS-fecter and index individul-how can i realize this
function

 Hi Gal, et al.,

   I'd like to be explicit when we talk about what the issue with the RSS
 parsing plugin is here; I think we have had conversations similar to
this
 before and it seems that we keep talking around each other. I'd like to
get
 to the heart of this matter so that the issue (if there is an actual
one)
 gets addressed ;)

   Okay, so you mention below that the thing that you see missing from
the
 current RSS parsing plugin is the ability to store data in the
CrawlDatum,
 and parse it in the next fetch phase. Well, there are 2 options here
for
 what you refer to as it:

  1. If you're talking about the RSS file, then in fact, it is parsed,
and
 its data is stored in the CrawlDatum, akin to any other form of content
that
 is fetched, parsed

Re: RSS-fecter and index individul-how can i realize this function

2007-02-02 Thread Renaud Richardet

Gal, Chris, Kauu,

So, if I understand correctly, you need a way to pass information along 
the fetches, so that when Nutch fetches a feed entry, its item value 
previously fetched is available.


This is how I tackled the issue:
- extend Outlink.java and allow creating outlinks with more metadata.
So, in your feed parser, use this way to create outlinks

- pass on the metadata through ParseOutputFormat.java and Fetcher.java
- retrieve the metadata in HtmlParser.java and use it

This is very tedious, will blow the size of your outlinks db, makes 
changes in the core code of Nutch, etc... But this is the only way I 
came up with...

If someone sees a better way, please let me know :-)

Sample code, for Nutch 0.8.x :

Outlink.java
+  public Outlink(String toUrl, String anchor, String entryContents,
+      Configuration conf) throws MalformedURLException {
+    this.toUrl = new UrlNormalizerFactory(conf).getNormalizer().normalize(toUrl);
+    this.anchor = anchor;
+    this.entryContents = entryContents;
+  }
and update the other methods

ParseOutputFormat.java, around line 140
+    // set outlink info in metadata
+    String entryContents = links[i].getEntryContents();
+
+    if (entryContents.length() > 0) { // it's a feed entry
+        MapWritable meta = new MapWritable();
+        meta.put(new UTF8("entryContents"), new UTF8(entryContents)); // key/value
+        target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
+        target.setMetaData(meta);
+    } else {
+        target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval); // no meta
+    }

Fetcher.java, around l. 266
+    // add feed info to metadata
+    try {
+        String entryContents = datum.getMetaData().get(new UTF8("entryContents")).toString();
+        metadata.set("entryContents", entryContents);
+    } catch (Exception e) { } // not found

HtmlParser.java
// get entry metadata
String entryContents = content.getMetadata().get("entryContents");

HTH,
Renaud



Gal Nitzan wrote:

Hi Chris,

I'm sorry I wasn't clear enough. What I mean is that in the current 
implementation:

1. The RSS (channels, items) page ends up as one Lucene document in the index.
2. Indeed the links are extracted and each item link will be fetched in the 
next fetch as a separate page and will end up as one Lucene document.

IMHO the data that is needed i.e. the data that will be fetched in the next fetch process 
is already available in the item element. Each item element represents one 
web resource. And there is no reason to go to the server and re-fetch that resource.

Another issue that arises from rss feeds is that once the feed page is fetched you can 
not re-fetch it until its time to fetch expired. The feeds TTL is usually 
very short. Since for now in Nutch, all pages created equal :) it is one more thing to 
think about.

HTH,

Gal.

-Original Message-
From: Chris Mattmann [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 01, 2007 7:01 PM

To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this function

Hi Gal, et al.,

  I'd like to be explicit when we talk about what the issue with the RSS
parsing plugin is here; I think we have had conversations similar to this
before and it seems that we keep talking around each other. I'd like to get
to the heart of this matter so that the issue (if there is an actual one)
gets addressed ;)

  Okay, so you mention below that the thing that you see missing from the
current RSS parsing plugin is the ability to store data in the CrawlDatum,
and parse it in the next fetch phase. Well, there are 2 options here for
what you refer to as it:

 1. If you're talking about the RSS file, then in fact, it is parsed, and
its data is stored in the CrawlDatum, akin to any other form of content that
is fetched, parsed and indexed.

 2. If you're talking about the item links within the RSS file, in fact,
they are parsed (eventually), and their data stored in the CrawlDatum, akin
to any other form of content that is fetched, parsed, and indexed. This is
accomplished by adding the RSS items as Outlinks when the RSS file is
parsed: in this fashion, we go after all of the links in the RSS file, and
make sure that we index their content as well.

Thus, if you had an RSS file R that contained links in it to a PDF file A,
and another HTML page P, then not only would R get fetched, parsed, and
indexed, but so would A and P, because they are item links within R. Then
queries that would match R (the physical RSS file), would additionally match
things such as P and A, and all 3 would be capable of being returned in a
Nutch query. Does this make sense? Is this the issue that you're talking
about? Am I nuts? ;)

Cheers,
  Chris




On 1/31/07 10:40 PM, Gal Nitzan [EMAIL PROTECTED] wrote:

  

Hi,

Many sites provide RSS feeds for several reasons, usually to save bandwidth,
to give

Re: RSS-fecter and index individul-how can i realize this function

2007-02-02 Thread Doug Cutting

Gal Nitzan wrote:

IMHO the data that is needed i.e. the data that will be fetched in the next fetch process 
is already available in the item element. Each item element represents one 
web resource. And there is no reason to go to the server and re-fetch that resource.


Perhaps ProtocolOutput should change.  The method:

  Content getContent();

could be deprecated and replaced with:

  Content[] getContents();

This would require changes to the indexing pipeline.  I can't think of 
any severe complications, but I haven't looked closely.


Could something like that work?

Doug
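
A rough sketch of the shape of that change, purely for illustration (the nested Content class below is a stand-in, not the real org.apache.nutch.protocol.Content, and the deprecation/delegation details would be up to whoever implements it): the old single-Content accessor is kept and delegates to the new array-returning one, so existing callers keep working while a protocol plugin that pre-splits an RSS feed can hand back one Content per item.

public class ProtocolOutputSketch {

  private final Content[] contents;

  public ProtocolOutputSketch(Content[] contents) {
    this.contents = contents;
  }

  /** @deprecated use {@link #getContents()} */
  @Deprecated
  public Content getContent() {
    return contents.length > 0 ? contents[0] : null;
  }

  public Content[] getContents() {
    return contents;
  }

  // Stand-in for the fetched content; fields trimmed for the sketch.
  public static class Content {
    public final String url;
    public final byte[] data;
    public Content(String url, byte[] data) { this.url = url; this.data = data; }
  }
}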


Re: RSS-fecter and index individul-how can i realize this function

2007-02-01 Thread Chris Mattmann
Hi Gal, et al.,

  I'd like to be explicit when we talk about what the issue with the RSS
parsing plugin is here; I think we have had conversations similar to this
before and it seems that we keep talking around each other. I'd like to get
to the heart of this matter so that the issue (if there is an actual one)
gets addressed ;)

  Okay, so you mention below that the thing that you see missing from the
current RSS parsing plugin is the ability to store data in the CrawlDatum,
and parse it in the next fetch phase. Well, there are 2 options here for
what you refer to as "it":

 1. If you're talking about the RSS file, then in fact, it is parsed, and
its data is stored in the CrawlDatum, akin to any other form of content that
is fetched, parsed and indexed.

 2. If you're talking about the item links within the RSS file, in fact,
they are parsed (eventually), and their data stored in the CrawlDatum, akin
to any other form of content that is fetched, parsed, and indexed. This is
accomplished by adding the RSS items as Outlinks when the RSS file is
parsed: in this fashion, we go after all of the links in the RSS file, and
make sure that we index their content as well.

Thus, if you had an RSS file R that contained links in it to a PDF file A,
and another HTML page P, then not only would R get fetched, parsed, and
indexed, but so would A and P, because they are item links within R. Then
queries that would match R (the physical RSS file), would additionally match
things such as P and A, and all 3 would be capable of being returned in a
Nutch query. Does this make sense? Is this the issue that you're talking
about? Am I nuts? ;)

Cheers,
  Chris




On 1/31/07 10:40 PM, Gal Nitzan [EMAIL PROTECTED] wrote:

 Hi,
 
 Many sites provide RSS feeds for several reasons, usually to save bandwidth,
 to give the users concentrated data and so forth.
 
 Some of the RSS files supplied by sites are created specially for search
 engines where each RSS item represent a web page in the site.
 
 IMHO the only thing missing in the parse-rss plugin is storing the data in
 the CrawlDatum and parsing it in the next fetch phase. Maybe adding a new
 flag to CrawlDatum, that would flag the URL as parsable not fetchable?
 
 Just my two cents...
 
 Gal.
 
 -Original Message-
 From: Chris Mattmann [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, January 31, 2007 8:44 AM
 To: nutch-dev@lucene.apache.org
 Subject: Re: RSS-fecter and index individul-how can i realize this function
 
 Hi there,
 
   With the explanation that you give below, it seems like parse-rss as it
 exists would address what you are trying to do. parse-rss parses an RSS
 channel as a set of items, and indexes overall metadata about the RSS file,
 including parse text, and index data, but it also adds each item (in the
 channel)'s URL as an Outlink, so that Nutch will process those pieces of
 content as well. The only thing that you suggest below that parse-rss
 currently doesn't do, is to allow you to associate the metadata fields
 category:, and author: with the item Outlink...
 
 Cheers,
   Chris
 
 
 

RE: RSS-fecter and index individul-how can i realize this function

2007-02-01 Thread Gal Nitzan

Hi Chris,

I'm sorry I wasn't clear enough. What I mean is that in the current 
implementation:

1. The RSS (channels, items) page ends up as one Lucene document in the index.
2. Indeed the links are extracted and each item link will be fetched in the 
next fetch as a separate page and will end up as one Lucene document.

IMHO the data that is needed, i.e. the data that will be fetched in the next 
fetch process, is already available in the item element. Each item element 
represents one web resource, and there is no reason to go to the server and 
re-fetch that resource.
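
To make that concrete: everything needed for a per-item document can be read straight
out of the feed XML, for example with the JDK's DOM parser (a standalone sketch, not
parse-rss code):

  import java.io.ByteArrayInputStream;
  import javax.xml.parsers.DocumentBuilderFactory;
  import org.w3c.dom.Document;
  import org.w3c.dom.Element;
  import org.w3c.dom.NodeList;

  public class ItemFieldsSketch {
    public static void main(String[] args) throws Exception {
      String feed = "<rss><channel><item>"
          + "<title>nutch--open source</title>"
          + "<link>http://lucene.apache.org/nutch</link>"
          + "<description>nutch nutch nutch</description>"
          + "</item></channel></rss>";
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(feed.getBytes("UTF-8")));
      NodeList items = doc.getElementsByTagName("item");
      for (int i = 0; i < items.getLength(); i++) {
        Element item = (Element) items.item(i);
        // Title, link and description are all present in the feed itself, so a
        // per-item index entry could be built without a second fetch.
        System.out.println(text(item, "title") + " | " + text(item, "link")
            + " | " + text(item, "description"));
      }
    }

    private static String text(Element item, String tag) {
      NodeList nodes = item.getElementsByTagName(tag);
      return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent();
    }
  }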

Another issue that arises with RSS feeds is that once the feed page is fetched, 
you cannot re-fetch it until its fetch interval has expired. A feed's TTL is 
usually very short. Since, for now, all pages in Nutch are created equal :) it is 
one more thing to think about.

HTH,

Gal.

-Original Message-
From: Chris Mattmann [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 01, 2007 7:01 PM
To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this function

Hi Gal, et al.,

  I'd like to be explicit when we talk about what the issue with the RSS
parsing plugin is here; I think we have had conversations similar to this
before and it seems that we keep talking around each other. I'd like to get
to the heart of this matter so that the issue (if there is an actual one)
gets addressed ;)

  Okay, so you mention below that the thing that you see missing from the
current RSS parsing plugin is the ability to store data in the CrawlDatum,
and parse it in the next fetch phase. Well, there are 2 options here for
what you refer to as it:

 1. If you're talking about the RSS file, then in fact, it is parsed, and
its data is stored in the CrawlDatum, akin to any other form of content that
is fetched, parsed and indexed.

 2. If you're talking about the item links within the RSS file, in fact,
they are parsed (eventually), and their data stored in the CrawlDatum, akin
to any other form of content that is fetched, parsed, and indexed. This is
accomplished by adding the RSS items as Outlinks when the RSS file is
parsed: in this fashion, we go after all of the links in the RSS file, and
make sure that we index their content as well.

Thus, if you had an RSS file R that contained links in it to a PDF file A,
and another HTML page P, then not only would R get fetched, parsed, and
indexed, but so would A and P, because they are item links within R. Then
queries that would match R (the physical RSS file), would additionally match
things such as P and A, and all 3 would be capable of being returned in a
Nutch query. Does this make sense? Is this the issue that you're talking
about? Am I nuts? ;)

Cheers,
  Chris




On 1/31/07 10:40 PM, Gal Nitzan [EMAIL PROTECTED] wrote:

 Hi,
 
 Many sites provide RSS feeds for several reasons, usually to save bandwidth,
 to give the users concentrated data and so forth.
 
 Some of the RSS files supplied by sites are created specially for search
 engines where each RSS item represent a web page in the site.
 
 IMHO the only thing missing in the parse-rss plugin is storing the data in
 the CrawlDatum and parsing it in the next fetch phase. Maybe adding a new
 flag to CrawlDatum, that would flag the URL as parsable not fetchable?
 
 Just my two cents...
 
 Gal.
 
 -Original Message-
 From: Chris Mattmann [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, January 31, 2007 8:44 AM
 To: nutch-dev@lucene.apache.org
 Subject: Re: RSS-fecter and index individul-how can i realize this function
 
 Hi there,
 
   With the explanation that you give below, it seems like parse-rss as it
 exists would address what you are trying to do. parse-rss parses an RSS
 channel as a set of items, and indexes overall metadata about the RSS file,
 including parse text, and index data, but it also adds each item (in the
 channel)'s URL as an Outlink, so that Nutch will process those pieces of
 content as well. The only thing that you suggest below that parse-rss
 currently doesn't do, is to allow you to associate the metadata fields
 category:, and author: with the item Outlink...
 
 Cheers,
   Chris
 
 
 

Re: RSS-fecter and index individul-how can i realize this function

2007-02-01 Thread kauu

hi all,
 what Gal said is exactly what I mean about the rss-parse need.
 I just want to fetch the RSS seeds once.



On 2/2/07, Gal Nitzan [EMAIL PROTECTED] wrote:



Hi Chris,

I'm sorry I wasn't clear enough. What I mean is that in the current
implementation:

1. The RSS (channels, items) page ends up as one Lucene document in the
index.
2. Indeed the links are extracted and each item link will be fetched in
the next fetch as a separate page and will end up as one Lucene document.

IMHO the data that is needed i.e. the data that will be fetched in the
next fetch process is already available in the item element. Each item
element represents one web resource. And there is no reason to go to the
server and re-fetch that resource.

Another issue that arises from rss feeds is that once the feed page is
fetched you can not re-fetch it until its time to fetch expired. The feeds
TTL is usually very short. Since for now in Nutch, all pages created equal
:) it is one more thing to think about.

HTH,

Gal.

-Original Message-
From: Chris Mattmann [mailto:[EMAIL PROTECTED]
Sent: Thursday, February 01, 2007 7:01 PM
To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this
function

Hi Gal, et al.,

  I'd like to be explicit when we talk about what the issue with the RSS
parsing plugin is here; I think we have had conversations similar to this
before and it seems that we keep talking around each other. I'd like to
get
to the heart of this matter so that the issue (if there is an actual one)
gets addressed ;)

  Okay, so you mention below that the thing that you see missing from the
current RSS parsing plugin is the ability to store data in the CrawlDatum,
and parse it in the next fetch phase. Well, there are 2 options here for
what you refer to as it:

1. If you're talking about the RSS file, then in fact, it is parsed, and
its data is stored in the CrawlDatum, akin to any other form of content
that
is fetched, parsed and indexed.

2. If you're talking about the item links within the RSS file, in fact,
they are parsed (eventually), and their data stored in the CrawlDatum,
akin
to any other form of content that is fetched, parsed, and indexed. This is
accomplished by adding the RSS items as Outlinks when the RSS file is
parsed: in this fashion, we go after all of the links in the RSS file, and
make sure that we index their content as well.

Thus, if you had an RSS file R that contained links in it to a PDF file A,
and another HTML page P, then not only would R get fetched, parsed, and
indexed, but so would A and P, because they are item links within R. Then
queries that would match R (the physical RSS file), would additionally
match
things such as P and A, and all 3 would be capable of being returned in a
Nutch query. Does this make sense? Is this the issue that you're talking
about? Am I nuts? ;)

Cheers,
  Chris




On 1/31/07 10:40 PM, Gal Nitzan [EMAIL PROTECTED] wrote:

 Hi,

 Many sites provide RSS feeds for several reasons, usually to save
bandwidth,
 to give the users concentrated data and so forth.

 Some of the RSS files supplied by sites are created specially for search
 engines where each RSS item represent a web page in the site.

 IMHO the only thing missing in the parse-rss plugin is storing the
data in
 the CrawlDatum and parsing it in the next fetch phase. Maybe adding a
new
 flag to CrawlDatum, that would flag the URL as parsable not
fetchable?

 Just my two cents...

 Gal.

 -Original Message-
 From: Chris Mattmann [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, January 31, 2007 8:44 AM
 To: nutch-dev@lucene.apache.org
 Subject: Re: RSS-fecter and index individul-how can i realize this
function

 Hi there,

   With the explanation that you give below, it seems like parse-rss as
it
 exists would address what you are trying to do. parse-rss parses an RSS
 channel as a set of items, and indexes overall metadata about the RSS
file,
 including parse text, and index data, but it also adds each item (in the
 channel)'s URL as an Outlink, so that Nutch will process those pieces of
 content as well. The only thing that you suggest below that parse-rss
 currently doesn't do, is to allow you to associate the metadata fields
 category:, and author: with the item Outlink...

 Cheers,
   Chris




Re: RSS-fecter and index individul-how can i realize this function

2007-01-31 Thread kauu

hi,
thanks anyway, but I don't think I explained it clearly enough.

What I want is for Nutch to fetch the RSS seeds to a depth of 1 only, so Nutch should
just fetch some XML pages. I don't want to fetch the items' outlink pages, because
there is too much spam in those pages.
So I just need to parse the RSS file.
Then, when I search for some words that appear in the description tag of one XML
file's item, the returned hit should look like this:
title   == one item's title
summary == one item's description
link    == one item's outlink

So, does the parse-rss plugin provide this function?

On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote:


Hi there,

  With the explanation that you give below, it seems like parse-rss as it
exists would address what you are trying to do. parse-rss parses an RSS
channel as a set of items, and indexes overall metadata about the RSS
file,
including parse text, and index data, but it also adds each item (in the
channel)'s URL as an Outlink, so that Nutch will process those pieces of
content as well. The only thing that you suggest below that parse-rss
currently doesn't do, is to allow you to associate the metadata fields
category:, and author: with the item Outlink...

Cheers,
  Chris










--
www.babatu.com


RE: RSS-fecter and index individul-how can i realize this function

2007-01-31 Thread Gal Nitzan
Hi,

Many sites provide RSS feeds for several reasons, usually to save bandwidth, to 
give users concentrated data, and so forth.

Some of the RSS files supplied by sites are created specifically for search 
engines, where each RSS item represents a web page on the site.

IMHO the only thing missing in the parse-rss plugin is storing the data in 
the CrawlDatum and parsing it in the next fetch phase. Maybe adding a new 
flag to CrawlDatum that would mark the URL as "parsable, not fetchable"?
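
For illustration only, the kind of thing I have in mind might look roughly like this
(the metadata key is made up, nothing in Nutch reads it today, and the exact CrawlDatum
metadata accessors may differ between versions):

  import org.apache.hadoop.io.MapWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;

  public class ParseOnlyFlagSketch {
    // Hypothetical: tag an item URL so a later phase would build its document
    // from the data already parsed out of the feed, instead of fetching the
    // URL again.
    static void markParseOnly(CrawlDatum datum) {
      MapWritable meta = new MapWritable();
      meta.put(new Text("_parse_only_"), new Text("true"));
      datum.setMetaData(meta);
    }
  }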

Just my two cents...

Gal.

-Original Message-
From: Chris Mattmann [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, January 31, 2007 8:44 AM
To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this function

Hi there,

  With the explanation that you give below, it seems like parse-rss as it
exists would address what you are trying to do. parse-rss parses an RSS
channel as a set of items, and indexes overall metadata about the RSS file,
including parse text, and index data, but it also adds each item (in the
channel)'s URL as an Outlink, so that Nutch will process those pieces of
content as well. The only thing that you suggest below that parse-rss
currently doesn't do, is to allow you to associate the metadata fields
category:, and author: with the item Outlink...

Cheers,
  Chris










RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread kauu
Hi folks:

   What I want to do is to separate an RSS file into several pages.

   Just as has been discussed before, I want to fetch an RSS page and index
it as different documents in the index, so the searcher can return each
item's info as an individual hit.

   My idea is to create a protocol that fetches the RSS page and stores it as
several pieces, each containing just one ITEM tag. But the unique key is the URL,
so how can I store them with the ITEM's link tag as the unique key for a
document?

   So my question is how to realize this function in nutch-0.8.x.

   I've checked the code of the protocol-http plug-in, but I can't find the
place where a page is stored as a document. I want to separate the RSS page
into several pieces before storing it, so it ends up as several documents
instead of one.

   So can anyone give me some hints?

Any reply will be appreciated!

 

 

  ITEM’s structure 

<item>
  <title>欧洲暴风雪后发制人 致航班延误交通混乱(组图)</title>
  <description>暴风雪横扫欧洲,导致多次航班延误 1月24日,几架民航客机在德国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场清扫飞机跑道上的积雪。 据报道,迟来的暴风雪连续两天横扫中...</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>搜狐焦点图新闻</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>

 



Re: RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread Chris Mattmann
Hi there,

 I could most likely be of assistance, if you gave me some more information.
For instance: I'm wondering if the use case you describe below is already
supported by the current RSS parse plugin?

 The current RSS parser, parse-rss, does in fact index individual items that
are pointed to by an RSS document. The items are added as Nutch Outlinks,
and added to the overall queue of URLs to fetch. Doesn't this satisfy what
you mention below? Or am I missing something?

Cheers,
  Chris



On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote:

 Hi folks :
 
What’s I want to do is to separate a rss file into several pages .
 
   Just as what has been discussed before. I want fetch a rss page and index
 it as different documents in the index. So the searcher can search the
 Item’s info as a individual hit.
 
  What’s my opinion create a protocol for fetch the rss page and store it as
 several one which just contain one ITEM tag .but the unique key is the url ,
 so how can I store them with the ITEM’s link tag as the unique key for a
 document.
 
   So my question is how to realize this function in nutch-.0.8.x.
 
   I’ve check the code of the plug-in protocol-http’s code ,but I can’t
 find the code where to store a page to a document. I want to separate the
 rss page to several ones before storing it as a document but several ones.
 
   So any one can give me some hints?
 
 Any reply will be appreciated !
 
  
 
  
 
   ITEM’s structure
 
  item
 
 
 title欧洲暴风雪后发制人 致航班延误交通混乱(组图)/title
 
 
 description暴风雪横扫欧洲,导致多次航班延误 1月24日,几架民航客机在德
 国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场
 清扫飞机跑道上的积雪。  据报道,迟来的暴风雪连续两天横扫中...
 
 
 
 /description
 
 
 linkhttp://news.sohu.com/20070125
 http://news.sohu.com/20070125/n247833568.shtml /n247833568.shtml/
 link
 
 
 category搜狐焦点图新闻/category
 
 
 author[EMAIL PROTECTED]
 /author
 
 
 pubDateThu, 25 Jan 2007 11:29:11 +0800/pubDate
 
 
 comments
 http://comment.news.sohu.com
 http://comment.news.sohu.com/comment/topic.jsp?id=247833847
 /comment/topic.jsp?id=247833847/comments
 
 
 /item
 
  
 




Re: RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread kauu

thanks for your reply.
Maybe I didn't explain it clearly.
I want to index each item as an individual page. Then, when I search for something,
for example "nutch-open source", Nutch would return a hit which contains

  title       : nutch-open source
  description : nutch nutch nutch nutch  nutch
  url         : http://lucene.apache.org/nutch
  category    : news
  author      : kauu

so, can the plugin parse-rss satisfy what I need?

<item>
  <title>nutch--open source</title>
  <description>nutch nutch nutch nutch  nutch</description>
  <link>http://lucene.apache.org/nutch</link>
  <category>news</category>
  <author>kauu</author>
</item>
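
To make the desired result concrete, such a per-item hit would correspond to a Lucene
document along these lines (an illustration only, written against the Lucene 2.x Field
API; Nutch actually builds its documents through indexing filters, and the field names
here are assumptions):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  public class ItemDocSketch {
    // One Lucene document per RSS item, keyed by the item's link.
    static Document itemToDoc(String link, String title, String description,
                              String category, String author) {
      Document doc = new Document();
      doc.add(new Field("url", link, Field.Store.YES, Field.Index.UN_TOKENIZED));
      doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
      doc.add(new Field("content", description, Field.Store.YES, Field.Index.TOKENIZED));
      doc.add(new Field("category", category, Field.Store.YES, Field.Index.UN_TOKENIZED));
      doc.add(new Field("author", author, Field.Store.YES, Field.Index.UN_TOKENIZED));
      return doc;
    }
  }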




On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote:


Hi there,

I could most likely be of assistance, if you gave me some more
information.
For instance: I'm wondering if the use case you describe below is already
supported by the current RSS parse plugin?

The current RSS parser, parse-rss, does in fact index individual items
that
are pointed to by an RSS document. The items are added as Nutch Outlinks,
and added to the overall queue of URLs to fetch. Doesn't this satisfy what
you mention below? Or am I missing something?

Cheers,
  Chris



On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote:

 Hi folks :

What's I want to do is to separate a rss file into several pages .

   Just as what has been discussed before. I want fetch a rss page and
index
 it as different documents in the index. So the searcher can search the
 Item's info as a individual hit.

  What's my opinion create a protocol for fetch the rss page and store it
as
 several one which just contain one ITEM tag .but the unique key is the
url ,
 so how can I store them with the ITEM's link tag as the unique key for a
 document.

   So my question is how to realize this function in nutch-.0.8.x.

   I've check the code of the plug-in protocol-http's code ,but I can't
 find the code where to store a page to a document. I want to separate
the
 rss page to several ones before storing it as a document but several
ones.

   So any one can give me some hints?

 Any reply will be appreciated !





   ITEM's structure

  item


 title欧洲暴风雪后发制人 致航班延误交通混乱(组图)/title


 description暴风雪横扫欧洲,导致多次航班延误 1月24日,几架民航客机在德
 国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场
 清扫飞机跑道上的积雪。 据报道,迟来的暴风雪连续两天横扫中...



 /description


 linkhttp://news.sohu.com/20070125
 http://news.sohu.com/20070125/n247833568.shtml /n247833568.shtml/
 link


 category搜狐焦点图新闻/category


 author[EMAIL PROTECTED]
 /author


 pubDateThu, 25 Jan 2007 11:29:11 +0800/pubDate


 comments
 http://comment.news.sohu.com
 http://comment.news.sohu.com/comment/topic.jsp?id=247833847
 /comment/topic.jsp?id=247833847/comments


 /item









--
www.babatu.com


Re: RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread Chris Mattmann
Hi there,

On 1/30/07 7:00 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 Chris,
 
 I saw your name associated with the rss parser in nutch.  My understanding is
 that nutch is using feedparser.  I had two questions:
 
 1.  Have you looked at vtd as an rss parser?

I haven't in fact; what are its benefits over those of commons-feedparser?

 2.  Any view on asynchronous communication as the underlying protocol?  I do
 not believe that feedparser uses that at this point.

I'm not sure exactly what asynchronous communication affords you when parsing RSS
feeds: what type of communication are you talking about above? Nutch
handles the communications layer for fetching content using a pluggable,
Protocol-based model. The only feature that Nutch's rss parser uses from the
underlying feedparser library is its object model and callback framework for
parsing RSS/Atom/Feed XML documents. When you mention asynchronous above,
are you talking about the protocol for fetching the different RSS documents?

Thanks!

Cheers,
  Chris


 
 Thanks
   
 
 -Original Message-
 From: Chris Mattmann [EMAIL PROTECTED]
 Date: Tue, 30 Jan 2007 18:16:44
 To:nutch-dev@lucene.apache.org
 Subject: Re: RSS-fecter and index individul-how can i realize this function
 
 Hi there,
 
  I could most likely be of assistance, if you gave me some more information.
 For instance: I'm wondering if the use case you describe below is already
 supported by the current RSS parse plugin?
 
  The current RSS parser, parse-rss, does in fact index individual items that
 are pointed to by an RSS document. The items are added as Nutch Outlinks,
 and added to the overall queue of URLs to fetch. Doesn't this satisfy what
 you mention below? Or am I missing something?
 
 Cheers,
   Chris
 
 
 
 On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote:
 
 Hi folks :
 
What’s I want to do is to separate a rss file into several pages .
 
   Just as what has been discussed before. I want fetch a rss page and index
 it as different documents in the index. So the searcher can search the
 Item’s info as a individual hit.
 
  What’s my opinion create a protocol for fetch the rss page and store it as
 several one which just contain one ITEM tag .but the unique key is the url ,
 so how can I store them with the ITEM’s link tag as the unique key for a
 document.
 
   So my question is how to realize this function in nutch-.0.8.x.
 
   I’ve check the code of the plug-in protocol-http’s code ,but I can’t
 find the code where to store a page to a document. I want to separate the
 rss page to several ones before storing it as a document but several ones.
 
   So any one can give me some hints?
 
 Any reply will be appreciated !
 
  
 
  
 
   ITEM’s structure
 
  item
 
 
 title欧洲暴风雪后发制人 致航班延误交通混乱(组图)/title
 
 
 description暴风雪横扫欧洲,导致多次航班延误 1月24日,几架民航客机在德
 国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场
 清扫飞机跑道上的积雪。  据报道,迟来的暴风雪连续两天横扫中...
 
 
 
 /description
 
 
 linkhttp://news.sohu.com/20070125
 http://news.sohu.com/20070125/n247833568.shtml /n247833568.shtml/
 link
 
 
 category搜狐焦点图新闻/category
 
 
 author[EMAIL PROTECTED]
 /author
 
 
 pubDateThu, 25 Jan 2007 11:29:11 +0800/pubDate
 
 
 comments
 http://comment.news.sohu.com
 http://comment.news.sohu.com/comment/topic.jsp?id=247833847
 /comment/topic.jsp?id=247833847/comments
 
 
 /item
 
  
 
 
 




Re: RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread pdecrem

1.   VTD claims to be faster.
2.   Asynchronous communication should take care of sitting and waiting for one fetch to 
return before you start the next. 

PS: I am not sure if you checked out tailrank.com for that branch of feedparser 
(I think it's at code.tailrank.com/feedparser).

Thanks


  

-Original Message-
From: Chris Mattmann [EMAIL PROTECTED]
Date: Tue, 30 Jan 2007 19:34:49 
To:nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this function

Hi there,

On 1/30/07 7:00 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 Chris,
 
 I saw your name associated with the rss parser in nutch.  My understanding is
 that nutch is using feedparser.  I had two questions:
 
 1.  Have you looked at vtd as an rss parser?

I haven't in fact; what are its benefits over those of commons-feedparser?

 2.  Any view on asynchronous communication as the underlying protocol?  I do
 not believe that feedparser uses that at this point.

I'm not sure exactly what asynchronous communication when parsing rss feeds
affords you: what type of communications are you talking about above? Nutch
handles the communications layer for fetching content using a pluggable,
Protocol-based model. The only feature that Nutch's rss parser uses from the
underlying feedparser library is its object model and callback framework for
parsing RSS/Atom/Feed XML documents. When you mention asynchronous above,
are you talking about the protocol for fetching the different RSS documents?

Thanks!

Cheers,
  Chris


 
 Thanks
   
 
 -Original Message-
 From: Chris Mattmann [EMAIL PROTECTED]
 Date: Tue, 30 Jan 2007 18:16:44
 To:nutch-dev@lucene.apache.org
 Subject: Re: RSS-fecter and index individul-how can i realize this function
 
 Hi there,
 
  I could most likely be of assistance, if you gave me some more information.
 For instance: I'm wondering if the use case you describe below is already
 supported by the current RSS parse plugin?
 
  The current RSS parser, parse-rss, does in fact index individual items that
 are pointed to by an RSS document. The items are added as Nutch Outlinks,
 and added to the overall queue of URLs to fetch. Doesn't this satisfy what
 you mention below? Or am I missing something?
 
 Cheers,
   Chris
 
 
 
 On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote:
 
 Hi folks :
 
What’s I want to do is to separate a rss file into several pages .
 
   Just as what has been discussed before. I want fetch a rss page and index
 it as different documents in the index. So the searcher can search the
 Item’s info as a individual hit.
 
  What’s my opinion create a protocol for fetch the rss page and store it as
 several one which just contain one ITEM tag .but the unique key is the url ,
 so how can I store them with the ITEM’s link tag as the unique key for a
 document.
 
   So my question is how to realize this function in nutch-.0.8.x.
 
   I’ve check the code of the plug-in protocol-http’s code ,but I can’t
 find the code where to store a page to a document. I want to separate the
 rss page to several ones before storing it as a document but several ones.
 
   So any one can give me some hints?
 
 Any reply will be appreciated !
 
  
 
  
 
   ITEM’s structure
 
  item
 
 
 title欧洲暴风雪后发制人 致航班延误交通混乱(组图)/title
 
 
 description暴风雪横扫欧洲,导致多次航班延误 1月24日,几架民航客机在德
 国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场
 清扫飞机跑道上的积雪。  据报道,迟来的暴风雪连续两天横扫中...
 
 
 
 /description
 
 
 linkhttp://news.sohu.com/20070125
 http://news.sohu.com/20070125/n247833568.shtml /n247833568.shtml/
 link
 
 
 category搜狐焦点图新闻/category
 
 
 author[EMAIL PROTECTED]
 /author
 
 
 pubDateThu, 25 Jan 2007 11:29:11 +0800/pubDate
 
 
 comments
 http://comment.news.sohu.com
 http://comment.news.sohu.com/comment/topic.jsp?id=247833847
 /comment/topic.jsp?id=247833847/comments
 
 
 /item
 
  
 
 
 




Re: RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread Chris Mattmann
Hi there,

  With the explanation that you give below, it seems like parse-rss as it
exists would address what you are trying to do. parse-rss parses an RSS
channel as a set of items, and indexes overall metadata about the RSS file,
including parse text, and index data, but it also adds each item (in the
channel)'s URL as an Outlink, so that Nutch will process those pieces of
content as well. The only thing that you suggest below that parse-rss
currently doesn't do is to allow you to associate the metadata fields
category: and author: with the item Outlink...
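
For reference, an Outlink carries only a target URL and an anchor string, which is why
those per-item fields have nowhere to ride along today (a tiny sketch, assuming the
two-argument Outlink constructor of more recent Nutch versions):

  import java.net.MalformedURLException;
  import org.apache.nutch.parse.Outlink;

  public class OutlinkFieldsSketch {
    public static void main(String[] args) throws MalformedURLException {
      // Just (target URL, anchor text); there is no slot for per-item metadata
      // such as category or author.
      Outlink out = new Outlink("http://lucene.apache.org/nutch", "nutch--open source");
      System.out.println(out.getToUrl() + " / " + out.getAnchor());
    }
  }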

Cheers,
  Chris


