Re: [Nutch-general] rss integration

Ernesto De Santis Mon, 11 Sep 2006 14:14:00 -0700

Thanks a lot Chris, it's working.

The key was in crawl-urlfilter.txt: all the item URLs had more 
than three '/' and contained filtered characters like '?'.


I commented out the
[EMAIL PROTECTED]
and
#-.*(/.+?)/.*?\1/.*?\1/
lines, and it works very well.
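
For readers who want to check such rules before a crawl, here is a minimal Python sketch of a first-match regex filter. It is not Nutch code. It assumes that the first rule whose regex is found in the URL decides, and that a URL matching no rule is rejected. The `-[?]` rule is a hypothetical stand-in for the line the archive obscured, and `ITEM_URL` is an invented example of an item link containing '?'.

```python
import re

def parse_rules(text):
    """Turn filter lines into (accept, compiled_regex) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Assumed semantics: first matching rule decides; no match means reject."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False

# Filter before the change: both reject rules are active.
BEFORE = r"""
-[?]
-.*(/.+?)/.*?\1/.*?\1/
+^http://([a-z0-9]*\.)*cnn.com/
"""

# Filter after the change: both reject rules are commented out.
AFTER = r"""
#-[?]
#-.*(/.+?)/.*?\1/.*?\1/
+^http://([a-z0-9]*\.)*cnn.com/
"""

# Hypothetical item link with a query string.
ITEM_URL = "http://www.cnn.com/story/index.html?eref=rss_topstories"

if __name__ == "__main__":
    print("before:", accepts(parse_rules(BEFORE), ITEM_URL))
    print("after: ", accepts(parse_rules(AFTER), ITEM_URL))
```

Note that the second rule rejects URLs in which the same slash-delimited segment occurs three or more times, so commenting it out also lets such URLs through.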

Thanks Chris!
Ernesto.


Chris Mattmann wrote:
> Hi Ernesto,
>
>   You need to make sure that the links inside of the RSS files that are
> getting indexed are not filtered out by your url filter. For instance, say
> you had an RSS file that had the following links:
>
> http://foo.com/news/
> http://foo.bar.com/sports/
> http://bar.foo.com/breaking/news/highlights
>
> Well, you would need to add support in your url filter for each of the
> different host names and paths that you would be indexing. So, in your
> example below, I'm pretty sure that your URL filter limits you to only
> those 2 domains, rss.cnn.com and www.cnn.com. I think that if you changed
> your filter, for example to:
>
> +^http://([a-z0-9]*\.)*cnn.com/
>
> That might help. Ensure that the links present in the CNN RSS files fall
> within the *.cnn.com domain, otherwise, update your url filter accordingly.
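
A quick way to see which links the suggested pattern covers is to try it outside Nutch. The sketch below applies the regex from the message above to a few URLs in Python. The seed URL is from this thread, while the other three links are made-up examples.

```python
import re

# The accept pattern suggested above, without the leading '+'.
HOST_RULE = re.compile(r"^http://([a-z0-9]*\.)*cnn.com/")

def in_scope(url):
    """Return True if the URL falls under the suggested accept pattern."""
    return HOST_RULE.search(url) is not None

LINKS = [
    "http://rss.cnn.com/rss/cnn_topstories.rss",  # seed from this thread
    "http://www.cnn.com/2006/index.html",         # hypothetical item link
    "http://edition.cnn.com/world/",              # hypothetical item link
    "http://feeds.example.org/cnn/story",         # hypothetical off-domain link
]

if __name__ == "__main__":
    for link in LINKS:
        print(in_scope(link), link)
```

The dot in `cnn.com` is unescaped, so it matches any character. Writing `cnn\.com` would be stricter, if that matters for your crawl.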
>
>  More specific comments below:
>
> On 9/10/06 11:23 PM, "Ernesto De Santis" <[EMAIL PROTECTED]>
> wrote:
>
>   
>> Hi Chris
>>
>> Thanks for your response.
>> But I can't get it to work.
>>
>> It always indexes the whole channel as one Document.
>>
>> I followed these steps (to index a CNN channel):
>>
>> 1- Write my seed file, with just one seed:
>>
>> http://rss.cnn.com/rss/cnn_topstories.rss
>>     
>
> Good, that's the right thing to do.
>
>   
>> 2- Include the parser:
>>
>> In the file nutch-default.xml, in the plugin.includes property, I include
>> the RSS parser:
>>   
>> <value>protocol-http|urlfilter-regex|parse-(rss|text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|index-url-category</value>
>>
>>     
>
> Perfect.
>
>   
>> 3- Accept CNN hosts
>>
>> In the file crawl-urlfilter.txt I wrote:
>> +^http://rss.cnn.com/
>> +^http://www.cnn.com/
>>     
>
> See my comments above here. I think that you need to change these.
>
>   
>> Then I run the crawler, but I always get an index with a single Document.
>> I tried a few more things without success (like setting
>> db.ignore.internal.links to false and changing the order of the mimetype
>> parsers; I read about a problem with that in one of your posts).
>>
>> Do you know what I'm forgetting?
>>
>> How can I be sure that parse-rss is parsing some content?
>> Can I get some log output about that?
>>     
>
> Yup, there should be some information in the nutch.log file. Do a grep for
> "parse-rss" or "RSSParser" in the log file.
>
>   
>> About outlinks, I don't understand what I must do with them. Do I need to do
>> something with the outlinks after parse-rss has run?
>>     
>
> Nope. Outlinks are links coming out of a page of content. So, if there are 5
> links in a web page, or an RSS document, then there are 5 so-called
> "Outlinks" in Nutch terminology. During the parsing phase, as content is
> parsed individually, Nutch requires a parser to append any Outlinks found in
> a particular piece of content and return them back to the Fetcher so that
> they too can be crawled.
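
To make the idea of outlinks concrete, here is a small Python sketch that collects the item links from an RSS 2.0 document. It only illustrates the concept and is not how Nutch's RSSParser is implemented. The feed text and the `extract_outlinks` helper are invented for this example, reusing the three sample links mentioned earlier.

```python
import xml.etree.ElementTree as ET

# Invented RSS 2.0 feed built from the sample links above.
FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://foo.com/</link>
    <item><title>News</title><link>http://foo.com/news/</link></item>
    <item><title>Sports</title><link>http://foo.bar.com/sports/</link></item>
    <item>
      <title>Highlights</title>
      <link>http://bar.foo.com/breaking/news/highlights</link>
    </item>
  </channel>
</rss>"""

def extract_outlinks(rss_text):
    """Return the <link> of every <item>, in document order."""
    root = ET.fromstring(rss_text)
    outlinks = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            outlinks.append(link.strip())
    return outlinks

if __name__ == "__main__":
    for url in extract_outlinks(FEED):
        print(url)
```

Each URL returned this way would still have to pass the URL filter before it could be fetched, which is why the filter rules matter for RSS items.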
>
>
> HTH,
>   Chris
>
>   
>> Thanks a lot ... again.
>> Ernesto.
>>
>> Chris Mattmann wrote:
>>     
>>> Hi Ernesto,
>>>
>>>  The RSSParser in Nutch does in fact index the individual item links: they
>>> are added as Outlinks during each iteration in which the RSSParser is
>>> called. Both the channel text and the item text are indexed. Also, since
>>> each Item link is added as an Outlink to the list of returned Outlinks,
>>> Nutch is able to crawl many urls that can come out of a single RSS feed.
>>>
>>> HTH,
>>>   Chris
>>>
>>>
>>>
>>> On 9/10/06 5:54 PM, "Ernesto De Santis" <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>   
>>>       
>>>> Hi all
>>>>
>>>> I'm trying to integrate RSS and Atom sources into my Nutch index.
>>>> I see that Nutch has an RSSParser, but it seems that it indexes the whole
>>>> source as one document, right?
>>>>
>>>> I want to index each item separately.
>>>> Has anybody done this? What's the best approach?
>>>>
>>>> I am thinking of using an external process to add Documents to the
>>>> Nutch (Lucene) index using an RSS fetcher like Rome. The downside is
>>>> that it isn't integrated with Nutch.
>>>>
>>>> I don't know the Nutch core well enough to hack it, and I don't know
>>>> whether it is possible to integrate this into Nutch.
>>>>
>>>> Thanks a lot!
>>>> Ernesto.
>>>>
>>>
>>>   
>>>       
>>
>
>
>
>   

        
        
                
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
