Thanks a lot Chris, it's working. The key was in crawl-urlfilter.txt, because all the item URLs had more than three '/' and contained filtered characters like '?'.
I commented out these lines: [EMAIL PROTECTED] and #-.*(/.+?)/.*?\1/.*?\1/, and it works very well.

Thanks Chris!
Ernesto.

Chris Mattmann wrote:
> Hi Ernesto,
>
> You need to make sure that the links inside of the RSS files that are
> getting indexed are not filtered out by your url filter. For instance, say
> you had an RSS file that had the following links:
>
> http://foo.com/news/
> http://foo.bar.com/sports/
> http://bar.foo.com/breaking/news/highlights
>
> Well, you would need in your url filter to add support for each of the
> different host names and paths that you would be indexing. So, in your
> example below, I'm pretty sure that your URL filter limits you to only
> those 2 domains, rss.cnn.com and www.cnn.com. I think that if you changed
> your filter, for example to:
>
> +^http://([a-z0-9]*\.)*cnn.com/
>
> That might help. Ensure that the links present in the CNN RSS files fall
> within the *.cnn.com domain; otherwise, update your url filter accordingly.
>
> More specific comments below:
>
> On 9/10/06 11:23 PM, "Ernesto De Santis" <[EMAIL PROTECTED]>
> wrote:
>
>> Hi Chris
>>
>> Thanks for your response.
>> But I can't get it to work.
>>
>> Every time, it indexes the whole channel as one Document.
>>
>> I did these steps (to index a CNN channel):
>>
>> 1- Write in my seed file, with just one seed:
>>
>> http://rss.cnn.com/rss/cnn_topstories.rss
>
> Good, that's the right thing to do.
>
>> 2- Include the parser:
>>
>> In the file nutch-default.xml, tag plugin.includes, I include the rss
>> parser:
>>
>> <value>protocol-http|urlfilter-regex|parse-(rss|text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|index-url-category</value>
>
> Perfect.
>
>> 3- Accept cnn hosts
>>
>> In the file crawl-urlfilter.txt I wrote:
>> +^http://rss.cnn.com/
>> +^http://www.cnn.com/
>
> See my comments above here. I think that you need to change these.
>
>> Then I run the crawler, but I always get an index with only one Document.
>> I tried some more things, without success (like setting
>> db.ignore.internal.links to false and changing the order of the mimetype
>> parsers; I read about a problem with that in one of your posts).
>>
>> Do you know what I'm forgetting?
>>
>> How can I be sure that parse-rss is parsing some content?
>> Can I get some log about that?
>
> Yup, there should be some information in the nutch.log file. Do a grep for
> "parse-rss" or "RSSParser" in the log file.
>
>> About outlinks, I don't understand what I must do with them. Do I need to
>> do something with outlinks after parse-rss has run?
>
> Nope. Outlinks are links coming out of a page of content. So, if there are 5
> links in a web page, or an RSS document, then there are 5 so-called
> "Outlinks" in Nutch terminology. During the parsing phase, as content is
> parsed individually, Nutch requires a parser to append any Outlinks found in
> a particular piece of content and return them back to the Fetcher so that
> they too can be crawled.
>
> HTH,
> Chris
>
>> Thanks a lot ... again.
>> Ernesto.
>>
>> Chris Mattmann wrote:
>>
>>> Hi Ernesto,
>>>
>>> The RSSParser in Nutch does in fact index the individual item links: they
>>> are added as Outlinks during each iteration in which the RSSParser is
>>> called. Both the channel text and the item text are indexed. Also, since
>>> each Item link is added as an Outlink to the list of returned Outlinks,
>>> Nutch is able to crawl many urls that can come out of a single RSS feed.
>>>
>>> HTH,
>>> Chris
>>>
>>> On 9/10/06 5:54 PM, "Ernesto De Santis" <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Hi all
>>>>
>>>> I'm trying to integrate an RSS and Atom source into my nutch index.
>>>> I see that nutch has an RSSParser, but it seems that it indexes the whole
>>>> source as one Document, right?
>>>>
>>>> I want to index each item separately.
>>>> Has somebody done this? What's the best approach?
>>>>
>>>> I'm thinking about writing an external process that adds Documents to the
>>>> nutch (lucene) index using an RSS fetcher like Rome. The drawback is
>>>> that it isn't integrated with nutch.
>>>>
>>>> I don't know the nutch core well enough to hack it, and I don't know
>>>> whether it is possible to integrate this into nutch.
>>>>
>>>> Thanks a lot!
>>>> Ernesto.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
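
For anyone finding this thread later, here is a rough sketch of why the filter rules discussed above made the difference. It is plain Python, not Nutch code, and it rests on a few assumptions: that rules are tried top to bottom and the first one whose regex is found in the URL decides, that a URL matching no rule is rejected, and that Python's `re` behaves like Java's regex engine for these patterns. The `-[?]` rule is a hypothetical stand-in for the character rule that the archive shows only as [EMAIL PROTECTED], the function name `accepts` is made up, and the item URLs are invented examples.

```python
import re

# Rules in crawl-urlfilter.txt style: a leading '+' accepts, a leading '-' rejects.
RULES_BEFORE = [
    r"-[?]",                             # hypothetical stand-in for the redacted character rule
    r"-.*(/.+?)/.*?\1/.*?\1/",           # rule quoted in the thread (repeated path pieces)
    r"+^http://([a-z0-9]*\.)*cnn.com/",  # accept rule suggested by Chris
]

# Same list with the two '-' rules commented out, as Ernesto describes.
RULES_AFTER = [
    r"+^http://([a-z0-9]*\.)*cnn.com/",
]


def accepts(url, rules):
    """Return True if the first rule found in `url` is a '+' rule.

    Assumption: first match wins, and URLs matching no rule are rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    # Invented URLs, used only to exercise the rules.
    for url in (
        "http://rss.cnn.com/rss/cnn_topstories.rss",
        "http://www.cnn.com/story.html?eref=rss",
        "http://www.cnn.com/a/b/a/c/a/d",
    ):
        print(url, accepts(url, RULES_BEFORE), accepts(url, RULES_AFTER))
```

With the stricter list, the feed URL itself gets through but an item link carrying a '?' does not, which would match the symptom in the thread: only the channel ends up in the index.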

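
Chris's explanation of Outlinks can also be pictured with a small sketch. This is not the Nutch RSSParser (which is a Java plugin); it is a Python illustration that assumes a simple RSS 2.0 layout. The feed content, the example.com URLs, and the function name `extract_outlinks` are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up feed for illustration; these are not real CNN links.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example channel</title>
    <link>http://www.example.com/</link>
    <item><title>First</title><link>http://www.example.com/news/1</link></item>
    <item><title>Second</title><link>http://www.example.com/news/2</link></item>
  </channel>
</rss>"""


def extract_outlinks(feed_xml):
    """Collect the <link> of every <item> in the feed, in document order.

    In Nutch terms these would be the Outlinks a parser hands back so that
    each item URL can be fetched and indexed in a later crawl round.
    """
    root = ET.fromstring(feed_xml)
    outlinks = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            outlinks.append(link.strip())
    return outlinks


if __name__ == "__main__":
    for link in extract_outlinks(SAMPLE_FEED):
        print(link)
```

Each returned link still has to pass the URL filter before it is crawled, which is why the filter rules mattered in this thread.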