On Mon, Jan 5, 2009 at 12:32 PM, Doğacan Güney <doga...@gmail.com> wrote:
> On Mon, Jan 5, 2009 at 7:00 AM, Vlad Cananau <vlad...@gmail.com> wrote:
>> Hello
>> I'm trying to make RSSParser do something similar to FeedParser (which
>> doesn't work quite right) - that is, instead of indexing the whole contents
>
> Why doesn't FeedParser work? Let's fix whatever is broken in it :D
>
>> of the feed, I want it to show individual items, with their respective title
>> and a proper link to the article. I realize that I could index one level
>> deeper, but I'd like to index just the feed, not the articles that go with it
>> (keep the index small and the crawl fast).
>>
>> For each item in each RSS channel (the code does not differ much from
>> getParse() in RSSParser.java) I do something like:
>>
>>  Outlink[] outlinks = new Outlink[1];
>>  try{
>>   outlinks[0] = new Outlink(whichLink, theRSSItem.getTitle());
>>  } catch (Exception e) {
>>   continue;
>>  }
>>
>>  parseResult.put(
>>   whichLink,
>>   new ParseText(theRSSItem.getTitle() + theRSSItem.getDescription()),
>>   new ParseData(
>>     ParseStatus.STATUS_SUCCESS,
>>     theRSSItem.getTitle(),
>>     outlinks,
>>     new Metadata() //was content.getMetadata()
>>   )
>>  );
>>
>> The problem is, however, that only one item from the whole RSS gets into the
>> index, although in the log I can see them all (I've tried it with feeds
>> from CNN and Reuters). What happens? Why do they get overwritten in a
>> seemingly random order? The item that makes it into the index is neither the
>> first nor the last, but appears to be the same until new items appear in the
>> feed.
>>
>> Thank you,
>> Vlad
>>
>>
>
>
>
> --
> Doğacan Güney
>

When using FeedParser, not all of the feeds make it into the index.
For example, I crawl both Entertainment and Politics, but I get
results only for some of the articles.

Is there any way to check whether or not entries make it into the index?
In the log I see "Indexing http://rss.cnn.com/... with analyzer
org.apache.nutch.analyzer.NutchDocumentAnalyzer (something)" (I'm not
able to crawl right now, since I don't have access to the machine).
But when I look for keywords specific to some of the documents, I
don't get any results :-(
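
One rough way to cross-check what the indexer reported, assuming the log
lines really look like the one quoted above ("Indexing <url> with analyzer
..."), is to pull the URLs out of the log and compare them with the feed
items. The class and method names below are made up for illustration, and
this only shows what the log claims was indexed, not what a search will
actually return.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IndexedUrlLister {
    // Assumed log shape, based only on the line quoted in this message:
    //   "Indexing <url> with analyzer <class> (...)"
    private static final Pattern INDEXING_LINE =
        Pattern.compile("Indexing (\\S+) with analyzer");

    /** Returns the URLs named in "Indexing ... with analyzer" lines, in log order. */
    public static List<String> indexedUrls(List<String> logLines) {
        List<String> urls = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = INDEXING_LINE.matcher(line);
            if (m.find()) {
                urls.add(m.group(1)); // the URL between "Indexing" and "with analyzer"
            }
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: the first argument is the path to the crawl log.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (String url : indexedUrls(lines)) {
            System.out.println(url);
        }
    }
}
```

If only the feed URL itself shows up in that list, and none of the
per-item links do, then the items are being lost before indexing rather
than at search time.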
