Hi sishen,

You should; Atom feed parsing has been broken for quite a long time.
If you don't want to replace the original plugin, just use another name.
Especially since your plugin only works for Atom feeds, I think you should use
a name like parse-atom or atom-parser. Please refer to the RSS parser plugin
for the naming convention.

Follow-up: after checking more of the feeds I crawled, apart from the broken
characters, I found that not every title is mis-parsed: some <title> text is
parsed correctly and some is not, even though both are well-formed...
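
For reference, the title-first behaviour discussed in this thread can be
sketched roughly as below. This is a standalone Python illustration, not the
Nutch plugin code: the function name anchor_texts and the sample feed are
hypothetical, and it assumes a plain RSS 2.0 feed without namespaces. It takes
the <title> of each <item> as anchor text and falls back to the <description>
only when the title is missing or empty.

```python
import xml.etree.ElementTree as ET


def anchor_texts(rss_xml):
    """Return one anchor text per <item>: the <title> if non-empty, else the <description>."""
    root = ET.fromstring(rss_xml)
    texts = []
    for item in root.iter("item"):
        # findtext returns None when the child element is absent
        title = (item.findtext("title") or "").strip()
        description = (item.findtext("description") or "").strip()
        texts.append(title or description)
    return texts


# Hypothetical sample feed, used only to exercise the function
SAMPLE = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>First post</title><description>Longer summary text</description></item>
  <item><description>Only a description</description></item>
</channel></rss>"""

print(anchor_texts(SAMPLE))
```

Running the same check over the crawled feeds might help narrow down which
titles are mis-parsed and which are not.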

Thank you,
Vinci


sishen wrote:
> 
> I also prefer the title over the description.
> 
> Also, I found there are some problems parsing Atom feeds with the
> "commons-feedparser" lib.
> I have implemented a new plugin that fixes the problem using
> rome<https://rome.dev.java.net/>.
> 
> 
> But I am not sure: should I submit it to the Nutch trunk?
> 
> Best regards.
> 
> sishen
> 
> On Mon, Mar 24, 2008 at 3:36 PM, Vinci <[EMAIL PROTECTED]> wrote:
> 
>>
>> Hi all,
>> I found that the RSS parser plugin uses the text in <description> as
>> anchor text instead of the <title>, so it always indexes the description,
>> while the title text is never indexed or used as anchor text.
>>
>> But actually the title is much more valuable and should be used as anchor
>> text.
>>
>> Is this a bug or a misunderstanding of RSS? If this is a bug, can anybody
>> post it in JIRA?
>>
>> Thank you for your attention.
>> --
>> View this message in context:
>> http://www.nabble.com/RSS-parser-plugin-bug--tp16246578p16246578.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/RSS-parser-plugin-bug--tp16246578p16249932.html
Sent from the Nutch - User mailing list archive at Nabble.com.
