Hi sishen,

You should; Atom feed parsing has been broken for quite a long time. If you don't want to replace the original plugin, just use another name. Especially since your plugin only works for Atom feeds, I think you should use a name like parse-atom or atom-parser. Please refer to the RSS parser plugin for the naming convention.

Follow-up: After checking more of the feeds I crawled, I found that, apart from the broken characters, not every title is mis-parsed: some <title> text is parsed correctly and some is not, even though both are well-formed...

Thank you,
Vinci

sishen wrote:
>
> I also prefer title to description.
>
> Also, I found there are some problems parsing the Atom feed with the lib
> "commons-feedparser".
> I have implemented a new plugin to fix the problem with
> ROME <https://rome.dev.java.net/>.
>
> But I doubt whether I should submit it to the Nutch trunk?
>
> Best regards.
>
> sishen
>
> On Mon, Mar 24, 2008 at 3:36 PM, Vinci <[EMAIL PROTECTED]> wrote:
>
>>
>> Hi all,
>> I found that the RSS parser plugin uses the content text in
>> <description> as anchor text rather than the <title>, so it always
>> indexes the description, while the title text is never indexed or used
>> as anchor text.
>>
>> But actually the title is much more valuable and should be used as anchor
>> text.
>>
>> Is this a bug or a misunderstanding of RSS? If this is a bug, can anybody
>> post it in JIRA?
>>
>> Thank you for your attention.
>> --
>> View this message in context:
>> http://www.nabble.com/RSS-parser-plugin-bug--tp16246578p16246578.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>
--
View this message in context: http://www.nabble.com/RSS-parser-plugin-bug--tp16246578p16249932.html
Sent from the Nutch - User mailing list archive at Nabble.com.
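
For illustration, the behavior requested in the thread (use the <title> of each RSS item as anchor text, and fall back to <description> only when the title is missing) could look roughly like the sketch below. This is not the Nutch RSS plugin's code and it uses neither commons-feedparser nor ROME; it relies only on the JDK's built-in DOM parser, and the class name AnchorTextSketch and the methods childText and anchorTexts are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: NOT the Nutch plugin, just the title-first idea.
public class AnchorTextSketch {

    // Text of the first direct child element with the given name, or "" if absent.
    static String childText(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && name.equals(n.getNodeName())) {
                return n.getTextContent().trim();
            }
        }
        return "";
    }

    // One anchor text per <item>: the title if present, otherwise the description.
    static List<String> anchorTexts(String rssXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(rssXml)));
        NodeList items = doc.getElementsByTagName("item");
        List<String> anchors = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String title = childText(item, "title");
            anchors.add(title.isEmpty() ? childText(item, "description") : title);
        }
        return anchors;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample feed: the second item has no <title>.
        String rss = "<rss version=\"2.0\"><channel><title>Feed</title>"
                + "<item><title>First post</title><description>Summary one</description></item>"
                + "<item><description>Summary two</description></item>"
                + "</channel></rss>";
        System.out.println(anchorTexts(rss)); // prints [First post, Summary two]
    }
}
```

Looking only at direct children of each <item> keeps the channel-level <title> from being picked up by mistake. A real plugin would presumably also need to handle Atom <entry> elements and character-encoding issues, which this sketch ignores.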
