Hi Vinci,

I have found that the Nutch trunk already has a feed parser that is similar to mine, but better.
I think you should give it a try. Since the Rome lib is better than commons-feedparser, I think it should be the right plugin for parsing RSS/Atom feeds.

Best regards.

sishen

On Mon, Mar 24, 2008 at 8:12 PM, Vinci <[EMAIL PROTECTED]> wrote:

>
> Hi sishen,
>
> You should; Atom feed parsing has been broken for quite a long time.
> If you don't want to replace the original plugin, just use another name.
> Especially since your plugin only works for Atom feeds, I think you should
> use a name like parse-atom or atom-parser... please refer to the RSS parser
> plugin for the naming convention.
>
> Follow-up: after checking more of the feeds I crawled, apart from the broken
> characters, I found that not every title is mis-parsed: some <title> text is
> parsed correctly and some is not, even though both are well-formed...
>
> Thank you,
> Vinci
>
>
> sishen wrote:
> >
> > I also prefer the title to the description.
> >
> > Also, I found some problems parsing Atom feeds with the lib
> > "commons-feedparser".
> > I have implemented a new plugin to fix the problem with
> > rome<https://rome.dev.java.net/>.
> >
> > But I am not sure whether I should submit it to the Nutch trunk.
> >
> > Best regards.
> >
> > sishen
> >
> > On Mon, Mar 24, 2008 at 3:36 PM, Vinci <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> Hi all,
> >> I found that the RSS parser plugin uses the text in <description> as
> >> the anchor text rather than the <title>, so it always indexes the
> >> description, while the title text is never indexed or used as anchor
> >> text.
> >>
> >> But the title is actually much more valuable and should be used as the
> >> anchor text.
> >>
> >> Is this a bug or a misunderstanding of RSS? If it is a bug, can anybody
> >> post it in JIRA?
> >>
> >> Thank you for your attention.
> >> --
> >> View this message in context:
> >> http://www.nabble.com/RSS-parser-plugin-bug--tp16246578p16246578.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/RSS-parser-plugin-bug--tp16246578p16249932.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
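
For anyone following the thread, here is a minimal sketch of the behaviour Vinci asks for: use the <title> as anchor text and fall back to the description only when the title is missing. This is not the Nutch plugin code and does not use Rome or commons-feedparser; it is plain Python with the standard library, the function name `outlinks` is hypothetical, and it assumes un-namespaced RSS 2.0 items and Atom 1.0 entries only.

```python
import xml.etree.ElementTree as ET

# Atom 1.0 elements live in this namespace; RSS 2.0 elements have none.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def _text(element):
    """Return the stripped text of an element, or '' if it is missing."""
    if element is None:
        return ""
    return "".join(element.itertext()).strip()


def outlinks(feed_xml):
    """Return (url, anchor_text) pairs, preferring the title as anchor text.

    Hypothetical helper for illustration; assumes RSS 2.0 or Atom 1.0 input.
    """
    root = ET.fromstring(feed_xml)
    results = []

    # RSS 2.0: <item> with <title>, <link> and <description> children.
    for item in root.iter("item"):
        url = _text(item.find("link"))
        anchor = _text(item.find("title")) or _text(item.find("description"))
        if url:
            results.append((url, anchor))

    # Atom 1.0: <entry> with <title>, <link href="..."/> and <summary>.
    # Simplification: only the first <link> of each entry is used.
    for entry in root.iter(ATOM_NS + "entry"):
        link = entry.find(ATOM_NS + "link")
        url = link.get("href", "") if link is not None else ""
        anchor = (_text(entry.find(ATOM_NS + "title"))
                  or _text(entry.find(ATOM_NS + "summary")))
        if url:
            results.append((url, anchor))

    return results
```

Calling `outlinks()` on the raw feed text gives one pair per item or entry; an item without a title still gets its description as anchor text, so nothing that is indexed today would be lost by switching the preference.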
