Hi, Vinci.

I have found that there is already a feed parser in the Nutch trunk that is
similar to mine but better.

I think you can give it a try. Since the Rome library is better than
commons-feedparser, I think it should be the right plugin for parsing
RSS/Atom feeds.
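
To illustrate the title-over-description preference discussed in the quoted
messages below, here is a minimal sketch. It uses only Python's standard
library, not the Nutch plugin, Rome, or commons-feedparser, so it shows the
idea rather than how either plugin is implemented. The sample feed and the
names anchor_text and outlinks are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, made up for illustration only.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example channel</title>
    <link>http://example.com/</link>
    <description>Example feed</description>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <description>Longer summary of the first post.</description>
    </item>
    <item>
      <link>http://example.com/2</link>
      <description>Item without a title.</description>
    </item>
  </channel>
</rss>"""


def anchor_text(item):
    """Return the item's <title> text, falling back to <description>."""
    for tag in ("title", "description"):
        text = (item.findtext(tag) or "").strip()
        if text:
            return text
    return ""


def outlinks(rss_xml):
    """Return (link, anchor text) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [
        ((item.findtext("link") or "").strip(), anchor_text(item))
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for link, anchor in outlinks(SAMPLE_RSS):
        print(link, "->", anchor)
```

The description is used only when an item has no usable title, so the
title is never silently dropped.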

Best regards.

sishen

On Mon, Mar 24, 2008 at 8:12 PM, Vinci <[EMAIL PROTECTED]> wrote:

>
> Hi sishen,
>
> You should; Atom feed parsing has been broken for quite a long time.
> If you don't want to replace the original plugin, just use another name.
> Especially since your plugin only works for Atom feeds, I think you should
> use a name like parse-atom or atom-parser. Please refer to the RSS parser
> plugin for the naming convention.
>
> Follow-up: after checking more of the feeds I crawled, apart from the
> broken characters, I found that not every title is mis-parsed: some
> <title> text is parsed correctly and some is not, even though both are
> well-formed...
>
> Thank you,
> Vinci
>
>
> sishen wrote:
> >
> > I also prefer the title to the description.
> >
> > Also, I found there are some problems parsing Atom feeds with the
> > "commons-feedparser" library.
> > I have implemented a new plugin to fix the problem with
> > rome<https://rome.dev.java.net/>.
> >
> >
> > But I am not sure whether I should submit it to the Nutch trunk.
> >
> > Best regards.
> >
> > sishen
> >
> > On Mon, Mar 24, 2008 at 3:36 PM, Vinci <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> Hi all,
> >> I found that the RSS parser plugin uses the text content of
> >> <description> as anchor text rather than the <title>, so it always
> >> indexes the description, while the title text is never indexed or used
> >> as anchor text.
> >>
> >> But actually the title is much more valuable and should be used as
> >> anchor text.
> >>
> >> Is this a bug or a misunderstanding of RSS? If this is a bug, can
> >> anybody post it in JIRA?
> >>
> >> Thank you for your attention.
> >> --
> >> View this message in context:
> >> http://www.nabble.com/RSS-parser-plugin-bug--tp16246578p16246578.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/RSS-parser-plugin-bug--tp16246578p16249932.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
