Re: Custom Parser / Indexer Starting points

Evert Wagenaar Sat, 17 Feb 2018 14:34:21 -0800

You should start with the extension points that Nutch offers. These are
very similar to OSGi and Eclipse plug-ins.


Once you understand this, start writing your parser. Test and
implement.


Hope this helps.

Best regards,


Evert Wagenaar.

http://www.ejwagenaar.com/

On Mon, 12 Feb 2018 at 08:56 Yossi Tamari <[email protected]> wrote:

> The plugin.includes order does not matter.
> To define the order of HtmlParseFilters, use the property
> htmlparsefilter.order.
> To define the order of Parsers, use the file conf/parse-plugins.xml. Note
> that once a single Parser returns a result, the following parsers will not
> be run.
>
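A minimal sketch of the "first result wins" behaviour described above, for readers who want to see the control flow. It is an illustration under stated assumptions, not Nutch's actual code: `SketchParser` and `parseInOrder` are hypothetical names, and the list stands in for whatever order conf/parse-plugins.xml defines.

```java
import java.util.List;
import java.util.Optional;

// Illustration only: SketchParser and parseInOrder are hypothetical names,
// not part of the Nutch API.
final class FirstResultWins {

    /** Stand-in for a parser: returns a result, or empty if it cannot handle the content. */
    @FunctionalInterface
    interface SketchParser {
        Optional<String> parse(String content);
    }

    /** Tries the parsers in the given order and stops at the first one that returns a result. */
    static Optional<String> parseInOrder(List<SketchParser> ordered, String content) {
        for (SketchParser parser : ordered) {
            Optional<String> result = parser.parse(content);
            if (result.isPresent()) {
                return result; // parsers later in the list are never invoked
            }
        }
        return Optional.empty(); // no parser produced a result
    }
}
```

The practical consequence is that a custom parser placed after one that already succeeds for the same content will never see that content.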
> > -----Original Message-----
> > From: David Ferrero [mailto:[email protected]]
> > Sent: 12 February 2018 06:23
> > To: [email protected]
> > Subject: Re: Custom Parser / Indexer Starting points
> >
> > Thank you for all the tips. I think I need to understand better the pipeline
> > of parsers and if/how their plugin.includes order matters.
> >
> > > On Feb 11, 2018, at 1:18 AM, Yossi Tamari <[email protected]> wrote:
> > >
> > > Hi David,
> > >
> > > The interfaces related to extending Nutch parser/indexer are actually
> > > very simple. However, finding up-to-date documented samples is not.
> > > Luckily, Nutch comes with plenty built-in, so my suggestion would be
> > > to pick one, and dive into its implementation. Then just copy its
> > > folder and use it as a skeleton, replacing the specific logic (and plugin
> > > metadata).
> > >
> > > The first question you need to ask yourself is if you really want to
> > > write a Parser/Indexer or just a HtmlParseFilter/IndexingFilter. I
> > > suspect that the default behaviour of the Nutch Parser and Indexer is
> > > useful for you, and you just want to add more functionality (that is
> > > what Any23 is doing). You can chain Filters, so your code could also
> > > leverage the Any23 logic, for example.
> > >
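To make the filter idea concrete, here is a rough sketch of chained filters enriching the output of a default parse. Everything in it is a stand-in: `SketchFilter` and `runChain` are hypothetical names, and the plain `Map` only approximates the metadata a real Nutch HtmlParseFilter or IndexingFilter would receive.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: these types are hypothetical and do not mirror the Nutch interfaces.
final class FilterChainSketch {

    /** Stand-in for a filter: it may read and add metadata, but does not replace the parse. */
    @FunctionalInterface
    interface SketchFilter {
        void apply(String html, Map<String, String> metadata);
    }

    /** Starts from the default parse output and lets each filter add to it, in order. */
    static Map<String, String> runChain(Map<String, String> defaultParse,
                                        String html,
                                        List<SketchFilter> filters) {
        // Copy so the caller's map is left untouched.
        Map<String, String> metadata = new LinkedHashMap<>(defaultParse);
        for (SketchFilter filter : filters) {
            filter.apply(html, metadata); // later filters see what earlier ones added
        }
        return metadata;
    }
}
```

Because each filter sees the metadata added before it, a custom filter could build on fields that an earlier filter (Any23, in the message above) has already extracted.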
> > > The documentation starting point is the Wiki
> > > (https://wiki.apache.org/nutch/). For your specific question, this is
> > > the most relevant page: https://wiki.apache.org/nutch/AboutPlugins.
> > >
> > > One (old) example of writing a custom parser can be found here:
> > > http://www.treselle.com/blog/apache-nutch-with-custom-parser/. I
> > > suggest you Google for more information as needed, but always keep in
> > > mind that things may have changed over time.
> > >
> > > I think the best approach for domain-specific parsers is to have a
> > > custom parser that maps from the URL to the specific code. This can be
> > > just one big if/else, or a Map of domain->code (possibly using
> > > functional programming), or you can even have this map configurable in
> > > some file.
> > >
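The "Map of domain->code" suggestion could look roughly like the sketch below. It is only one possible shape: the class name, the host names, and the two handler methods are invented for illustration, and a real plugin would call something like `extract` from inside its Nutch parser or filter method rather than stand alone.

```java
import java.net.URI;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: all names and hosts here are hypothetical.
final class DomainDispatch {

    private static final Pattern H1 = Pattern.compile("<h1>(.*?)</h1>");

    // Host name -> extraction logic for that site.
    private static final Map<String, Function<String, Map<String, String>>> HANDLERS =
            new LinkedHashMap<>();

    static {
        HANDLERS.put("shop.example.com", DomainDispatch::parseShop);
        HANDLERS.put("news.example.org", DomainDispatch::parseNews);
    }

    /** Picks the handler registered for the URL's host; returns an empty map if none matches. */
    static Map<String, String> extract(String url, String html) {
        // URI.create throws IllegalArgumentException for malformed URLs.
        String host = URI.create(url).getHost();
        if (host == null) {
            return Collections.emptyMap();
        }
        Function<String, Map<String, String>> handler =
                HANDLERS.get(host.toLowerCase(Locale.ROOT));
        return handler == null ? Collections.emptyMap() : handler.apply(html);
    }

    private static Map<String, String> parseShop(String html) {
        return firstHeadingAs("product_name", html);
    }

    private static Map<String, String> parseNews(String html) {
        return firstHeadingAs("headline", html);
    }

    // Toy extraction: stores the first <h1> text under the given field name.
    private static Map<String, String> firstHeadingAs(String field, String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = H1.matcher(html);
        if (m.find()) {
            fields.put(field, m.group(1));
        }
        return fields;
    }
}
```

Loading the host-to-handler table from a configuration file instead of the static block would give the configurable variant mentioned above.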
> > > Once you have more specific questions/problems, I suggest you email the
> > > user list ([email protected]). The dev list ([email protected]) is
> > > intended for discussing code contributions to Nutch, as far as I
> > > understand, and I think fewer people see your messages here. (Also, more
> > > people will benefit from your questions there...)
> > >
> > > In summary, from my experience, writing any one of these plugins is
> > > really easy (discounting your own complex logic, of course), just
> > > implementing one or a few methods, changing some plugin XML file, and
> > > adding your extension to the global build (Ant) files. But to really
> > > understand how the passed data looks, and what you can do with it,
> > > debugging (in local mode) is the ultimate tool, and in the end is much
> > > more time-efficient than looking for information on the web. This is
> > > partly because a lot of the data is passed in Map-like form, so even
> > > the JavaDoc doesn't really tell you what will be there (it depends on
> > > what plugins you have configured, and how you configured those
> > > plugins...).
> > >
> > >    Yossi.
> > >
> > >
> > >> -----Original Message-----
> > >> From: David Ferrero [mailto:[email protected]]
> > >> Sent: 11 February 2018 04:00
> > >> To: [email protected]
> > >> Subject: Custom Parser / Indexer Starting points
> > >>
> > >> tl;dr: If I wanted to learn about the Nutch pipeline at a high level, then
> > >> write a custom parser / indexer of my own, where would a starting point be?
> > >>
> > >> I have used the latest 1.x Nutch to crawl a few specific websites and
> > >> been disappointed with the results. Even after experimenting with the new
> > >> HTML microdata capabilities from the Any23 project updates incorporated by
> > >> Nutch, I am still not (yet) excited. Bottom line is website data is not
> > >> well structured and not super friendly to algorithmic consumption (but you
> > >> already knew that). To that end, I am interested in developing custom
> > >> parsers per internet domain in an effort to capture specific domain data.
> > >> It currently looks like plugin.includes does not allow a per-domain
> > >> approach for parser / indexer. I wonder if someone could guide me toward a
> > >> high-level view of the Nutch data pipeline, then guide me towards where to
> > >> get started creating custom parsers that might support a per-domain
> > >> approach?
> > >>
> > >> Thanks,
> > >> David
> > >
>
