RE: Custom Parser / Indexer Starting points

Yossi Tamari Sun, 11 Feb 2018 23:57:07 -0800

The order of the entries in plugin.includes does not matter.
To define the order of HtmlParseFilters, use the property
htmlparsefilter.order.
To define the order of Parsers, use the file conf/parse-plugins.xml. Note
that once a single Parser returns a result, the following parsers will not
be run.

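To make the difference between the two chains concrete, here is a minimal
sketch in plain Java. It does not use the Nutch API: the Parser and Filter
interfaces and the ChainSemantics class are hypothetical stand-ins, and the
filter behaviour shown (every filter runs, each seeing the previous output)
is my understanding of how ordered filters behave, not code taken from Nutch.

```java
import java.util.List;
import java.util.Optional;

public class ChainSemantics {

    /** Hypothetical stand-in for a parser: may or may not produce a result. */
    public interface Parser {
        Optional<String> parse(String content);
    }

    /** Hypothetical stand-in for a parse filter: always transforms its input. */
    public interface Filter {
        String filter(String parsed);
    }

    /** Parsers are tried in order; the first one that returns a result wins. */
    public static Optional<String> runParsers(List<Parser> parsers, String content) {
        for (Parser parser : parsers) {
            Optional<String> result = parser.parse(content);
            if (result.isPresent()) {
                return result; // the remaining parsers are never called
            }
        }
        return Optional.empty();
    }

    /** Filters all run, in order, each receiving the previous filter's output. */
    public static String runFilters(List<Filter> filters, String parsed) {
        String current = parsed;
        for (Filter filter : filters) {
            current = filter.filter(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<Parser> parsers = List.of(
                content -> Optional.empty(),               // declines the content
                content -> Optional.of("html:" + content), // first result, wins
                content -> Optional.of("tika:" + content)  // never reached
        );
        List<Filter> filters = List.of(
                parsed -> parsed + "+microdata",
                parsed -> parsed + "+custom"
        );
        String parsed = runParsers(parsers, "page").orElse("unparsed");
        System.out.println(runFilters(filters, parsed));
    }
}
```

So swapping two parsers can change which one handles a page, while swapping
two filters only changes which of them sees the other's output.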

> -----Original Message-----
> From: David Ferrero [mailto:[email protected]]
> Sent: 12 February 2018 06:23
> To: [email protected]
> Subject: Re: Custom Parser / Indexer Starting points
> 
> Thank you for all the tips. I think I need to better understand the
> pipeline of parsers and if/how their plugin.includes order matters.
> 
> > On Feb 11, 2018, at 1:18 AM, Yossi Tamari <[email protected]> wrote:
> >
> > Hi David,
> >
> > The interfaces related to extending Nutch parser/indexer are actually
> > very simple. However, finding up-to-date documented samples is not.
> > Luckily, Nutch comes with plenty built-in, so my suggestion would be
> > to pick one, and dive into its implementation. Then just copy its
> > folder and use it as a skeleton, replacing the specific logic (and
> > plugin metadata).
> >
> > The first question you need to ask yourself is if you really want to
> > write a Parser/Indexer or just a HtmlParseFilter/IndexingFilter. I
> > suspect that the default behaviour of the Nutch Parser and Indexer is
> > useful for you, and you just want to add more functionality (that is
> > what Any23 is doing). You can chain Filters, so your code could also
> > leverage the Any23 logic, for example.
> >
> > The documentation starting point is the Wiki
> > (https://wiki.apache.org/nutch/). For your specific question, this is
> > the most relevant page: https://wiki.apache.org/nutch/AboutPlugins.
> >
> > One (old) example of writing a custom parser can be found here:
> > http://www.treselle.com/blog/apache-nutch-with-custom-parser/. I
> > suggest you Google for more information as needed, but always keep in
> > mind that things may have changed over time.
> >
> > I think the best approach for domain-specific parsers is to have a
> > custom parser that maps from the URL to the specific code. This can be
> > just one big if/else, or a Map of domain->code (possibly using
> > functional programming), or you can even have this map configurable in
> > some file.
> >
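The "Map of domain->code" idea above can be sketched roughly as follows.
This is plain Java rather than Nutch code: the DomainDispatch class, its
method names, and the choice of Map<String, String> for the extracted
fields are all assumptions made for illustration. It matches on the exact
host name only; matching on a registered domain would need extra logic.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class DomainDispatch {

    // Hypothetical handler type: page text in, extracted fields out.
    private final Map<String, Function<String, Map<String, String>>> handlers = new HashMap<>();
    private final Function<String, Map<String, String>> fallback;

    /** The fallback handles every host that has no registered handler. */
    public DomainDispatch(Function<String, Map<String, String>> fallback) {
        this.fallback = fallback;
    }

    /** Registers domain-specific code for one exact host name. */
    public void register(String host, Function<String, Map<String, String>> handler) {
        handlers.put(host.toLowerCase(Locale.ROOT), handler);
    }

    /**
     * Looks up the handler by the URL's host and applies it to the page text.
     * URI.create throws IllegalArgumentException on a malformed URL, and
     * getHost() returns null when the URL has no host; the null case goes
     * to the fallback here.
     */
    public Map<String, String> extract(String url, String pageText) {
        String host = URI.create(url).getHost();
        Function<String, Map<String, String>> handler = (host == null)
                ? fallback
                : handlers.getOrDefault(host.toLowerCase(Locale.ROOT), fallback);
        return handler.apply(pageText);
    }

    public static void main(String[] args) {
        DomainDispatch dispatch = new DomainDispatch(text -> Map.of("kind", "generic"));
        dispatch.register("shop.example.com", text -> Map.of("kind", "product"));

        System.out.println(dispatch.extract("https://shop.example.com/item/1", "..."));
        System.out.println(dispatch.extract("https://blog.example.org/post", "..."));
    }
}
```

The host-to-handler table could just as well be loaded from a configuration
file at startup instead of being registered in code.
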
> > Once you have more specific questions/problems, I suggest you email
> > [email protected]. [email protected] is intended for discussing
> > code contributions to Nutch, as far as I understand, and I think fewer
> > people see your messages here. (Also, more people will benefit from
> > your questions there...)
> >
> > In summary, from my experience, writing any one of these plugins is
> > really easy (discounting your own complex logic, of course), just
> > implementing one or a few methods, changing some plugin XML file, and
> > adding your extension to the global build (Ant) files. But to really
> > understand how the passed data looks, and what you can do with it,
> > debugging (in local mode) is the ultimate tool, and in the end is much
> > more time-efficient than looking for information on the web. This is
> > partly because a lot of the data is passed in Map-like form, so even
> > the JavaDoc doesn't really tell you what will be there (it depends on
> > what plugins you have configured, and how you configured those
> > plugins...).
> >
> >    Yossi.
> >
> >
> >> -----Original Message-----
> >> From: David Ferrero [mailto:[email protected]]
> >> Sent: 11 February 2018 04:00
> >> To: [email protected]
> >> Subject: Custom Parser / Indexer Starting points
> >>
> >> tl;dr: If I wanted to learn about the Nutch pipeline at a high level,
> >> then write a custom parser / indexer of my own, where would a starting
> >> point be?
> >>
> >> I have used the latest 1.x Nutch to crawl a few specific websites and
> >> have been disappointed with the results. Even after experimenting with
> >> the new html-microdata capabilities from the Any23 project updates
> >> incorporated by Nutch, I am still not (yet) excited. Bottom line is
> >> that website data is not well structured and not super friendly to
> >> algorithmic consumption (but you already knew that). To that end, I am
> >> interested in developing custom parsers per internet domain in an
> >> effort to capture specific domain data. It currently looks like
> >> plugin.includes does not allow a per-domain approach for parser /
> >> indexer. I wonder if someone could guide me toward a high-level view of
> >> the Nutch data pipeline, then guide me towards where to get started
> >> creating custom parsers that might support a per-domain approach?
> >>
> >> Thanks,
> >> David
> >
