RE: Custom Parser / Indexer Starting points

Yossi Tamari Sun, 11 Feb 2018 00:19:03 -0800

Hi David,

The interfaces related to extending the Nutch parser/indexer are actually very
simple. However, finding up-to-date documented samples is not. Luckily,
Nutch comes with plenty of built-in plugins, so my suggestion would be to pick
one and dive into its implementation. Then just copy its folder and use it as
a skeleton, replacing the specific logic (and plugin metadata).


The first question you need to ask yourself is whether you really want to
write a Parser/Indexer or just an HtmlParseFilter/IndexingFilter. I suspect
that the default behaviour of the Nutch Parser and Indexer is useful for you,
and you just want to add more functionality (that is what Any23 is doing).
You can chain Filters, so your code could also leverage the Any23 logic, for
example.

The documentation starting point is the Wiki
(https://wiki.apache.org/nutch/). For your specific question, this is the
most relevant page: https://wiki.apache.org/nutch/AboutPlugins.

One (old) example of writing a custom parser can be found here:
http://www.treselle.com/blog/apache-nutch-with-custom-parser/. I suggest you
Google for more information as needed, but always keep in mind that things
may have changed over time.

I think the best approach for domain-specific parsers is to have a custom
parser that maps from the URL to the specific code. This can be just one big
if/else, or a Map of domain->code (possibly using functional programming),
or you can even have this map configurable in some file.
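
To make the Map idea concrete, here is a minimal sketch in plain Java. It
deliberately does not use any Nutch classes: DomainDispatcher, register and
extract are hypothetical names I made up for illustration, and the
"extracted fields" are just a Map of strings. In a real plugin you would call
something like this from inside your filter, passing in the URL and content
that Nutch hands you.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper (not part of Nutch): picks extraction code by URL host. */
class DomainDispatcher {
    // host name (lower case) -> code that turns raw HTML into extracted fields
    private final Map<String, Function<String, Map<String, String>>> handlers = new HashMap<>();
    // used when no handler is registered for the host
    private final Function<String, Map<String, String>> fallback;

    DomainDispatcher(Function<String, Map<String, String>> fallback) {
        this.fallback = fallback;
    }

    /** Registers the handler for one host; host names are matched case-insensitively. */
    void register(String host, Function<String, Map<String, String>> handler) {
        handlers.put(host.toLowerCase(Locale.ROOT), handler);
    }

    /**
     * Looks up the handler for the URL's host and applies it to the HTML.
     * URI.create throws IllegalArgumentException on a malformed URL, so real
     * code would probably want to catch that and use the fallback instead.
     */
    Map<String, String> extract(String url, String html) {
        String host = URI.create(url).getHost();
        Function<String, Map<String, String>> handler = (host == null)
                ? fallback
                : handlers.getOrDefault(host.toLowerCase(Locale.ROOT), fallback);
        return handler.apply(html);
    }
}
```

You would register one lambda (or method reference) per domain at start-up
and let everything else go through the fallback. Note that this sketch
matches the exact host only, so "www.example.com" and "example.com" would
need separate entries unless you normalize the host first. Loading the
host->handler pairs from a file is a straightforward extension.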

Once you have more specific questions/problems, I suggest you email
[email protected]. [email protected] is intended for discussing code
contributions to Nutch, as far as I understand, and I think fewer people see
your messages here. (Also, more people will benefit from your questions
there...)

In summary, from my experience, writing any one of these plugins is really
easy (discounting your own complex logic, of course): you implement one or a
few methods, change a plugin XML file, and add your extension to the global
build (Ant) files. But to really understand what the passed data looks like,
and what you can do with it, debugging (in local mode) is the ultimate tool,
and in the end it is much more time-efficient than looking for information on
the web. This is partly because a lot of the data is passed in Map-like form,
so even the JavaDoc doesn't really tell you what will be there (it depends on
which plugins you have configured, and how you configured those plugins...).

        Yossi.


> -----Original Message-----
> From: David Ferrero [mailto:[email protected]]
> Sent: 11 February 2018 04:00
> To: [email protected]
> Subject: Custom Parser / Indexer Starting points
> 
> &tldr; If I wanted to learn about the nutch pipeline at a high level, then
> write a custom parser / indexer of my own where would a starting point be?
> 
> I have used the latest 1.x Nutch to crawl a few specific websites and been
> disappointed with the results, even after experimenting with new
> html-microdata capabilities with updates to Any23 project incorporated by
> Nutch, I am still not (yet) excited. Bottom line is website data is not well
> structured and not super friendly to algorithmic consumption (but you
> already knew that). To that end, I am interested to developer custom parsers
> per internet domain in an effort to capture specific domain data. It
> currently looks like the plugin.includes does not allow a per domain-based
> approach for parser / indexer. I wonder if someone could guide me toward a
> high level view of the Nutch data pipeline, then guide me towards where to
> get started for creating custom parsers that might support a per-domain
> approach?
> 
> Thanks,
> David
