Hi David,

The interfaces for extending the Nutch parser/indexer are actually very simple. However, finding up-to-date documented samples is not. Luckily, Nutch comes with plenty of built-in plugins, so my suggestion would be to pick one and dive into its implementation. Then just copy its folder and use it as a skeleton, replacing the specific logic (and plugin metadata).
The first question you need to ask yourself is whether you really want to write a Parser/Indexer or just an HtmlParseFilter/IndexingFilter. I suspect that the default behaviour of the Nutch Parser and Indexer is useful for you, and you just want to add more functionality (that is what Any23 is doing). You can chain Filters, so your code could also leverage the Any23 logic, for example.

The documentation starting point is the Wiki (https://wiki.apache.org/nutch/). For your specific question, this is the most relevant page: https://wiki.apache.org/nutch/AboutPlugins. One (old) example of writing a custom parser can be found here: http://www.treselle.com/blog/apache-nutch-with-custom-parser/. I suggest you Google for more information as needed, but always keep in mind that things may have changed over time.

I think the best approach for domain-specific parsers is to have a custom parser that maps from the URL to the specific code. This can be just one big if/else, or a Map of domain->code (possibly using functional programming), or you can even make this map configurable in some file.

Once you have more specific questions/problems, I suggest you email [email protected]. [email protected] is intended for discussing code contributions to Nutch, as far as I understand, and I think fewer people see your messages here. (Also, more people will benefit from your questions there...)

In summary, from my experience, writing any one of these plugins is really easy (discounting your own complex logic, of course): you implement one or a few methods, change a plugin XML file, and add your extension to the global build (Ant) files. But to really understand what the passed data looks like, and what you can do with it, debugging (in local mode) is the ultimate tool, and in the end it is much more time-efficient than looking for information on the web.
This is partly because a lot of the data is passed in Map-like form, so even the JavaDoc doesn't really tell you what will be there (it depends on what plugins you have configured, and how you configured those plugins...).

Yossi.

> -----Original Message-----
> From: David Ferrero [mailto:[email protected]]
> Sent: 11 February 2018 04:00
> To: [email protected]
> Subject: Custom Parser / Indexer Starting points
>
> &tldr; If I wanted to learn about the nutch pipeline at a high level, then write a
> custom parser / indexer of my own where would a starting point be?
>
> I have used the latest 1.x Nutch to crawl a few specific websites and been
> disappointed with the results, even after experimenting with new
> html-microdata capabilities with updates to Any23 project incorporated by Nutch, I
> am still not (yet) excited. Bottom line is website data is not well structured and
> not super friendly to algorithmic consumption (but you already knew that). To
> that end, I am interested to develop custom parsers per internet domain in an
> effort to capture specific domain data. It currently looks like the plugin.includes
> does not allow a per domain-based approach for parser / indexer. I wonder if
> someone could guide me toward a high level view of the Nutch data pipeline,
> then guide me towards where to get started for creating custom parsers that
> might support a per-domain approach?
>
> Thanks,
> David
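
P.S. To make the "Map of domain->code" idea above a bit more concrete, here is a minimal sketch in plain Java. It deliberately does not use any Nutch classes: the class name DomainDispatcher, the methods register/extract, and the example hosts are all made up for illustration, and I am assuming each per-domain handler can be modelled as a simple function from page text to extracted text. In a real plugin this dispatch would presumably sit inside your filter implementation (for example an HtmlParseFilter), with the handlers working on whatever data Nutch passes in.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from Nutch itself.
class DomainDispatcher {
    // Maps a lower-cased host name to the extraction logic for that site.
    private final Map<String, Function<String, String>> handlers = new HashMap<>();
    // Applied when no domain-specific handler is registered for the host.
    private final Function<String, String> fallback;

    DomainDispatcher(Function<String, String> fallback) {
        this.fallback = fallback;
    }

    // Registers a handler for one host; returns this so calls can be chained.
    DomainDispatcher register(String host, Function<String, String> handler) {
        handlers.put(host.toLowerCase(Locale.ROOT), handler);
        return this;
    }

    // Picks the handler by the URL's host and applies it to the page text.
    String extract(String url, String pageText) {
        String host = URI.create(url).getHost();
        Function<String, String> handler = (host == null)
                ? fallback
                : handlers.getOrDefault(host.toLowerCase(Locale.ROOT), fallback);
        return handler.apply(pageText);
    }

    public static void main(String[] args) {
        DomainDispatcher dispatcher = new DomainDispatcher(text -> "generic:" + text)
                .register("shop.example.com", text -> "shop:" + text)
                .register("news.example.org", text -> "news:" + text);

        System.out.println(dispatcher.extract("https://shop.example.com/item/1", "page body"));
        System.out.println(dispatcher.extract("https://unknown.example.net/", "page body"));
    }
}
```

The same map could be filled from a configuration file instead of hard-coded register calls, which is the "configurable" variant I mentioned. Matching on the exact host is the simplest choice here; matching on a registered domain or a URL pattern would need more logic.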

