The order of the entries in plugin.includes does not matter. To define the order of HtmlParseFilters, use the property htmlparsefilter.order. To define the order of Parsers, use the file conf/parse-plugins.xml. Note that once a single Parser returns a result, the following parsers will not be run.

> -----Original Message-----
> From: David Ferrero [mailto:[email protected]]
> Sent: 12 February 2018 06:23
> To: [email protected]
> Subject: Re: Custom Parser / Indexer Starting points
>
> Thank you for all the tips. I think I need to understand better the pipeline of
> parsers and if/how their plug-in.includes order matters.
>
>
> On Feb 11, 2018, at 1:18 AM, Yossi Tamari <[email protected]> wrote:
> >
> > Hi David,
> >
> > The interfaces related to extending Nutch parser/indexer are actually
> > very simple. However, finding up-to-date documented samples is not.
> > Luckily, Nutch comes with plenty built-in, so my suggestion would be
> > to pick one, and dive into its implementation. Then just copy its
> > folder and use it as a skeleton, replacing the specific logic (and plugin
> > metadata).
> >
> > The first question you need to ask yourself is if you really want to
> > write a Parser/Indexer or just a HtmlParseFilter/IndexingFilter. I
> > suspect that the default behaviour of the Nutch Parser and Indexer is
> > useful for you, and you just want to add more functionality (that is
> > what Any23 is doing). You can chain Filters, so your code could also
> > leverage the Any23 logic, for example.
> >
> > The documentation starting point is the Wiki
> > (https://wiki.apache.org/nutch/). For your specific question, this is
> > the most relevant page: https://wiki.apache.org/nutch/AboutPlugins.
> >
> > One (old) example of writing a custom parser can be found here:
> > http://www.treselle.com/blog/apache-nutch-with-custom-parser/. I
> > suggest you Google for more information as needed, but always keep in
> > mind that things may have changed over time.
> >
> > I think the best approach for domain-specific parsers is to have a
> > custom parser that maps from the URL to the specific code. This can be
> > just one big if/else, or a Map of domain->code (possibly using
> > functional programming), or you can even have this map configurable in some
> > file.
> >
> > Once you have more specific questions/problems, I suggest you email
> > [email protected]. [email protected] is intended for discussing
> > code contributions to Nutch, as far as I understand, and I think less
> > people see your messages here. (Also, more people will benefit from
> > your questions there...)
> >
> > In summary, from my experience, writing any one of these plugins is
> > really easy (discounting your own complex logic, of course), just
> > implementing one or a few methods, changing some plugin XML file, and
> > adding your extension to the global build (Ant) files. But to really
> > understand how the passed data looks, and what you can do with it,
> > debugging (in local mode) is the ultimate tool, and in the end is much
> > more time-efficient than looking for information on the web. This is
> > partly because a lot of the data is passed in Map-like form, so even
> > the JavaDoc doesn't really tell you what will be there (it depends on
> > what plugins you have configured, and how you configured those plugins...).
> >
> > Yossi.
> >
> >
> >> -----Original Message-----
> >> From: David Ferrero [mailto:[email protected]]
> >> Sent: 11 February 2018 04:00
> >> To: [email protected]
> >> Subject: Custom Parser / Indexer Starting points
> >>
> >> &tldr; If I wanted to learn about the nutch pipeline at a high level, then
> >> write a custom parser / indexer of my own where would a starting point be?
> >>
> >> I have used the latest 1.x Nutch to crawl a few specific websites and
> >> been disappointed with the results, even after experimenting with new
> >> html-microdata capabilities with updates to Any23 project incorporated by
> >> Nutch, I am still not (yet) excited. Bottom line is website data is not
> >> well structured and not super friendly to algorithmic consumption (but you
> >> already knew that). To that end, I am interested to developer custom
> >> parsers per internet domain in an effort to capture specific domain data.
> >> It currently looks like the plugin.includes does not allow a per
> >> domain-based approach for parser / indexer. I wonder if someone could
> >> guide me toward a high level view of the Nutch data pipeline, then guide
> >> me towards where to get started for creating custom parsers that might
> >> support a per-domain approach?
> >>
> >> Thanks,
> >> David
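
For the "Map of domain->code" idea in the quoted message, a minimal sketch in plain Java could look like the one below. It deliberately uses no Nutch classes: DomainDispatch, register and extract are hypothetical names made up for illustration, and each handler simply receives the page text as a String and returns a map of extracted fields. In a real plugin the lookup would presumably be called from your HtmlParseFilter implementation, using the URL and content that Nutch passes in.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Illustrative only: routes a page to per-domain extraction code by host name. */
public class DomainDispatch {

    // Host name (lower-cased) -> handler that turns page text into extracted fields.
    private final Map<String, Function<String, Map<String, String>>> handlers = new HashMap<>();

    /** Registers the extraction code to use for one host. */
    public void register(String host, Function<String, Map<String, String>> handler) {
        handlers.put(host.toLowerCase(Locale.ROOT), handler);
    }

    /** Runs the handler registered for the URL's host, or returns an empty map. */
    public Map<String, String> extract(String url, String pageText) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return Map.of(); // malformed URL: nothing to dispatch on
        }
        if (host == null) {
            return Map.of();
        }
        Function<String, Map<String, String>> handler =
                handlers.get(host.toLowerCase(Locale.ROOT));
        return handler == null ? Map.of() : handler.apply(pageText);
    }

    public static void main(String[] args) {
        DomainDispatch dispatch = new DomainDispatch();
        // Hypothetical handler: treats the trimmed page text as the title.
        dispatch.register("example.com", text -> Map.of("title", text.trim()));

        System.out.println(dispatch.extract("https://example.com/page", "  Hello  "));
        System.out.println(dispatch.extract("https://unknown.org/", "ignored"));
    }
}
```

Pages from hosts without a registered handler get an empty result here, so the default Nutch parsing would be left untouched for them. The map could equally be filled from a configuration file instead of code, as suggested in the quoted message.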

