You should start with the extension points that Nutch offers. These are very similar to OSGi and Eclipse plug-ins.
Once you understand this, start writing your parser. Test and implement.

Hope this helps.

Best regards,
Evert Wagenaar.
http://www.ejwagenaar.com/

On Mon, 12 Feb 2018 at 08:56 Yossi Tamari <[email protected]> wrote:

> The plugin.includes order does not matter.
> To define the order of HtmlParseFilters, use the property
> htmlparsefilter.order.
> To define the order of Parsers, use the file conf/parse-plugins.xml. Note
> that once a single Parser returns a result, the following parsers will not
> be run.
>
> > -----Original Message-----
> > From: David Ferrero [mailto:[email protected]]
> > Sent: 12 February 2018 06:23
> > To: [email protected]
> > Subject: Re: Custom Parser / Indexer Starting points
> >
> > Thank you for all the tips. I think I need to understand better the
> > pipeline of parsers and if/how their plugin.includes order matters.
> >
> > > On Feb 11, 2018, at 1:18 AM, Yossi Tamari <[email protected]> wrote:
> > >
> > > Hi David,
> > >
> > > The interfaces related to extending Nutch parser/indexer are actually
> > > very simple. However, finding up-to-date documented samples is not.
> > > Luckily, Nutch comes with plenty built-in, so my suggestion would be
> > > to pick one, and dive into its implementation. Then just copy its
> > > folder and use it as a skeleton, replacing the specific logic (and
> > > plugin metadata).
> > >
> > > The first question you need to ask yourself is if you really want to
> > > write a Parser/Indexer or just a HtmlParseFilter/IndexingFilter. I
> > > suspect that the default behaviour of the Nutch Parser and Indexer is
> > > useful for you, and you just want to add more functionality (that is
> > > what Any23 is doing). You can chain Filters, so your code could also
> > > leverage the Any23 logic, for example.
> > >
> > > The documentation starting point is the Wiki
> > > (https://wiki.apache.org/nutch/). For your specific question, this is
> > > the most relevant page: https://wiki.apache.org/nutch/AboutPlugins.
> > >
> > > One (old) example of writing a custom parser can be found here:
> > > http://www.treselle.com/blog/apache-nutch-with-custom-parser/. I
> > > suggest you Google for more information as needed, but always keep in
> > > mind that things may have changed over time.
> > >
> > > I think the best approach for domain-specific parsers is to have a
> > > custom parser that maps from the URL to the specific code. This can be
> > > just one big if/else, or a Map of domain->code (possibly using
> > > functional programming), or you can even have this map configurable in
> > > some file.
> > >
> > > Once you have more specific questions/problems, I suggest you email
> > > [email protected]. [email protected] is intended for discussing
> > > code contributions to Nutch, as far as I understand, and I think fewer
> > > people see your messages here. (Also, more people will benefit from
> > > your questions there...)
> > >
> > > In summary, from my experience, writing any one of these plugins is
> > > really easy (discounting your own complex logic, of course), just
> > > implementing one or a few methods, changing some plugin XML file, and
> > > adding your extension to the global build (Ant) files. But to really
> > > understand how the passed data looks, and what you can do with it,
> > > debugging (in local mode) is the ultimate tool, and in the end is much
> > > more time-efficient than looking for information on the web. This is
> > > partly because a lot of the data is passed in Map-like form, so even
> > > the JavaDoc doesn't really tell you what will be there (it depends on
> > > what plugins you have configured, and how you configured those
> > > plugins...).
> > >
> > > Yossi.
> > >
> > >> -----Original Message-----
> > >> From: David Ferrero [mailto:[email protected]]
> > >> Sent: 11 February 2018 04:00
> > >> To: [email protected]
> > >> Subject: Custom Parser / Indexer Starting points
> > >>
> > >> TL;DR: If I wanted to learn about the Nutch pipeline at a high level,
> > >> then write a custom parser / indexer of my own, where would a
> > >> starting point be?
> > >>
> > >> I have used the latest 1.x Nutch to crawl a few specific websites and
> > >> been disappointed with the results. Even after experimenting with new
> > >> html-microdata capabilities with updates to the Any23 project
> > >> incorporated by Nutch, I am still not (yet) excited. Bottom line is
> > >> website data is not well structured and not super friendly to
> > >> algorithmic consumption (but you already knew that). To that end, I am
> > >> interested in developing custom parsers per internet domain in an
> > >> effort to capture specific domain data. It currently looks like
> > >> plugin.includes does not allow a per-domain approach for parser /
> > >> indexer. I wonder if someone could guide me toward a high level view
> > >> of the Nutch data pipeline, then guide me towards where to get started
> > >> for creating custom parsers that might support a per-domain approach?
> > >>
> > >> Thanks,
> > >> David
> > >
> >

--
Sent from Gmail IPad
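
For what it's worth, the "Map of domain->code" idea Yossi mentions in the quoted message could look roughly like the sketch below. It is plain Java and deliberately uses no Nutch classes, so it only shows the dispatch part. The class name DomainDispatch, the handler lambdas, and the example.com / example.org domains are all made up for illustration; in a real plugin the lookup would presumably sit inside your own parser or HtmlParseFilter implementation.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class DomainDispatch {

    // Hypothetical per-domain handlers: each takes the page text and
    // returns some extracted value. Real handlers would hold your
    // domain-specific parsing logic.
    static final Map<String, Function<String, String>> HANDLERS = Map.of(
        "example.com", text -> "example:" + text.length(),
        "example.org", text -> "org:" + text.trim().toUpperCase(Locale.ROOT)
    );

    // Fallback used when no domain-specific handler is registered.
    static final Function<String, String> DEFAULT = text -> "default:" + text;

    // Reduce a URL to a lookup key: lower-cased host without a leading "www.".
    static String hostOf(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return "";
        }
        host = host.toLowerCase(Locale.ROOT);
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    // Pick the handler for the URL's domain and apply it to the page text.
    static String extract(String url, String text) {
        return HANDLERS.getOrDefault(hostOf(url), DEFAULT).apply(text);
    }

    public static void main(String[] args) {
        System.out.println(extract("http://www.example.com/a", "hello"));
        System.out.println(extract("https://example.org/b", " hi "));
        System.out.println(extract("http://other.net/", "x"));
    }
}
```

The map could just as well be loaded from a configuration file, as suggested in the quoted message; a hard-coded Map.of is only the shortest way to show the shape of the lookup.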

