Thank you for all the tips. I think I need to better understand the pipeline of parsers and whether/how their order in plugin.includes matters.
> On Feb 11, 2018, at 1:18 AM, Yossi Tamari <[email protected]> wrote:
>
> Hi David,
>
> The interfaces related to extending Nutch parser/indexer are actually very
> simple. However, finding up-to-date documented samples is not. Luckily,
> Nutch comes with plenty built-in, so my suggestion would be to pick one, and
> dive into its implementation. Then just copy its folder and use it as a
> skeleton, replacing the specific logic (and plugin metadata).
>
> The first question you need to ask yourself is if you really want to write a
> Parser/Indexer or just a HtmlParseFilter/IndexingFilter. I suspect that the
> default behaviour of the Nutch Parser and Indexer is useful for you, and you
> just want to add more functionality (that is what Any23 is doing). You can
> chain Filters, so your code could also leverage the Any23 logic, for
> example.
>
> The documentation starting point is the Wiki
> (https://wiki.apache.org/nutch/). For your specific question, this is the
> most relevant page: https://wiki.apache.org/nutch/AboutPlugins.
>
> One (old) example of writing a custom parser can be found here:
> http://www.treselle.com/blog/apache-nutch-with-custom-parser/. I suggest you
> Google for more information as needed, but always keep in mind that things
> may have changed over time.
>
> I think the best approach for domain-specific parsers is to have a custom
> parser that maps from the URL to the specific code. This can be just one big
> if/else, or a Map of domain->code (possibly using functional programming),
> or you can even have this map configurable in some file.
>
> Once you have more specific questions/problems, I suggest you email
> [email protected]. [email protected] is intended for discussing code
> contributions to Nutch, as far as I understand, and I think fewer people see
> your messages here. (Also, more people will benefit from your questions
> there...)
>
> In summary, from my experience, writing any one of these plugins is really
> easy (discounting your own complex logic, of course), just implementing one
> or a few methods, changing some plugin XML file, and adding your extension
> to the global build (Ant) files. But to really understand how the passed
> data looks, and what you can do with it, debugging (in local mode) is the
> ultimate tool, and in the end is much more time-efficient than looking for
> information on the web. This is partly because a lot of the data is passed
> in Map-like form, so even the JavaDoc doesn't really tell you what will be
> there (it depends on what plugins you have configured, and how you
> configured those plugins...).
>
> Yossi.
>
>
>> -----Original Message-----
>> From: David Ferrero [mailto:[email protected]]
>> Sent: 11 February 2018 04:00
>> To: [email protected]
>> Subject: Custom Parser / Indexer Starting points
>>
>> tl;dr: If I wanted to learn about the Nutch pipeline at a high level, then
>> write a custom parser / indexer of my own, where would a starting point be?
>>
>> I have used the latest 1.x Nutch to crawl a few specific websites and been
>> disappointed with the results. Even after experimenting with the new
>> html-microdata capabilities, with updates to the Any23 project incorporated
>> by Nutch, I am still not (yet) excited. Bottom line is website data is not
>> well structured and not super friendly to algorithmic consumption (but you
>> already knew that). To that end, I am interested in developing custom
>> parsers per internet domain in an effort to capture specific domain data.
>> It currently looks like plugin.includes does not allow a per-domain
>> approach for parser / indexer. I wonder if someone could guide me toward a
>> high level view of the Nutch data pipeline, then guide me towards where to
>> get started for creating custom parsers that might support a per-domain
>> approach?
>>
>> Thanks,
>> David
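
For reference, here is how I currently picture the "Map of domain->code" idea from the quoted message. This is only a rough sketch in plain Java, not Nutch code: the class name DomainDispatch, the register/extract methods, and the "page text in, fields out" handler shape are all my own assumptions for illustration, and none of them come from the Nutch API.

```java
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: routes a page to domain-specific extraction code
// based on the host part of its URL. Not part of Nutch.
final class DomainDispatch {

    // host -> handler; each handler takes the page text and returns extracted fields
    private final Map<String, Function<String, Map<String, String>>> handlers = new HashMap<>();

    // Register the extraction code for one host (stored lower-cased).
    void register(String host, Function<String, Map<String, String>> handler) {
        handlers.put(host.toLowerCase(Locale.ROOT), handler);
    }

    // Look up the handler for the URL's host and apply it.
    // Returns an empty map when the URL has no host or no handler is registered.
    // Note: URI.create throws IllegalArgumentException for a malformed URL.
    Map<String, String> extract(String url, String content) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return Collections.emptyMap();
        }
        Function<String, Map<String, String>> handler =
                handlers.get(host.toLowerCase(Locale.ROOT));
        return handler == null ? Collections.emptyMap() : handler.apply(content);
    }
}
```

If I understand the suggestion correctly, a single filter plugin would hold one such map and call extract with the page URL, so unknown domains simply fall through to the default Nutch behaviour. The map could presumably also be loaded from a file, as suggested, instead of being registered in code.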

