Thank you for all the tips. I think I need to better understand the pipeline of
parsers and whether/how their order in plugin.includes matters.

> On Feb 11, 2018, at 1:18 AM, Yossi Tamari <yossi.tam...@pipl.com> wrote:
> 
> Hi David,
> 
> The interfaces related to extending Nutch parser/indexer are actually very
> simple. However, finding up-to-date documented samples is not. Luckily,
> Nutch comes with plenty of built-in plugins, so my suggestion would be to pick one, and
> dive into its implementation. Then just copy its folder and use it as a
> skeleton, replacing the specific logic (and plugin metadata).
> 
> The first question you need to ask yourself is if you really want to write a
> Parser/Indexer or just a HtmlParseFilter/IndexingFilter. I suspect that the
> default behaviour of the Nutch Parser and Indexer is useful for you, and you
> just want to add more functionality (that is what Any23 is doing). You can
> chain Filters, so your code could also leverage the Any23 logic, for
> example.
> 
> The documentation starting point is the Wiki
> (https://wiki.apache.org/nutch/). For your specific question, this is the
> most relevant page: https://wiki.apache.org/nutch/AboutPlugins.
> 
> One (old) example of writing a custom parser can be found here:
> http://www.treselle.com/blog/apache-nutch-with-custom-parser/. I suggest you
> Google for more information as needed, but always keep in mind that things
> may have changed over time.
> 
> I think the best approach for domain-specific parsers is to have a custom
> parser that maps from the URL to the specific code. This can be just one big
> if/else, or a Map of domain->code (possibly using functional programming),
> or you can even have this map configurable in some file.
> 
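A minimal sketch of the "Map of domain->code" idea suggested above, in plain
Java. Every name here (DomainDispatcher, register, extract) is hypothetical and
not part of Nutch; it only illustrates the lookup. In a real plugin the
dispatch would presumably sit inside the filter's own method and work on
whatever objects Nutch passes in, rather than on a raw HTML string.

```java
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical per-domain dispatcher; none of these names come from Nutch. */
public class DomainDispatcher {
    // Host name -> extraction logic for that site (assumed shape: HTML in, fields out).
    private final Map<String, Function<String, Map<String, String>>> handlers = new HashMap<>();

    /** Registers the handler for one host; host names are compared case-insensitively. */
    public void register(String host, Function<String, Map<String, String>> handler) {
        handlers.put(host.toLowerCase(Locale.ROOT), handler);
    }

    /**
     * Returns the extra fields produced by the handler registered for the URL's host,
     * or an empty map when no handler matches. URI.create throws
     * IllegalArgumentException on a malformed URL; real code would need to handle that.
     */
    public Map<String, String> extract(String url, String html) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return Collections.emptyMap();
        }
        Function<String, Map<String, String>> handler =
                handlers.get(host.toLowerCase(Locale.ROOT));
        return handler == null ? Collections.<String, String>emptyMap() : handler.apply(html);
    }

    public static void main(String[] args) {
        DomainDispatcher dispatcher = new DomainDispatcher();
        // Placeholder logic: a real handler would pull site-specific data out of the page.
        dispatcher.register("example.com",
                html -> Collections.singletonMap("length", String.valueOf(html.length())));
        System.out.println(dispatcher.extract("http://example.com/page", "<html></html>"));
        System.out.println(dispatcher.extract("http://other.org/", "<html></html>"));
    }
}
```

Loading the host-to-handler table from a file, as also suggested, would only
change how register gets called; the lookup itself stays the same.
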
> Once you have more specific questions/problems, I suggest you email
> u...@nutch.apache.org. dev@nutch.apache.org is intended for discussing code
> contributions to Nutch, as far as I understand, and I think fewer people see
> your messages here. (Also, more people will benefit from your questions
> there...)
> 
> In summary, from my experience, writing any one of these plugins is really
> easy (discounting your own complex logic, of course), just implementing one
> or a few methods, changing some plugin XML file, and adding your extension
> to the global build (Ant) files. But to really understand how the passed
> data looks, and what you can do with it, debugging (in local mode) is the
> ultimate tool, and in the end is much more time-efficient than looking for
> information on the web. This is partly because a lot of the data is passed
> in Map-like form, so even the JavaDoc doesn't really tell you what will be
> there (it depends on what plugins you have configured, and how you
> configured those plugins...).
> 
>    Yossi.
> 
> 
>> -----Original Message-----
>> From: David Ferrero [mailto:david.ferr...@zion.com]
>> Sent: 11 February 2018 04:00
>> To: dev@nutch.apache.org
>> Subject: Custom Parser / Indexer Starting points
>> 
>> TL;DR: If I wanted to learn about the Nutch pipeline at a high level, then
>> write a custom parser / indexer of my own, where would a starting point be?
>> 
>> I have used the latest 1.x Nutch to crawl a few specific websites and been
>> disappointed with the results. Even after experimenting with the new
>> html-microdata capabilities from the Any23 updates incorporated by Nutch, I
>> am still not (yet) excited. Bottom line is website data is not well
>> structured and not super friendly to algorithmic consumption (but you
>> already knew that). To that end, I am interested in developing custom
>> parsers per internet domain in an effort to capture specific domain data.
>> It currently looks like plugin.includes does not allow a per-domain
>> approach for parser / indexer. I wonder if someone could guide me toward a
>> high-level view of the Nutch data pipeline, then guide me towards where to
>> get started creating custom parsers that might support a per-domain
>> approach?
>> 
>> Thanks,
>> David
> 
