tl;dr: If I wanted to learn about the Nutch pipeline at a high level, and then 
write a custom parser / indexer of my own, where would be a good starting point?

I have used the latest Nutch 1.x to crawl a few specific websites and have been 
disappointed with the results. Even after experimenting with the new 
html-microdata capabilities that Nutch gained through updates to the Any23 
project, I am still not (yet) excited. The bottom line is that website data is 
not well structured and not very friendly to algorithmic consumption (but you 
already knew that). To that end, I am interested in developing custom parsers 
per internet domain in an effort to capture domain-specific data. It currently 
looks like plugin.includes does not allow a per-domain approach for parsers / 
indexers. Could someone point me toward a high-level view of the Nutch data 
pipeline, and then toward where to get started on creating custom parsers that 
might support a per-domain approach?
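
To illustrate what I mean by a per-domain approach, here is a minimal sketch 
in plain Java. It does not use the Nutch plugin API at all; the class name 
DomainParserRegistry, its methods, and the example hosts are all hypothetical 
and only show the kind of host-based dispatch I have in mind, with a fallback 
to a default parser for unregistered domains.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch only: plain Java, NOT the Nutch plugin API.
// A "parser" here is just a function from raw HTML to extracted fields.
final class DomainParserRegistry {
    private final Map<String, Function<String, Map<String, String>>> parsers = new HashMap<>();
    private final Function<String, Map<String, String>> fallback;

    DomainParserRegistry(Function<String, Map<String, String>> fallback) {
        this.fallback = fallback;
    }

    // Register a parser for one exact host name (case-insensitive).
    void register(String host, Function<String, Map<String, String>> parser) {
        parsers.put(host.toLowerCase(Locale.ROOT), parser);
    }

    // Pick the parser registered for the URL's host, or the fallback if
    // the host is unknown or the URL has no host component.
    Map<String, String> parse(String url, String html) {
        String host = URI.create(url).getHost();
        Function<String, Map<String, String>> parser = (host == null)
                ? fallback
                : parsers.getOrDefault(host.toLowerCase(Locale.ROOT), fallback);
        return parser.apply(html);
    }
}
```

In this sketch, matching is on the exact host name only, so subdomains would 
need their own entries or a suffix-matching rule.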

Thanks,
David
