Hi Albin, the plugin looks very nice! I like the clean and extensible way fields are filled by XPath statements. Using XSLT functions to cleanse the extracted text (you can hardly ever do without that!) is an excellent idea!
I hope to find the time soon to look at it in more detail and give it a trial. Even more, I would like to see the plugin as part of Nutch. Are you willing to open a Jira issue for it and provide a patch?

Thanks a lot,
Sebastian

On 10/02/2014 10:26 AM, Albinscode wrote:
> Hi all,
>
> I've created two posts on my blog to describe and use the xsl plugin:
> http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/
> http://albinscoding.wordpress.com/2014/09/17/fast-nutch-configuration/
>
> The source code is available on
> https://code.google.com/p/nutch-parse-xsl-plugin/.
> I'll update the google code wiki to gather information from my blog.
>
> If you have any comments, feel free.
> As I'm currently using it to crawl different web sites related to searching
> friends, I'll have lots of examples to provide.
>
> Have a nice day!
>
> Albin
>
> 2014-09-25 16:18 GMT+02:00 Albin Vigier <[email protected]>:
>
>> Ok, perfect, so I didn't waste my time. I'm finishing my basic
>> implementation for my own needs and I'll post it to google code or
>> another repo if the community is interested.
>> I'll work on a small doc too.
>> Thank you for your answer.
>>
>> On Thu, Sep 25, 2014 at 2:49 PM, Julien Nioche <[email protected]> wrote:
>>
>>> Hi Albin,
>>>
>>> You don't have to have a separate plugin for each html structure you
>>> want to parse. You can have a single plugin with multiple
>>> HTMLParseFilters.
>>>
>>> Having a generic extractor with the extraction logic configured in an
>>> external file is definitely a good idea and would make a great
>>> contribution to the project. In a nutshell, you haven't missed
>>> anything and that wheel definitely needs inventing ;-)
>>>
>>> Best
>>>
>>> Julien
>>>
>>> On 25 September 2014 09:24, Albin Vigier <[email protected]> wrote:
>>>
>>>> Hello everybody,
>>>>
>>>> I'm just wondering if it is possible to fetch specific metadata
>>>> with an existing nutch plugin.
>>>>
>>>> Let's take an example.
>>>> I want to extract some metadata from "div" or "td" tags from html
>>>> pages that have specific ids and name them the way I like (this is
>>>> done at parser time).
>>>> Then, at indexer time, I would use index-metadata (a very good
>>>> plugin) to add my custom metadata.
>>>>
>>>> Currently, from what I've seen on the wiki and by quickly analyzing
>>>> plugins, I suppose I have to code my own plugin each time I've got a
>>>> new site (with a new html structure). I've already done that by
>>>> using a node walker in a custom htmlParseFilter but the extraction
>>>> can be a little bit boring :)
>>>>
>>>> So on my side I've coded a little plugin that enables me to specify
>>>> xpaths in an xml file. But before diving into more functionalities
>>>> I'm just wondering if I did not miss something.
>>>> This work allowed me to explore some nutch aspects but I don't want
>>>> to reinvent the wheel or miss something.
>>>>
>>>> Albin
>>>
>>> --
>>> Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble

