OK, perfect, so I didn't waste my time. I'm finishing a basic implementation for my own needs and I'll post it to Google Code or another repository if the community is interested. I'll write a short doc too. Thank you for your answer.
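
To give an idea of the approach discussed in this thread (XPath expressions kept in an external XML file and mapped to metadata names), here is a minimal standalone sketch. It is not the plugin code itself and it deliberately uses no Nutch classes: the class name `XPathMetadataSketch`, the `<extractor>`/`<field>` config layout, and the field names are all hypothetical, and the sample page is well-formed markup parsed with the JDK XML parser, whereas a real parse filter would work on the DOM that Nutch already built for the page.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XPathMetadataSketch {

    // Parse a well-formed XML string into a DOM document.
    static Document parseXml(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Read the (hypothetical) config format: one <field name="..." xpath="..."/> per rule.
    static Map<String, String> loadRules(String configXml) throws Exception {
        Map<String, String> rules = new LinkedHashMap<>();
        NodeList fields = parseXml(configXml).getElementsByTagName("field");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            rules.put(field.getAttribute("name"), field.getAttribute("xpath"));
        }
        return rules;
    }

    // Evaluate every rule against the page and keep the non-empty results.
    static Map<String, String> extract(Document page, Map<String, String> rules) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            // evaluate(expression, item) returns the string value of the first match.
            String value = xpath.evaluate(rule.getValue(), page).trim();
            if (!value.isEmpty()) {
                metadata.put(rule.getKey(), value);
            }
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        String config = "<extractor>"
            + "<field name='product.price' xpath=\"//td[@id='price']\"/>"
            + "<field name='product.title' xpath=\"//div[@id='title']\"/>"
            + "</extractor>";
        String page = "<html><body><div id='title'>Blue widget</div>"
            + "<table><tr><td id='price'>9.99</td></tr></table></body></html>";
        System.out.println(extract(parseXml(page), loadRules(config)));
    }
}
```

With this shape, supporting a new site means adding `<field>` entries to the config file rather than writing a new node walker, and the extracted names can then be picked up at indexing time by index-metadata.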

On Thu, Sep 25, 2014 at 2:49 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> Hi Albin,
>
> You don't have to have a separate plugin for each HTML structure you want
> to parse. You can have a single plugin with multiple HTMLParseFilters.
>
> Having a generic extractor with the extraction logic configured in an
> external file is definitely a good idea and would make a great contribution
> to the project. In a nutshell, you haven't missed anything and that wheel
> definitely needs inventing ;-)
>
> Best
>
> Julien
>
> On 25 September 2014 09:24, Albin Vigier <albinsc...@gmail.com> wrote:
>
>> Hello everybody,
>>
>> I'm just wondering if it is possible to fetch specific metadata with
>> an existing Nutch plugin.
>>
>> Let's take an example.
>> I want to extract some metadata from "div" or "td" tags in HTML
>> pages that have specific ids, and name them the way I like (this is
>> done at parse time).
>> Then, at index time, I would use index-metadata (a very good plugin)
>> to add my custom metadata.
>>
>> Currently, from what I've seen on the wiki and by quickly analyzing the
>> plugins, I suppose I have to code my own plugin each time I've got a
>> new site (with a new HTML structure). I've already done that by using
>> a node walker in a custom htmlParseFilter, but the extraction can be a
>> little bit boring :)
>>
>> So on my side I've coded a little plugin that enables me to specify
>> XPaths in an XML file. But before diving into more functionality I'm
>> just wondering whether I have missed something.
>> This work allowed me to explore some aspects of Nutch, but I don't want to
>> reinvent the wheel or miss something.
>>
>> Albin
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble