Hi Julien,

I was under the impression that the Nutch community was going to use a generic xls parser, namely this one:

http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/

Is the Nutch community going to use this?
On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <[email protected]> wrote:

> Hi Albin,
>
> You don't have to have a separate plugin for each html structure you want
> to parse. You can have a single plugin with multiple HTMLParseFilters.
>
> Having a generic extractor with the extraction logic configured in an
> external file is definitely a good idea and would make a great contribution
> to the project. In a nutshell, you haven't missed anything and that wheel
> definitely needs inventing ;-)
>
> Best
>
> Julien
>
>
> On 25 September 2014 09:24, Albin Vigier <[email protected]> wrote:
>
>> Hello everybody,
>>
>> I'm just wondering if it is possible to fetch specific metadata with
>> an existing nutch plugin.
>>
>> Let's take an example.
>> I want to extract some metadata from "div" or "td" tags from html
>> pages that have specific ids and name them the way I like (this is
>> done at parser time).
>> Then, at indexer time, I would use index-metadata (a very good plugin)
>> to add my custom metadata.
>>
>> Currently from what I've seen on the wiki and by quickly analyzing
>> plugins I suppose I have to code my own plugin each time I've got a
>> new site (with a new html structure). I've already done that by using
>> a node walker in a custom htmlParseFilter but the extraction can be a
>> little bit boring :)
>>
>> So on my side I've coded a little plugin that enables me to specify
>> xpaths in an xml file. But before diving into more functionalities I'm
>> just wondering if I did not miss something.
>> This work allowed me to explore some nutch aspects but I don't want to
>> reinvent the wheel or miss something.
>>
>> Albin
>>
>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>

--
Nima Falaki
Software Engineer
[email protected]
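
For anyone following the thread, here is a rough sketch of the idea Albin describes: a table of named XPath rules, kept apart from the code, applied to a page to produce custom metadata fields. It is not Albin's plugin and it does not touch any Nutch API; it only uses the JDK's javax.xml.xpath. The class name, field names, ids and sample page are all made up for illustration, and the sample page is assumed to be well-formed XML, which real HTML usually is not. In an actual plugin the rules would presumably be loaded from the external XML file instead of being hard-coded.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration only: not part of Nutch and not Albin's plugin.
public class XPathMetadataSketch {

    /**
     * Evaluates each named XPath rule against the page and returns
     * field name -> extracted text. Rules that match nothing are skipped.
     */
    public static Map<String, String> extract(String xhtml, Map<String, String> rules)
            throws Exception {
        // Assumption: the input is well-formed XML (real pages need an HTML-tolerant parser).
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            // evaluate(String, Object) returns the string value of the first matching node.
            String value = xpath.evaluate(rule.getValue(), doc).trim();
            if (!value.isEmpty()) {
                metadata.put(rule.getKey(), value);
            }
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        // Made-up page with a "div" and a "td" carrying specific ids.
        String page = "<html><body>"
                + "<div id=\"price\">42 EUR</div>"
                + "<table><tr><td id=\"sku\">AB-123</td></tr></table>"
                + "</body></html>";

        // Made-up rules: metadata field name -> XPath expression.
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("product_price", "//div[@id='price']");
        rules.put("product_sku", "//td[@id='sku']");

        System.out.println(extract(page, rules));
    }
}
```

Keeping the rules in a map keyed by field name means a new site only needs a new set of rules, not new Java code, which is the point Julien makes about a generic extractor.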

