The reason I think this is the following ticket (see the conversation at the bottom between Emmanuel and Lewis John):
https://issues.apache.org/jira/browse/NUTCH-978

On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <[email protected]> wrote:

> Hi Julien:
>
> I was under the impression that the nutch community was going to use a
> generic xls parser? This one.
> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/ Is the
> nutch community going to use this?
>
> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
> [email protected]> wrote:
>
>> Hi Albin,
>>
>> You don't have to have a separate plugin for each html structure you want
>> to parse. You can have a single plugin with multiple HTMLParseFilters.
>>
>> Having a generic extractor with the extraction logic configured in an
>> external file is definitely a good idea and would make a great contribution
>> to the project. In a nutshell, you haven't missed anything and that wheel
>> definitely needs inventing ;-)
>>
>> Best
>>
>> Julien
>>
>> On 25 September 2014 09:24, Albin Vigier <[email protected]> wrote:
>>
>>> Hello everybody,
>>>
>>> I'm just wondering if it is possible to fetch specific metadata with
>>> an existing nutch plugin.
>>>
>>> Let's take an example.
>>> I want to extract some metadata from "div" or "td" tags from html
>>> pages that have specific ids and name them the way I like (this is
>>> done at parser time).
>>> Then, at indexer time, I would use index-metadata (a very good plugin)
>>> to add my custom metadata.
>>>
>>> Currently from what I've seen on the wiki and by quickly analyzing
>>> plugins I suppose I have to code my own plugin each time I've got a
>>> new site (with a new html structure). I've already done that by using
>>> a node walker in a custom htmlParseFilter but the extraction can be a
>>> little bit boring :)
>>>
>>> So on my side i've coded a little plugin that enables me to specify
>>> xpaths in an xml file. But before diving into more functionalities I'm
>>> just wondering if I did not missed something.
>>> This work allowed me to explore some nutch aspects but I don't want to
>>> reinvent the wheel or miss something.
>>>
>>> Albin
>>>
>>
>> --
>>
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
>
> --
>
> Nima Falaki
> Software Engineer
> [email protected]
>

--
Nima Falaki
Software Engineer
[email protected]

