Hi:

Yes, it would be very interesting. Let me know what Emir says.

Nima

On Thu, Sep 25, 2014 at 12:43 PM, Albinscode <[email protected]> wrote:

> Oh thanks Nima, I did find this topic last year but I thought the project
> was dead. I think there is a little reference in the nutch wiki too, but I
> cannot find it now.
>
> It looks like we have the same xsl approach so it can be interesting to
> share. I'll try to contact Emir while I continue documenting my small
> plugin.
>
> Thanks again for the valuable information!
>
> 2014-09-25 19:19 GMT+02:00 Nima Falaki <[email protected]>:
>
>> And the reason why I think this is because of this ticket (look at the
>> conversation at the bottom between Emmanuel and Lewis John):
>>
>> https://issues.apache.org/jira/browse/NUTCH-978
>>
>> On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <[email protected]>
>> wrote:
>>
>>> Hi Julien:
>>>
>>> I was under the impression that the nutch community was going to use a
>>> generic xls parser? This one:
>>> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
>>> Is the nutch community going to use this?
>>>
>>> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
>>> [email protected]> wrote:
>>>
>>>> Hi Albin,
>>>>
>>>> You don't have to have a separate plugin for each html structure you
>>>> want to parse. You can have a single plugin with multiple
>>>> HTMLParseFilters.
>>>>
>>>> Having a generic extractor with the extraction logic configured in an
>>>> external file is definitely a good idea and would make a great
>>>> contribution to the project. In a nutshell, you haven't missed
>>>> anything and that wheel definitely needs inventing ;-)
>>>>
>>>> Best
>>>>
>>>> Julien
>>>>
>>>> On 25 September 2014 09:24, Albin Vigier <[email protected]> wrote:
>>>>
>>>>> Hello everybody,
>>>>>
>>>>> I'm just wondering if it is possible to fetch specific metadata with
>>>>> an existing nutch plugin.
>>>>>
>>>>> Let's take an example.
>>>>> I want to extract some metadata from "div" or "td" tags from html
>>>>> pages that have specific ids and name them the way I like (this is
>>>>> done at parser time).
>>>>> Then, at indexer time, I would use index-metadata (a very good plugin)
>>>>> to add my custom metadata.
>>>>>
>>>>> Currently, from what I've seen on the wiki and by quickly analyzing
>>>>> plugins, I suppose I have to code my own plugin each time I've got a
>>>>> new site (with a new html structure). I've already done that by using
>>>>> a node walker in a custom htmlParseFilter but the extraction can be a
>>>>> little bit boring :)
>>>>>
>>>>> So on my side I've coded a little plugin that enables me to specify
>>>>> xpaths in an xml file. But before diving into more functionalities I'm
>>>>> just wondering if I did not miss something.
>>>>> This work allowed me to explore some nutch aspects but I don't want to
>>>>> reinvent the wheel or miss something.
>>>>>
>>>>> Albin
>>>>
>>>> --
>>>>
>>>> Open Source Solutions for Text Engineering
>>>>
>>>> http://digitalpebble.blogspot.com/
>>>> http://www.digitalpebble.com
>>>> http://twitter.com/digitalpebble
>>>
>>> --
>>>
>>> Nima Falaki
>>> Software Engineer
>>> [email protected]
>>
>> --
>>
>> Nima Falaki
>> Software Engineer
>> [email protected]
>

--
Nima Falaki
Software Engineer
[email protected]
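
For anyone following the thread, the idea discussed above (named XPath rules kept in an external XML file, applied at parse time to produce custom metadata) can be sketched roughly as follows. This is not Albin's plugin and not Nutch code: it is a standalone Python illustration using only the standard library. The rules format, the field names, and the sample page are all hypothetical, and the page is assumed to be well-formed XML; real-world HTML would first need a tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules file: each <field> maps a metadata name to a path
# expression. This layout is an assumption made for the sketch.
RULES_XML = """
<rules>
  <field name="title" xpath=".//div[@id='product-title']"/>
  <field name="price" xpath=".//td[@id='price']"/>
</rules>
"""

# Hypothetical, well-formed sample page.
PAGE = """
<html><body>
  <div id="product-title">Blue Widget</div>
  <table><tr><td id="price">9.99</td></tr></table>
</body></html>
"""


def load_rules(rules_xml):
    """Read the rules document into a {metadata name: path expression} dict."""
    root = ET.fromstring(rules_xml.strip())
    return {field.get("name"): field.get("xpath") for field in root.findall("field")}


def extract_metadata(page_xml, rules):
    """Apply every rule to the page; keep the text of the first matching node."""
    doc = ET.fromstring(page_xml.strip())
    metadata = {}
    for name, xpath in rules.items():
        node = doc.find(xpath)  # ElementTree supports a limited XPath subset
        if node is not None:
            metadata[name] = "".join(node.itertext()).strip()
    return metadata


metadata = extract_metadata(PAGE, load_rules(RULES_XML))
print(metadata)
```

With this shape, supporting a new site means adding a rules file rather than writing a new plugin, which is the point Julien makes about a generic extractor configured externally. Fields whose rule matches nothing are simply left out of the result.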

