Hi all,

I made some changes to Emir's plugin to make it compatible with Nutch 2.x. It is useful; if you need it, I can share my fork.
Talat

On Sep 26, 2014 6:47 AM, "Nima Falaki" <nfal...@popsugar.com> wrote:

> Hi:
>
> Yes, it would be very interesting. Let me know what Emir says
>
> Nima
>
> On Thu, Sep 25, 2014 at 12:43 PM, Albinscode <albinsc...@gmail.com> wrote:
>
>> Oh thanks Nima, I did found this topic last year but I thought the
>> project was dead. I think there is a little reference in the nutch wiki
>> too I cannot find it now.
>>
>> It looks like we have the same xsl approach so it can be interesting to
>> share. I'll try to contact Emir while continuing documenting my small
>> plugin.
>>
>> Thanks again for the valuable information!
>>
>> 2014-09-25 19:19 GMT+02:00 Nima Falaki <nfal...@popsugar.com>:
>>
>>> And the reason why I think this is because of this ticket (Look at the
>>> conversation at the bottom between Emmanuel and Lewis John)
>>>
>>> https://issues.apache.org/jira/browse/NUTCH-978
>>>
>>> On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <nfal...@popsugar.com>
>>> wrote:
>>>
>>>> Hi Julien:
>>>>
>>>> I was under the impression that the nutch community was going to use a
>>>> generic xls parser? This one.
>>>> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
>>>> Is the nutch community going to use this?
>>>>
>>>> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
>>>> lists.digitalpeb...@gmail.com> wrote:
>>>>
>>>>> Hi Albin,
>>>>>
>>>>> You don't have to have a separate plugin for each html structure you
>>>>> want to parse. You can have a single plugin with multiple
>>>>> HTMLParseFilters.
>>>>>
>>>>> Having a generic extractor with the extraction logic configured in an
>>>>> external file is definitely a good idea and would make a great
>>>>> contribution to the project.
>>>>> In a nutshell, you haven't missed anything and that wheel
>>>>> definitely needs inventing ;-)
>>>>>
>>>>> Best
>>>>>
>>>>> Julien
>>>>>
>>>>> On 25 September 2014 09:24, Albin Vigier <albinsc...@gmail.com> wrote:
>>>>>
>>>>>> Hello everybody,
>>>>>>
>>>>>> I'm just wondering if it is possible to fetch specific metadata with
>>>>>> an existing nutch plugin.
>>>>>>
>>>>>> Let's take an example.
>>>>>> I want to extract some metadata from "div" or "td" tags from html
>>>>>> pages that have specific ids and name them the way I like (this is
>>>>>> done at parser time).
>>>>>> Then, at indexer time, I would use index-metadata (a very good plugin)
>>>>>> to add my custom metadata.
>>>>>>
>>>>>> Currently from what I've seen on the wiki and by quickly analyzing
>>>>>> plugins I suppose I have to code my own plugin each time I've got a
>>>>>> new site (with a new html structure). I've already done that by using
>>>>>> a node walker in a custom htmlParseFilter but the extraction can be a
>>>>>> little bit boring :)
>>>>>>
>>>>>> So on my side i've coded a little plugin that enables me to specify
>>>>>> xpaths in an xml file. But before diving into more functionalities I'm
>>>>>> just wondering if I did not missed something.
>>>>>> This work allowed me to explore some nutch aspects but I don't want to
>>>>>> reinvent the wheel or miss something.
>>>>>>
>>>>>> Albin
>>>>>
>>>>> --
>>>>> Open Source Solutions for Text Engineering
>>>>>
>>>>> http://digitalpebble.blogspot.com/
>>>>> http://www.digitalpebble.com
>>>>> http://twitter.com/digitalpebble
>>>>
>>>> --
>>>> Nima Falaki
>>>> Software Engineer
>>>> nfal...@popsugar.com
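
For anyone following the thread, the idea discussed above (field names mapped to XPath expressions kept outside the code, then evaluated against the parsed page) might look roughly like the sketch below. It uses only the JDK's javax.xml.xpath and is not Emir's plugin, Albin's plugin, or any Nutch API. The class name, field names, XPath rules and sample markup are all hypothetical, and the rules are hard-coded here where a real plugin would presumably load them from its XML file. Inside Nutch the DOM would be supplied to the parse filter by the HTML parser rather than built from a string.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical standalone sketch; not part of Nutch or of any plugin in this thread.
public class XPathMetadataSketch {

    // Evaluates each (field name -> XPath expression) rule against the DOM
    // and returns (field name -> extracted text). Rules that match nothing
    // are skipped, so they never produce empty metadata entries.
    static Map<String, String> extract(Document dom, Map<String, String> rules)
            throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            // evaluate(String, Object) returns the string value of the first match,
            // or an empty string when nothing matches.
            String value = xpath.evaluate(rule.getValue(), dom).trim();
            if (!value.isEmpty()) {
                metadata.put(rule.getKey(), value);
            }
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample page: well-formed markup so the JDK XML parser accepts it.
        String page = "<html><body>"
                + "<div id=\"price\">42 EUR</div>"
                + "<table><tr><td id=\"author\">Jane Doe</td></tr></table>"
                + "</body></html>";
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(page)));

        // Assumed rules; in the approach described above these would live in an XML file.
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("product_price", "//div[@id='price']");
        rules.put("page_author", "//td[@id='author']");
        rules.put("missing_field", "//div[@id='nope']");

        System.out.println(extract(dom, rules));
    }
}
```

Supporting a new site would then only mean adding rules, not writing another node walker. The extracted map could be copied into the parse metadata, so that index-metadata can pick the fields up at indexing time, as Albin describes.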