One last thing: I also wrote a how-to-use document. :)

On Sep 26, 2014 6:52 AM, "Talat Uyarer" <[email protected]> wrote:
> Hi all,
>
> I made some changes to Emir's plugin to make it compatible with 2.x. That is
> useful. If you need it, I can share my fork.
>
> Talat
> On Sep 26, 2014 6:47 AM, "Nima Falaki" <[email protected]> wrote:
>
>> Hi:
>>
>> Yes, it would be very interesting. Let me know what Emir says
>>
>> Nima
>>
>> On Thu, Sep 25, 2014 at 12:43 PM, Albinscode <[email protected]>
>> wrote:
>>
>>> Oh thanks Nima, I did find this topic last year but I thought the
>>> project was dead. I think there is a brief reference in the nutch wiki
>>> too, but I cannot find it now.
>>>
>>> It looks like we have the same xsl approach so it can be interesting to
>>> share. I'll try to contact Emir while continuing documenting my small
>>> plugin.
>>>
>>> Thanks again for the valuable information!
>>>
>>> 2014-09-25 19:19 GMT+02:00 Nima Falaki <[email protected]>:
>>>
>>>> And the reason why I think this is because of this ticket (Look at the
>>>> conversation at the bottom between Emmanuel and Lewis John)
>>>>
>>>> https://issues.apache.org/jira/browse/NUTCH-978
>>>>
>>>> On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Julien:
>>>>>
>>>>> I was under the impression that the nutch community was going to use a
>>>>> generic xls parser? This one.
>>>>> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/ Is
>>>>> the nutch community going to use this?
>>>>>
>>>>> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Albin,
>>>>>>
>>>>>> You don't have to have a separate plugin for each html structure you
>>>>>> want to parse. You can have a single plugin with multiple
>>>>>> HTMLParseFilters.
>>>>>>
>>>>>> Having a generic extractor with the extraction logic configured in an
>>>>>> external file is definitely a good idea and would make a great
>>>>>> contribution to the project.
>>>>>> In a nutshell, you haven't missed anything and that wheel
>>>>>> definitely needs inventing ;-)
>>>>>>
>>>>>> Best
>>>>>>
>>>>>> Julien
>>>>>>
>>>>>> On 25 September 2014 09:24, Albin Vigier <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello everybody,
>>>>>>>
>>>>>>> I'm just wondering if it is possible to fetch specific metadata with
>>>>>>> an existing nutch plugin.
>>>>>>>
>>>>>>> Let's take an example.
>>>>>>> I want to extract some metadata from "div" or "td" tags from html
>>>>>>> pages that have specific ids and name them the way I like (this is
>>>>>>> done at parser time).
>>>>>>> Then, at indexer time, I would use index-metadata (a very good
>>>>>>> plugin) to add my custom metadata.
>>>>>>>
>>>>>>> Currently, from what I've seen on the wiki and by quickly analyzing
>>>>>>> plugins, I suppose I have to code my own plugin each time I've got a
>>>>>>> new site (with a new html structure). I've already done that by using
>>>>>>> a node walker in a custom htmlParseFilter but the extraction can be a
>>>>>>> little bit boring :)
>>>>>>>
>>>>>>> So on my side I've coded a little plugin that enables me to specify
>>>>>>> xpaths in an xml file. But before diving into more functionalities
>>>>>>> I'm just wondering if I did not miss something.
>>>>>>> This work allowed me to explore some nutch aspects but I don't want
>>>>>>> to reinvent the wheel or miss something.
>>>>>>>
>>>>>>> Albin
>>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Open Source Solutions for Text Engineering
>>>>>>
>>>>>> http://digitalpebble.blogspot.com/
>>>>>> http://www.digitalpebble.com
>>>>>> http://twitter.com/digitalpebble
>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Nima Falaki
>>>>> Software Engineer
>>>>> [email protected]
>>>>>
>>>>
>>>> --
>>>>
>>>> Nima Falaki
>>>> Software Engineer
>>>> [email protected]
>>>>
>>
>> --
>>
>> Nima Falaki
>> Software Engineer
>> [email protected]
>>
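
For anyone reading this thread later, the idea discussed above (XPaths declared in an external XML file, each mapped to a metadata name of your choice) can be sketched roughly as follows. This is only an illustration, not Albin's or Emir's actual plugin: it uses the JDK's javax.xml classes rather than any Nutch API, and the rules format (`<field name="..." xpath="..."/>`), the class name `XPathMetadataSketch`, and the sample page are all made up for the example. It also assumes the page is well-formed XHTML; inside a real parse filter you would most likely apply the expressions to the DOM that the parser has already built.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: reads extraction rules from an XML string and applies
// them to a well-formed page. Not part of Nutch.
class XPathMetadataSketch {

    /** Parses a well-formed XML string into a DOM document. */
    public static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /**
     * Evaluates every <field name="..." xpath="..."/> rule against the page
     * and returns name -> extracted text. Rules that match nothing are skipped.
     */
    public static Map<String, String> extract(String rulesXml, String pageXml) {
        Document rules = parse(rulesXml);
        Document page = parse(pageXml);
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> metadata = new LinkedHashMap<>();

        NodeList fields = rules.getElementsByTagName("field");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            String name = field.getAttribute("name");
            String expression = field.getAttribute("xpath");
            try {
                // evaluate() returns the string value of the first matching
                // node, or an empty string when nothing matches.
                String value = xpath.evaluate(expression, page).trim();
                if (!value.isEmpty()) {
                    metadata.put(name, value);
                }
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException(
                    "Bad XPath for field '" + name + "': " + expression, e);
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        // Made-up rules file content: one rule per metadata field.
        String rules =
            "<rules>"
            + "<field name=\"price\" xpath=\"//div[@id='price']\"/>"
            + "<field name=\"author\" xpath=\"//td[@id='author']\"/>"
            + "</rules>";
        // Made-up page; must be well-formed for this sketch.
        String page =
            "<html><body><div id=\"price\"> 42 EUR </div>"
            + "<table><tr><td id=\"author\">Albin</td></tr></table>"
            + "</body></html>";
        System.out.println(extract(rules, page));
    }
}
```

With something like this, supporting a new site means adding rules to the external file instead of writing a new plugin, and the resulting names could then be put into the parse metadata for index-metadata to pick up at indexing time.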

