Re: Generic xsl parser plugin

Talat Uyarer Thu, 25 Sep 2014 20:55:03 -0700

One last thing: I wrote a how-to document for it. :)
On Sep 26, 2014 6:52 AM, "Talat Uyarer" <[email protected]> wrote:


> Hi all,
>
> I made some changes to Emir's plugin to make it compatible with 2.x. It is
> useful. If you need it, I can share my fork.
>
> Talat
> On Sep 26, 2014 6:47 AM, "Nima Falaki" <[email protected]> wrote:
>
>> Hi:
>>
>> Yes, it would be very interesting. Let me know what Emir says.
>>
>> Nima
>>
>> On Thu, Sep 25, 2014 at 12:43 PM, Albinscode <[email protected]>
>> wrote:
>>
>>> Oh thanks Nima, I did find this topic last year, but I thought the
>>> project was dead. I think there is a brief reference to it in the Nutch
>>> wiki too, but I cannot find it now.
>>>
>>> It looks like we have the same XSL approach, so it could be interesting
>>> to share. I'll try to contact Emir while I continue documenting my small
>>> plugin.
>>>
>>> Thanks again for the valuable information!
>>>
>>> 2014-09-25 19:19 GMT+02:00 Nima Falaki <[email protected]>:
>>>
>>>> The reason I think this is the following ticket (look at the
>>>> conversation at the bottom between Emmanuel and Lewis John):
>>>>
>>>> https://issues.apache.org/jira/browse/NUTCH-978
>>>>
>>>> On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Julien:
>>>>>
>>>>> I was under the impression that the Nutch community was going to use a
>>>>> generic XSL parser, namely this one:
>>>>> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
>>>>> Is the Nutch community going to use it?
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Albin,
>>>>>>
>>>>>> You don't have to have a separate plugin for each HTML structure you
>>>>>> want to parse. You can have a single plugin with multiple
>>>>>> HTMLParseFilters.
>>>>>>
>>>>>> Having a generic extractor with the extraction logic configured in an
>>>>>> external file is definitely a good idea and would make a great
>>>>>> contribution to the project. In a nutshell, you haven't missed
>>>>>> anything and that wheel definitely needs inventing ;-)
>>>>>>
>>>>>> Best
>>>>>>
>>>>>> Julien
>>>>>>
>>>>>>
>>>>>> On 25 September 2014 09:24, Albin Vigier <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello everybody,
>>>>>>>
>>>>>>> I'm just wondering if it is possible to fetch specific metadata with
>>>>>>> an existing nutch plugin.
>>>>>>>
>>>>>>> Let's take an example.
>>>>>>> I want to extract some metadata from "div" or "td" tags from HTML
>>>>>>> pages that have specific ids and name them the way I like (this is
>>>>>>> done at parser time).
>>>>>>> Then, at indexer time, I would use index-metadata (a very good
>>>>>>> plugin) to add my custom metadata.
>>>>>>>
>>>>>>> Currently from what I've seen on the wiki and by quickly analyzing
>>>>>>> plugins I suppose I have to code my own plugin each time I've got a
>>>>>>> new site (with a new html structure). I've already done that by using
>>>>>>> a node walker in a custom htmlParseFilter but the extraction can be a
>>>>>>> little bit boring :)
>>>>>>>
>>>>>>> So on my side I've coded a little plugin that enables me to specify
>>>>>>> XPaths in an XML file. But before diving into more functionality,
>>>>>>> I'm just wondering whether I have missed something.
>>>>>>> This work allowed me to explore some aspects of Nutch, but I don't
>>>>>>> want to reinvent the wheel or overlook an existing solution.
>>>>>>>
>>>>>>> Albin
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Open Source Solutions for Text Engineering
>>>>>>
>>>>>> http://digitalpebble.blogspot.com/
>>>>>> http://www.digitalpebble.com
>>>>>> http://twitter.com/digitalpebble
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>>
>>>>> Nima Falaki
>>>>> Software Engineer
>>>>> [email protected]
>>>>>
>>>>>
>>>
>>
>>
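
The approach discussed in the thread above (extraction rules kept in an external XML file as XPath expressions, with each result stored under a chosen metadata name) can be illustrated with a minimal sketch. It uses only the JDK's javax.xml classes and does not touch the Nutch API. The class name XPathFieldExtractor, the <rules>/<field name="..." xpath="..."/> file format, the field names and the sample page are all assumptions made for illustration; they are not taken from Albin's or Emir's plugins.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: applies named XPath rules to a DOM.
public class XPathFieldExtractor {

    // Parses a well-formed XML string into a DOM document.
    public static Document parseXml(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Reads rules of the assumed form <field name="..." xpath="..."/>.
    public static Map<String, String> loadRules(String rulesXml) throws Exception {
        Map<String, String> rules = new LinkedHashMap<>();
        NodeList fields = parseXml(rulesXml).getElementsByTagName("field");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            rules.put(field.getAttribute("name"), field.getAttribute("xpath"));
        }
        return rules;
    }

    // Evaluates every rule against the page; each result is the trimmed
    // text of the first matching node, or "" when nothing matches.
    public static Map<String, String> extract(Document page,
                                              Map<String, String> rules)
            throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            String value = (String) xpath.evaluate(
                rule.getValue(), page, XPathConstants.STRING);
            metadata.put(rule.getKey(), value.trim());
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        String rulesXml = "<rules>"
            + "<field name=\"product.title\" xpath=\"//div[@id='title']\"/>"
            + "<field name=\"product.price\" xpath=\"//td[@id='price']\"/>"
            + "</rules>";
        String page = "<html><body><div id=\"title\"> Blue Widget </div>"
            + "<table><tr><td id=\"price\">9.99</td></tr></table>"
            + "</body></html>";
        // prints {product.title=Blue Widget, product.price=9.99}
        System.out.println(extract(parseXml(page), loadRules(rulesXml)));
    }
}
```

Real pages are rarely well-formed XML, so a plugin would presumably run the rules against the DOM that Nutch's HTML parser already builds rather than re-parsing the raw content as done here. Supporting a new site then means adding a rule file, not writing a new plugin.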
