Re: Generic xsl parser plugin

Talat Uyarer Thu, 25 Sep 2014 20:53:06 -0700

Hi all,

I made some changes to Emir's plugin to make it compatible with 2.x. It is useful.
If you need it, I can share my fork.


Talat
On Sep 26, 2014 6:47 AM, "Nima Falaki" <[email protected]> wrote:

> Hi:
>
> Yes, it would be very interesting. Let me know what Emir says
>
> Nima
>
> On Thu, Sep 25, 2014 at 12:43 PM, Albinscode <[email protected]> wrote:
>
>> Oh thanks Nima, I did find this topic last year but I thought the
>> project was dead. I think there is a brief reference in the nutch wiki too,
>> but I cannot find it now.
>>
>> It looks like we have the same xsl approach so it can be interesting to
>> share. I'll try to contact Emir while continuing documenting my small
>> plugin.
>>
>> Thanks again for the valuable information!
>>
>> 2014-09-25 19:19 GMT+02:00 Nima Falaki <[email protected]>:
>>
>>> And the reason why I think this is because of this ticket (Look at the
>>> conversation at the bottom between Emmanuel and Lewis John)
>>>
>>> https://issues.apache.org/jira/browse/NUTCH-978
>>>
>>> On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <[email protected]>
>>> wrote:
>>>
>>>> Hi Julien:
>>>>
>>>> I was under the impression that the nutch community was going to use a
>>>> generic xsl parser, this one:
>>>> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
>>>> Is the nutch community going to use this?
>>>>
>>>>
>>>>
>>>> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Albin,
>>>>>
>>>>> You don't have to have a separate plugin for each html structure you
>>>>> want to parse. You can have a single plugin with multiple
>>>>> HTMLParseFilters.
>>>>>
>>>>> Having a generic extractor with the extraction logic configured in an
>>>>> external file is definitely a good idea and would make a great
>>>>> contribution to the project. In a nutshell, you haven't missed anything
>>>>> and that wheel definitely needs inventing ;-)
>>>>>
>>>>> Best
>>>>>
>>>>> Julien
>>>>>
>>>>>
>>>>> On 25 September 2014 09:24, Albin Vigier <[email protected]> wrote:
>>>>>
>>>>>> Hello everybody,
>>>>>>
>>>>>> I'm just wondering if it is possible to fetch specific metadata with
>>>>>> an existing nutch plugin.
>>>>>>
>>>>>> Let's take an example.
>>>>>> I want to extract some metadata from "div" or "td" tags from html
>>>>>> pages that have specific ids and name them the way I like (this is
>>>>>> done at parser time).
>>>>>> Then, at indexer time, I would use index-metadata (a very good plugin)
>>>>>> to add my custom metadata.
>>>>>>
>>>>>> Currently from what I've seen on the wiki and by quickly analyzing
>>>>>> plugins I suppose I have to code my own plugin each time I've got a
>>>>>> new site (with a new html structure). I've already done that by using
>>>>>> a node walker in a custom htmlParseFilter but the extraction can be a
>>>>>> little bit boring :)
>>>>>>
>>>>>> So on my side I've coded a little plugin that enables me to specify
>>>>>> xpaths in an xml file. But before diving into more functionality I'm
>>>>>> just wondering whether I have missed something.
>>>>>> This work allowed me to explore some nutch aspects but I don't want to
>>>>>> reinvent the wheel or miss something.
>>>>>>
>>>>>> Albin
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Open Source Solutions for Text Engineering
>>>>>
>>>>> http://digitalpebble.blogspot.com/
>>>>> http://www.digitalpebble.com
>>>>> http://twitter.com/digitalpebble
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>>
>>>> Nima Falaki
>>>> Software Engineer
>>>> [email protected]
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>>
>>>
>>> Nima Falaki
>>> Software Engineer
>>> [email protected]
>>>
>>>
>>
>
>
> --
>
>
>
> Nima Falaki
> Software Engineer
> [email protected]
>
>
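
For readers following the thread, the approach Albin describes (XPath expressions kept in an external XML file and mapped to custom metadata names) can be sketched roughly as below. This is only an illustration of the idea, not code from Albin's or Emir's plugin: the rules format, the field names, the sample page and the helper functions are all hypothetical, and it is written in Python with the standard library rather than against the Nutch plugin API. It also assumes the page is well-formed XHTML, which real crawled HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules file: each <field> maps a metadata name to an XPath.
RULES_XML = """
<rules>
  <field name="price" xpath=".//div[@id='price']"/>
  <field name="author" xpath=".//td[@id='author']"/>
</rules>
"""

# Hypothetical, well-formed sample page standing in for a fetched document.
PAGE_XHTML = """
<html><body>
  <div id="price">42 EUR</div>
  <table><tr><td id="author">Albin</td></tr></table>
</body></html>
"""


def load_rules(rules_xml):
    """Return a {metadata_name: xpath} dict read from the rules document."""
    root = ET.fromstring(rules_xml)
    return {f.get("name"): f.get("xpath") for f in root.findall("field")}


def extract_metadata(page_xhtml, rules):
    """Apply each rule to the page and collect the text of the first match."""
    doc = ET.fromstring(page_xhtml)
    metadata = {}
    for name, xpath in rules.items():
        node = doc.find(xpath)
        if node is not None:
            # Join all nested text so inline markup inside the node is kept.
            metadata[name] = "".join(node.itertext()).strip()
    return metadata


rules = load_rules(RULES_XML)
metadata = extract_metadata(PAGE_XHTML, rules)
print(metadata)
```

Supporting a new site then means editing the rules file rather than writing a new node walker, which is the point of the generic extractor discussed above. The extracted names could afterwards be passed to the indexing step, as Albin does with index-metadata.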
