Re: Generic xsl parser plugin

Nima Falaki Thu, 25 Sep 2014 20:47:58 -0700

Hi:

Yes, it would be very interesting. Let me know what Emir says.


Nima

On Thu, Sep 25, 2014 at 12:43 PM, Albinscode <[email protected]> wrote:

> Oh thanks Nima, I did find this topic last year but I thought the project
> was dead. I think there is also a brief reference in the Nutch wiki, but I
> cannot find it now.
>
> It looks like we have the same XSL approach, so it could be interesting to
> share. I'll try to contact Emir while I continue documenting my small
> plugin.
>
> Thanks again for the valuable information!
>
> 2014-09-25 19:19 GMT+02:00 Nima Falaki <[email protected]>:
>
>> The reason I think this is the following ticket (look at the
>> conversation at the bottom between Emmanuel and Lewis John):
>>
>> https://issues.apache.org/jira/browse/NUTCH-978
>>
>> On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <[email protected]>
>> wrote:
>>
>>> Hi Julien:
>>>
>>> I was under the impression that the Nutch community was going to use a
>>> generic XSL parser, namely this one:
>>> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
>>> Is the Nutch community going to use this?
>>>
>>>
>>>
>>> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
>>> [email protected]> wrote:
>>>
>>>> Hi Albin,
>>>>
>>>> You don't have to have a separate plugin for each html structure you
>>>> want to parse. You can have a single plugin with multiple HTMLParseFilters.
>>>>
>>>> Having a generic extractor with the extraction logic configured in an
>>>> external file is definitely a good idea and would make a great contribution
>>>> to the project. In a nutshell, you haven't missed anything and that wheel
>>>> definitely needs inventing ;-)
>>>>
>>>> Best
>>>>
>>>> Julien
>>>>
>>>>
>>>> On 25 September 2014 09:24, Albin Vigier <[email protected]> wrote:
>>>>
>>>>> Hello everybody,
>>>>>
>>>>> I'm just wondering if it is possible to fetch specific metadata with
>>>>> an existing nutch plugin.
>>>>>
>>>>> Let's take an example.
>>>>> I want to extract some metadata from "div" or "td" tags from html
>>>>> pages that have specific ids and name them the way I like (this is
>>>>> done at parser time).
>>>>> Then, at indexer time, I would use index-metadata (a very good plugin)
>>>>> to add my custom metadata.
>>>>>
>>>>> Currently, from what I've seen on the wiki and from a quick look at
>>>>> the plugins, I suppose I have to code my own plugin each time I've got
>>>>> a new site (with a new html structure). I've already done that by using
>>>>> a node walker in a custom HtmlParseFilter, but the extraction can be a
>>>>> little bit tedious :)
>>>>>
>>>>> So on my side I've coded a little plugin that lets me specify
>>>>> XPaths in an XML file. But before diving into more functionality I'm
>>>>> just wondering if I have missed something.
>>>>> This work allowed me to explore some Nutch aspects, but I don't want to
>>>>> reinvent the wheel.
>>>>>
>>>>> Albin
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Open Source Solutions for Text Engineering
>>>>
>>>> http://digitalpebble.blogspot.com/
>>>> http://www.digitalpebble.com
>>>> http://twitter.com/digitalpebble
>>>>
>>>
>>>
>>>
>>> --
>>>
>>>
>>>
>>> Nima Falaki
>>> Software Engineer
>>> [email protected]
>>>
>>>
>>
>>
>> --
>>
>>
>>
>> Nima Falaki
>> Software Engineer
>> [email protected]
>>
>>
>


-- 



Nima Falaki
Software Engineer
[email protected]
