The reason I think this is the following ticket (see the conversation at
the bottom between Emmanuel and Lewis John):

https://issues.apache.org/jira/browse/NUTCH-978
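
To make the idea discussed in the thread below concrete, here is a minimal
sketch of a generic extractor whose rules (metadata name plus XPath) live in
an external XML file. It uses only the JDK's DOM and XPath classes and is not
Nutch plugin code: the rules format, the class name XPathMetadataSketch and
the sample page are all assumptions made up for illustration. In a real
plugin this logic would presumably sit inside an HtmlParseFilter, working on
the DOM that Nutch has already parsed.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathMetadataSketch {

    // Hypothetical rules file content: each <field> maps a metadata name to an XPath.
    static final String RULES =
        "<rules>"
      + "<field name='price' xpath=\"//td[@id='price']\"/>"
      + "<field name='author' xpath=\"//div[@id='author']\"/>"
      + "</rules>";

    // Hypothetical, well-formed sample page (real-world HTML needs a tolerant parser).
    static final String PAGE =
        "<html><body>"
      + "<div id='author'>Jane Doe</div>"
      + "<table><tr><td id='price'>42 EUR</td></tr></table>"
      + "</body></html>";

    // Parse an XML string into a DOM document.
    static Document parseXml(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // Apply every rule to the page and collect the non-empty results,
    // keeping the order in which the rules are declared.
    static Map<String, String> extract(Document rules, Document page) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> metadata = new LinkedHashMap<>();
        NodeList fields = rules.getElementsByTagName("field");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            // evaluate(String, Object) returns the string value of the first match.
            String value = xpath.evaluate(field.getAttribute("xpath"), page).trim();
            if (!value.isEmpty()) {
                metadata.put(field.getAttribute("name"), value);
            }
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> metadata = extract(parseXml(RULES), parseXml(PAGE));
        System.out.println(metadata);
    }
}
```

With this layout, supporting a new site means adding <field> entries to the
rules file rather than writing another plugin, and the extracted names could
then be picked up at indexing time by something like index-metadata.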

On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <[email protected]> wrote:

> Hi Julien:
>
> I was under the impression that the Nutch community was going to use a
> generic xls parser, namely this one:
> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
> Is the Nutch community going to use it?
>
>
>
> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
> [email protected]> wrote:
>
>> Hi Albin,
>>
>> You don't have to have a separate plugin for each html structure you want
>> to parse. You can have a single plugin with multiple HTMLParseFilters.
>>
>> Having a generic extractor with the extraction logic configured in an
>> external file is definitely a good idea and would make a great contribution
>> to the project. In a nutshell, you haven't missed anything and that wheel
>> definitely needs inventing ;-)
>>
>> Best
>>
>> Julien
>>
>>
>> On 25 September 2014 09:24, Albin Vigier <[email protected]> wrote:
>>
>>> Hello everybody,
>>>
>>> I'm just wondering if it is possible to fetch specific metadata with
>>> an existing nutch plugin.
>>>
>>> Let's take an example.
>>> I want to extract some metadata from "div" or "td" tags from html
>>> pages that have specific ids and name them the way I like (this is
>>> done at parser time).
>>> Then, at indexer time, I would use index-metadata (a very good plugin)
>>> to add my custom metadata.
>>>
>>> Currently, from what I've seen on the wiki and by quickly analyzing
>>> plugins, I suppose I have to code my own plugin each time I've got a
>>> new site (with a new html structure). I've already done that by using
>>> a node walker in a custom htmlParseFilter, but the extraction can be a
>>> little bit boring :)
>>>
>>> So on my side I've coded a little plugin that enables me to specify
>>> xpaths in an xml file. But before diving into more functionality, I'm
>>> just wondering whether I have missed something.
>>> This work allowed me to explore some aspects of Nutch, but I don't want
>>> to reinvent the wheel.
>>>
>>> Albin
>>>
>>
>>
>>
>> --
>>
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
>
>
>
> --
>
>
>
> Nima Falaki
> Software Engineer
> [email protected]
>
>


-- 



Nima Falaki
Software Engineer
[email protected]
