Hi all,

I've written two posts on my blog that describe the XSL plugin and how to use it:
http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/
http://albinscoding.wordpress.com/2014/09/17/fast-nutch-configuration/

The source code is available at
https://code.google.com/p/nutch-parse-xsl-plugin/.
I'll update the Google Code wiki to gather the information from my blog.

If you have any comments, feel free to share them.
As I'm currently using it to crawl various web sites related to searching
for friends, I'll have lots of examples to provide.
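
For anyone reading along, here is a minimal sketch of the general idea
(XPath rules kept as data rather than hard-coded per site). It is not the
plugin's actual code: the class name, the rule names and the sample page
are made up for illustration, it uses only the JDK's XPath API, and it
assumes the input is already well-formed XHTML, which real pages often
are not.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of Nutch or of the plugin.
public class XPathMetadataSketch {

    /**
     * Evaluates each rule (metadata name -> XPath expression) against a
     * well-formed XHTML string and returns the non-empty matches.
     */
    public static Map<String, String> extract(String xhtml, Map<String, String> rules)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            // evaluate(String, Object) returns the string value of the first match,
            // or an empty string when nothing matches.
            String value = xpath.evaluate(rule.getValue(), doc).trim();
            if (!value.isEmpty()) {
                metadata.put(rule.getKey(), value);
            }
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page with a "div" and a "td" carrying specific ids.
        String page = "<html><body>"
                + "<div id=\"profile-name\">Jane Doe</div>"
                + "<table><tr><td id=\"profile-city\">Lyon</td></tr></table>"
                + "</body></html>";

        // In a real setup these rules would be loaded from an external file.
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("name", "//div[@id='profile-name']");
        rules.put("city", "//td[@id='profile-city']");

        System.out.println(extract(page, rules));
    }
}
```

Supporting a new site then only means adding rules, not writing a new
filter. The metadata names chosen in the rules are the ones that would
later be listed for index-metadata at indexing time.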

Have a nice day!

Albin

2014-09-25 16:18 GMT+02:00 Albin Vigier <[email protected]>:

> OK, perfect, so I didn't waste my time. I'm finishing my basic
> implementation for my own needs and I'll post it to Google Code or another
> repo if the community is interested.
> I'll work on a small doc too.
> Thank you for your answer.
>
> On Thu, Sep 25, 2014 at 2:49 PM, Julien Nioche <
> [email protected]> wrote:
>
>> Hi Albin,
>>
>> You don't have to have a separate plugin for each html structure you want
>> to parse. You can have a single plugin with multiple HTMLParseFilters.
>>
>> Having a generic extractor with the extraction logic configured in an
>> external file is definitely a good idea and would make a great contribution
>> to the project. In a nutshell, you haven't missed anything and that wheel
>> definitely needs inventing ;-)
>>
>> Best
>>
>> Julien
>>
>>
>> On 25 September 2014 09:24, Albin Vigier <[email protected]> wrote:
>>
>>> Hello everybody,
>>>
>>> I'm just wondering if it is possible to extract specific metadata with
>>> an existing Nutch plugin.
>>>
>>> Let's take an example.
>>> I want to extract some metadata from "div" or "td" tags from html
>>> pages that have specific ids and name them the way I like (this is
>>> done at parser time).
>>> Then, at indexer time, I would use index-metadata (a very good plugin)
>>> to add my custom metadata.
>>>
>>> Currently, from what I've seen on the wiki and from a quick look at the
>>> plugins, I suppose I have to code my own plugin each time I've got a
>>> new site (with a new HTML structure). I've already done that by using
>>> a node walker in a custom htmlParseFilter, but the extraction can be a
>>> little bit boring :)
>>>
>>> So on my side I've coded a little plugin that lets me specify
>>> XPaths in an XML file. But before diving into more features, I'm
>>> just wondering whether I have missed something.
>>> This work allowed me to explore some aspects of Nutch, but I don't want
>>> to reinvent the wheel.
>>>
>>> Albin
>>>
>>
>>
>>
>> --
>>
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
>
>
