Re: Generic xsl parser plugin

Albinscode Fri, 03 Oct 2014 01:05:18 -0700

Hello Sebastian,

Thank you for taking a look at the overall mechanism.
I've tried to keep it as simple as possible so that the focus stays on
"what to extract?".


I currently have lots of needs (and therefore ideas). The code will
naturally evolve (for example, XSLT 2.0 support), and I would be happy
to contribute it in full to the community.

Of course, I'll create a JIRA issue and prepare a patch. I'll take the
time to make it as clean as possible.

Thank you for your interest.

2014-10-03 6:59 GMT+02:00 Mattmann, Chris A (3980)
<[email protected]>:
> Agree with Sebastian, if we could make this part of Nutch it
> would be great, as I think it would help us do page scraping
> a lot better!
>
> What do you think Albin?
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Sebastian Nagel <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Thursday, October 2, 2014 at 3:03 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: Generic xsl parser plugin
>
>>Hi Albin,
>>
>>the plugin looks very nice!
>>I like the clean and extensible way in which
>>fields are filled by XPath statements.
>>To use XSLT functions to do the cleansing
>>of extracted text (you hardly ever can do without!)
>>is an excellent idea!
>>
>>I hope to find the time soon to look at it in more detail
>>and give it a trial.
>>
>>Even more I would like to see the plugin as part of Nutch.
>>Are you willing to open a Jira for it and provide a patch?
>>
>>Thanks a lot,
>>Sebastian
>>
>>On 10/02/2014 10:26 AM, Albinscode wrote:
>>> Hi all,
>>>
>>> I've created two posts on my blog to describe and use the xsl plugin:
>>>
>>>http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/
>>> http://albinscoding.wordpress.com/2014/09/17/fast-nutch-configuration/
>>>
>>> The source code is available on
>>>https://code.google.com/p/nutch-parse-xsl-plugin/.
>>> I'll update the google code wiki to gather information from my blog.
>>>
>>> If you have any comments, feel free to share them.
>>> As I'm currently using it to crawl different web sites related to
>>> searching friends, I'll have lots of examples to provide.
>>>
>>> Have a nice day!
>>>
>>> Albin
>>>
>>> 2014-09-25 16:18 GMT+02:00 Albin Vigier <[email protected]
>>><mailto:[email protected]>>:
>>>
>>>     Ok, perfect, so I didn't waste my time. I'm finishing my basic
>>>     implementation for my own needs and I'll post it to google code
>>>     or another repo if the community is interested.
>>>     I'll work on a small doc too.
>>>     Thank you for your answer.
>>>
>>>     On Thu, Sep 25, 2014 at 2:49 PM, Julien Nioche
>>><[email protected]
>>>     <mailto:[email protected]>> wrote:
>>>
>>>         Hi Albin,
>>>
>>>         You don't have to have a separate plugin for each html
>>>         structure you want to parse. You can have a single plugin
>>>         with multiple HTMLParseFilters.
>>>
>>>         Having a generic extractor with the extraction logic
>>>         configured in an external file is definitely a good idea and
>>>         would make a great contribution to the project. In a nutshell,
>>>         you haven't missed anything and that wheel definitely needs
>>>         inventing ;-)
>>>
>>>         Best
>>>
>>>         Julien
>>>
>>>
>>>         On 25 September 2014 09:24, Albin Vigier <[email protected]
>>>         <mailto:[email protected]>> wrote:
>>>
>>>             Hello everybody,
>>>
>>>             I'm just wondering if it is possible to fetch specific
>>>             metadata with an existing nutch plugin.
>>>
>>>             Let's take an example.
>>>             I want to extract some metadata from "div" or "td" tags
>>>             from html pages that have specific ids and name them the
>>>             way I like (this is done at parser time).
>>>             Then, at indexer time, I would use index-metadata (a very
>>>             good plugin) to add my custom metadata.
>>>
>>>             Currently, from what I've seen on the wiki and by quickly
>>>             analyzing plugins, I suppose I have to code my own plugin
>>>             each time I've got a new site (with a new html structure).
>>>             I've already done that by using a node walker in a custom
>>>             htmlParseFilter but the extraction can be a little bit
>>>             boring :)
>>>
>>>             So on my side I've coded a little plugin that enables me
>>>             to specify xpaths in an xml file. But before diving into
>>>             more functionality, I'm just wondering whether I missed
>>>             something.
>>>             This work allowed me to explore some nutch aspects but I
>>>             don't want to reinvent the wheel or miss something.
>>>
>>>             Albin
>>>
>>>
>>>
>>>
>>>         --
>>>         *
>>>         *Open Source Solutions for Text Engineering
>>>
>>>         http://digitalpebble.blogspot.com/
>>>         http://www.digitalpebble.com
>>>         http://twitter.com/digitalpebble
>>>
>>>
>>>
>>
>
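
The idea discussed in this thread (XPath expressions kept in an external
XML file, each one filling a named metadata field, with some cleansing of
the extracted text) can be sketched roughly as below. This is only an
illustration in Python, not the plugin's code: the rules format, the field
names "title" and "price", and the sample page are all hypothetical. It
relies on the limited XPath subset of xml.etree.ElementTree, so it assumes
well-formed markup, and plain whitespace normalization stands in for the
XSLT-based cleansing mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules file: each <field> maps a metadata name to an XPath.
RULES_XML = """
<rules>
  <field name="title" xpath=".//div[@id='title']"/>
  <field name="price" xpath=".//td[@id='price']"/>
</rules>
"""

# Hypothetical, well-formed sample page (real HTML would need a lenient parser).
PAGE = """
<html><body>
  <div id="title">  Blue   bicycle </div>
  <table><tr><td id="price"> 120 EUR </td></tr></table>
</body></html>
"""


def load_rules(rules_xml):
    """Read the field-name -> XPath mapping from the rules document."""
    root = ET.fromstring(rules_xml)
    return {f.get("name"): f.get("xpath") for f in root.findall("field")}


def extract(page_xml, rules):
    """Apply each XPath to the page and collect cleaned text per field."""
    doc = ET.fromstring(page_xml)
    metadata = {}
    for name, xpath in rules.items():
        node = doc.find(xpath)  # first matching element, or None
        if node is not None:
            text = "".join(node.itertext())
            # Collapse runs of whitespace; a stand-in for real cleansing.
            metadata[name] = " ".join(text.split())
    return metadata


metadata = extract(PAGE, load_rules(RULES_XML))
print(metadata)
```

Supporting a new site then means writing a new rules file rather than a
new plugin, which is the point made earlier in the thread.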
