Re: Generic xsl parser plugin

Albinscode Sun, 05 Oct 2014 13:10:36 -0700

@Chris Thank you for your suggestion too.

As requested, I've created
https://issues.apache.org/jira/browse/NUTCH-1870 and provided a patch.


Feel free to give me feedback. I'll continue working on my branch ;)

2014-10-03 10:03 GMT+02:00 Albinscode <[email protected]>:
> Hello Sebastian,
>
> Thank you for taking a look at the overall mechanism.
> I've tried to keep it as simple as possible, to focus on "what to extract?".
>
> Currently I've got lots of needs (and so ideas). The code will
> naturally evolve (support of XSLT 2.0) and I would be happy to fully
> give this code to the community.
>
> Of course, I'll create a JIRA and prepare a patch. I'll take the time
> to make it as clean as possible.
>
> Thank you for your interest.
>
> 2014-10-03 6:59 GMT+02:00 Mattmann, Chris A (3980)
> <[email protected]>:
>> Agree with Sebastian, if we could make this part of Nutch it
>> would be great, as I think it would help us do page scraping
>> a lot better!
>>
>> What do you think Albin?
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: [email protected]
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>> -----Original Message-----
>> From: Sebastian Nagel <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Thursday, October 2, 2014 at 3:03 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Generic xsl parser plugin
>>
>>>Hi Albin,
>>>
>>>the plugin looks very nice!
>>>I like the clean and extensible way how
>>>fields are filled by XPath statements.
>>>To use XSLT functions to do the cleansing
>>>of extracted text (you hardly ever can do without!)
>>>is an excellent idea!
>>>
>>>I hope to find the time soon to look at it in more detail
>>>and give it a trial.
>>>
>>>Even more I would like to see the plugin as part of Nutch.
>>>Are you willing to open a Jira for it and provide a patch?
>>>
>>>Thanks a lot,
>>>Sebastian
>>>
>>>On 10/02/2014 10:26 AM, Albinscode wrote:
>>>> Hi all,
>>>>
>>>> I've created two posts on my blog to describe and use the xsl plugin:
>>>>
>>>> http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/
>>>> http://albinscoding.wordpress.com/2014/09/17/fast-nutch-configuration/
>>>>
>>>> The source code is available on
>>>> https://code.google.com/p/nutch-parse-xsl-plugin/.
>>>> I'll update the Google Code wiki to gather information from my blog.
>>>>
>>>> If you have any comments, feel free to share them.
>>>> As I'm currently using it to crawl different web sites related to
>>>> searching friends, I'll have lots of examples to provide.
>>>>
>>>> Have a nice day!
>>>>
>>>> Albin
>>>>
>>>> 2014-09-25 16:18 GMT+02:00 Albin Vigier <[email protected]
>>>><mailto:[email protected]>>:
>>>>
>>>>     Ok, perfect, so I didn't waste my time. I'm finishing my basic
>>>>implementation for my own needs
>>>>     and I'll post it to google code or other repo if the community is
>>>>interested.
>>>>     I'll work on a small doc too.
>>>>     Thank you for your answer.
>>>>
>>>>     On Thu, Sep 25, 2014 at 2:49 PM, Julien Nioche
>>>><[email protected]
>>>>     <mailto:[email protected]>> wrote:
>>>>
>>>>         Hi Albin,
>>>>
>>>>         You don't have to have a separate plugin for each html
>>>>structure you want to parse. You can
>>>>         have a single plugin with multiple HTMLParseFilters.
>>>>
>>>>         Having a generic extractor with the extraction logic configured
>>>>in an external file is
>>>>         definitely a good idea and would make a great contribution to
>>>>the project. In a nutshell,
>>>>         you haven't missed anything and that wheel definitely needs
>>>>inventing ;-)
>>>>
>>>>         Best
>>>>
>>>>         Julien
>>>>
>>>>
>>>>         On 25 September 2014 09:24, Albin Vigier <[email protected]
>>>>         <mailto:[email protected]>> wrote:
>>>>
>>>>             Hello everybody,
>>>>
>>>>             I'm just wondering if it is possible to fetch specific
>>>>metadata with
>>>>             an existing nutch plugin.
>>>>
>>>>             Let's take an example.
>>>>             I want to extract some metadata from "div" or "td" tags
>>>>from html
>>>>             pages that have specific ids and name them the way I like
>>>>(this is
>>>>             done at parser time).
>>>>             Then, at indexer time, I would use index-metadata (a very
>>>>good plugin)
>>>>             to add my custom metadata.
>>>>
>>>>             Currently from what I've seen on the wiki and by quickly
>>>>analyzing
>>>>             plugins I suppose I have to code my own plugin each time
>>>>I've got a
>>>>             new site (with a new html structure). I've already done
>>>>that by using
>>>>             a node walker in a custom htmlParseFilter but the
>>>>extraction can be a
>>>>             little bit boring :)
>>>>
>>>>             So on my side I've coded a little plugin that enables me to
>>>>specify
>>>>             xpaths in an xml file. But before diving into more
>>>>functionalities I'm
>>>>             just wondering if I did not miss something.
>>>>             This work allowed me to explore some nutch aspects but I
>>>>don't want to
>>>>             reinvent the wheel or miss something.
>>>>
>>>>             Albin
>>>>
>>>>
>>>>
>>>>
>>>>         --
>>>>         *
>>>>         *Open Source Solutions for Text Engineering
>>>>
>>>>         http://digitalpebble.blogspot.com/
>>>>         http://www.digitalpebble.com
>>>>         http://twitter.com/digitalpebble
>>>>
>>>>
>>>>
>>>
>>
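
For anyone reading the thread without the plugin at hand, the idea
discussed above (metadata field names mapped to XPath expressions kept
outside the code) can be sketched roughly as follows. This is not the
plugin's code and not a Nutch API: it is a small Python illustration using
the standard library's ElementTree, which supports only a subset of XPath
and needs well-formed (XHTML-like) input. The names RULES, PAGE and
extract_fields are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: metadata field name -> XPath expression.
# In the approach discussed in the thread these would live in an external file.
RULES = {
    "title": ".//div[@id='title']",
    "price": ".//td[@id='price']",
}


def extract_fields(xhtml: str, rules: dict) -> dict:
    """Return {field: text} for every rule whose XPath matches an element."""
    root = ET.fromstring(xhtml)
    fields = {}
    for name, xpath in rules.items():
        node = root.find(xpath)  # first matching element, or None
        if node is not None:
            # itertext() gathers text from nested tags;
            # split/join collapses runs of whitespace.
            fields[name] = " ".join("".join(node.itertext()).split())
    return fields


# Made-up sample page with elements identified by id, as in the original question.
PAGE = """<html><body>
  <div id="title">  Blue <b>widget</b> </div>
  <table><tr><td id="price">12.50</td></tr></table>
</body></html>"""

if __name__ == "__main__":
    print(extract_fields(PAGE, RULES))
```

Supporting a new site layout then means editing the rules rather than
writing a new parse filter, which is the point made earlier in the thread.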
