@Chris Thank you for your suggestion too. As requested, I've created https://issues.apache.org/jira/browse/NUTCH-1870 and provided a patch.
Feel free to give me feedback. I'll continue working on my branch ;)

2014-10-03 10:03 GMT+02:00 Albinscode <[email protected]>:
> Hello Sebastian,
>
> Thank you for having taken a look at the global mechanism.
> I've tried to make it as simple as possible to focus on "what to extract?".
>
> Currently I've got lots of needs (and so ideas). The code will
> naturally evolve (support of XSLT 2.0) and I would be happy to fully
> give this code to the community.
>
> Of course, I'll create a JIRA and prepare a patch. I'll take the time
> to provide it as clean as possible.
>
> Thank you for your interest.
>
> 2014-10-03 6:59 GMT+02:00 Mattmann, Chris A (3980)
> <[email protected]>:
>> Agree with Sebastian, if we could make this part of Nutch it
>> would be great, as I think it would help us do page scraping
>> a lot better!
>>
>> What do you think Albin?
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: [email protected]
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>> -----Original Message-----
>> From: Sebastian Nagel <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Thursday, October 2, 2014 at 3:03 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Generic xsl parser plugin
>>
>>>Hi Albin,
>>>
>>>the plugin looks very nice!
>>>I like the clean and extensible way in which
>>>fields are filled by XPath statements.
>>>To use XSLT functions to do the cleansing
>>>of extracted text (you hardly ever can do without!)
>>>is an excellent idea!
>>>
>>>I hope to find the time soon to look at it in more detail
>>>and give it a trial.
>>>
>>>Even more I would like to see the plugin as part of Nutch.
>>>Are you willing to open a Jira for it and provide a patch?
>>>
>>>Thanks a lot,
>>>Sebastian
>>>
>>>On 10/02/2014 10:26 AM, Albinscode wrote:
>>>> Hi all,
>>>>
>>>> I've created two posts on my blog to describe and use the xsl plugin:
>>>>
>>>> http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/
>>>> http://albinscoding.wordpress.com/2014/09/17/fast-nutch-configuration/
>>>>
>>>> The source code is available on
>>>> https://code.google.com/p/nutch-parse-xsl-plugin/.
>>>> I'll update the google code wiki to gather information from my blog.
>>>>
>>>> If you have any comments, feel free to share them.
>>>> As I'm currently using it to crawl different web sites related to
>>>> searching friends I'll have lots of examples to provide.
>>>>
>>>> Have a nice day!
>>>>
>>>> Albin
>>>>
>>>> 2014-09-25 16:18 GMT+02:00 Albin Vigier <[email protected]>:
>>>>
>>>> Ok, perfect, so I didn't waste my time. I'm finishing my basic
>>>> implementation for my own needs and I'll post it to google code
>>>> or other repo if the community is interested.
>>>> I'll work on a small doc too.
>>>> Thank you for your answer.
>>>>
>>>> On Thu, Sep 25, 2014 at 2:49 PM, Julien Nioche
>>>> <[email protected]> wrote:
>>>>
>>>> Hi Albin,
>>>>
>>>> You don't have to have a separate plugin for each html structure
>>>> you want to parse. You can have a single plugin with multiple
>>>> HTMLParseFilters.
>>>>
>>>> Having a generic extractor with the extraction logic configured
>>>> in an external file is definitely a good idea and would make a
>>>> great contribution to the project. In a nutshell, you haven't
>>>> missed anything and that wheel definitely needs inventing ;-)
>>>>
>>>> Best
>>>>
>>>> Julien
>>>>
>>>> On 25 September 2014 09:24, Albin Vigier <[email protected]> wrote:
>>>>
>>>> Hello everybody,
>>>>
>>>> I'm just wondering if it is possible to fetch specific metadata
>>>> with an existing nutch plugin.
>>>>
>>>> Let's take an example.
>>>> I want to extract some metadata from "div" or "td" tags from html
>>>> pages that have specific ids and name them the way I like (this is
>>>> done at parser time).
>>>> Then, at indexer time, I would use index-metadata (a very good
>>>> plugin) to add my custom metadata.
>>>>
>>>> Currently from what I've seen on the wiki and by quickly analyzing
>>>> plugins I suppose I have to code my own plugin each time I've got a
>>>> new site (with a new html structure). I've already done that by
>>>> using a node walker in a custom htmlParseFilter but the extraction
>>>> can be a little bit boring :)
>>>>
>>>> So on my side I've coded a little plugin that enables me to specify
>>>> xpaths in an xml file. But before diving into more functionalities
>>>> I'm just wondering if I have missed something.
>>>> This work allowed me to explore some nutch aspects but I don't want
>>>> to reinvent the wheel or miss something.
>>>>
>>>> Albin
>>>>
>>>> --
>>>> Open Source Solutions for Text Engineering
>>>>
>>>> http://digitalpebble.blogspot.com/
>>>> http://www.digitalpebble.com
>>>> http://twitter.com/digitalpebble

