Perfect, things are moving fast here ;) I've looked at Emir's code, but it has a limitation: it only accepts XPath expressions, not full XSL, so it doesn't fit my needs. I need to perform real transformations (with xsl:for-each and custom XSL functions, not only XPath).
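To illustrate the difference between a single XPath lookup and a real transformation, here is a minimal sketch using only the JDK's built-in javax.xml.transform API. It is not the plugin's code: the class name XslExtractSketch, the stylesheet, the td class "price" and the fields/field output format are all made up for the example, and the input is assumed to be well-formed XHTML already.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XslExtractSketch {

    // Hypothetical stylesheet: emits one <field> per <td class="price">.
    // The xsl:for-each loop is the part a list of plain XPath expressions cannot express.
    static final String XSL =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<fields>"
        + "<xsl:for-each select=\"//td[@class='price']\">"
        + "<field name='price'><xsl:value-of select='normalize-space(.)'/></field>"
        + "</xsl:for-each>"
        + "</fields>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Parses the (assumed well-formed) markup into a DOM and applies the stylesheet to it.
    static String transform(String xhtml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // XSLT processors expect a namespace-aware DOM
        Document doc = dbf.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xhtml)));

        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));

        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><table><tr>"
                    + "<td class='price'> 10 EUR </td><td class='price'>12 EUR</td>"
                    + "</tr></table></body></html>";
        System.out.println(transform(page));
    }
}
```

In a real parse filter the DOMSource would presumably wrap the DOM handed over by the HTML parser rather than a freshly parsed string; that part is left out here because it is exactly where the DocumentFragment issue shows up.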
Another thing: when implementing the parser, I ran into a problem when trying to apply XPath to the DocumentFragment that is already provided (generated by htmlParser or tikaParser). It seems Emir hit a problem too, because he recreates the whole DOM from the raw content instead of reusing it. He then cleans up the DOM nodes to XMLize them with another HTML node cleaner (html cleaner) instead of the already used NekoHTML or TagSoup. I think I'll post a new subject on this mailing list and ask Emir, because it could be a performance issue for both our plugins ;)

I've written a HOWTO describing the main mechanism and a comparison with the NodeWalker implementation. I'm doing some cleanup and then I'll upload the code:
http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/

2014-09-26 10:26 GMT+02:00 Julien Nioche <[email protected]>:

> Hi Nima
>
> Thanks for reminding me about this JIRA issue, it hasn't been commented on
> for some time and I'd forgotten about it. Judging by the discussion on
> NUTCH-978 <https://issues.apache.org/jira/browse/NUTCH-978> things got
> stuck when Emmanuel tried to get in touch with Emir (who in the meantime
> seems to have stopped using Nutch - see
> http://www.atlantbh.com/book-review-web-crawling-and-data-mining-with-apache-nutch/
> ).
>
> It would be a good thing to get in touch with him indeed, alternatively
> Albin's plugin could be a good starting point. There clearly is a need for
> such a functionality and quite a few people keen to make it happen.
>
> Thanks
>
> Julien
>
>
> On 25 September 2014 18:19, Nima Falaki <[email protected]> wrote:
>
>> And the reason why I think this is because of this ticket (Look at the
>> conversation at the bottom between Emmanuel and Lewis John)
>>
>> https://issues.apache.org/jira/browse/NUTCH-978
>>
>> On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <[email protected]>
>> wrote:
>>
>>> Hi Julien:
>>>
>>> I was under the impression that the nutch community was going to use a
>>> generic xls parser?
>>> This one.
>>> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/ Is
>>> the nutch community going to use this?
>>>
>>>
>>>
>>> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
>>> [email protected]> wrote:
>>>
>>>> Hi Albin,
>>>>
>>>> You don't have to have a separate plugin for each html structure you
>>>> want to parse. You can have a single plugin with multiple HTMLParseFilters.
>>>>
>>>> Having a generic extractor with the extraction logic configured in an
>>>> external file is definitely a good idea and would make a great contribution
>>>> to the project. In a nutshell, you haven't missed anything and that wheel
>>>> definitely needs inventing ;-)
>>>>
>>>> Best
>>>>
>>>> Julien
>>>>
>>>>
>>>> On 25 September 2014 09:24, Albin Vigier <[email protected]> wrote:
>>>>
>>>>> Hello everybody,
>>>>>
>>>>> I'm just wondering if it is possible to fetch specific metadata with
>>>>> an existing nutch plugin.
>>>>>
>>>>> Let's take an example.
>>>>> I want to extract some metadata from "div" or "td" tags from html
>>>>> pages that have specific ids and name them the way I like (this is
>>>>> done at parser time).
>>>>> Then, at indexer time, I would use index-metadata (a very good plugin)
>>>>> to add my custom metadata.
>>>>>
>>>>> Currently from what I've seen on the wiki and by quickly analyzing
>>>>> plugins I suppose I have to code my own plugin each time I've got a
>>>>> new site (with a new html structure). I've already done that by using
>>>>> a node walker in a custom htmlParseFilter but the extraction can be a
>>>>> little bit boring :)
>>>>>
>>>>> So on my side i've coded a little plugin that enables me to specify
>>>>> xpaths in an xml file. But before diving into more functionalities I'm
>>>>> just wondering if I did not missed something.
>>>>> This work allowed me to explore some nutch aspects but I don't want to
>>>>> reinvent the wheel or miss something.
>>>>>
>>>>> Albin
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Open Source Solutions for Text Engineering
>>>>
>>>> http://digitalpebble.blogspot.com/
>>>> http://www.digitalpebble.com
>>>> http://twitter.com/digitalpebble
>>>>
>>>
>>>
>>> --
>>>
>>> Nima Falaki
>>> Software Engineer
>>> [email protected]
>>>
>>
>>
>> --
>>
>> Nima Falaki
>> Software Engineer
>> [email protected]
>>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>

