[ https://issues.apache.org/jira/browse/NUTCH-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174212#comment-14174212 ]
Sebastian Nagel commented on NUTCH-1870:
----------------------------------------

> 1. I'll regenerate one patch soon. If Sebastian has nothing more to add,
> i'll provide it at end of the week.

Please, go on. There are enough changes, to wait for...

> 2. it's hard to give my baby

you'll get mentioned in CHANGES.txt :)

> 3. (implement own indexing filter instead of using index-metadata to add
> extracted fields)

It's better if plugins do not depend on other plugins. Parse-xsl is more powerful than parse-metatags (but more difficult to configure). So if you need parse-xsl, you'll probably also use it for simple meta tag extraction.

Neko converts element names to uppercase, that's why {{xpath = xpath.toUpperCase();}}, right? However, that breaks XPath statements containing attributes (canonically lowercase, cf. [[1|http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#ID-5DFED1F0]]). Selectively converting only the element names in XPath statements would require parsing them, which is hard (if not impossible) without a library.

Also, {{.toUpperCase()}} without an explicit locale (here: Locale.ENGLISH) is sensitive to the system's locale, see NUTCH-1807.

> Generic xsl parser plugin
> -------------------------
>
> Key: NUTCH-1870
> URL: https://issues.apache.org/jira/browse/NUTCH-1870
> Project: Nutch
> Issue Type: New Feature
> Components: indexer, parser
> Affects Versions: 1.9
> Reporter: Albinscode
> Fix For: 1.10
>
> Attachments: xsl-parse-plugin.patch
>
>
> The aim of this plugin is to use XSLT to extract metadata from HTML DOM
> structures.
> | Your Data | --> | Parse-html plugin or TIKA plugin | --> | DOM structure |
> --> |XSLT plugin |
>
>
> The main advantage is that:
> - You won't have to produce any java code, only XSLT and configuration
> - It can process DOM structure from DocumentFragment (@see NekoHtml and @see
> TagSoup)
> - It is HtmlParseFilter plugin compatible and can be plugged as any other
> plugin (parse-js, parse-swf, etc...)
> This topic has been discussed on
> http://www.mail-archive.com/dev%40nutch.apache.org/msg15257.html

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
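
For illustration, here is a minimal sketch of the two {{toUpperCase()}} pitfalls raised in the comment on NUTCH-1870. It is not taken from the patch: the class name {{XPathCaseDemo}}, the helper {{upperCaseXPath}} and the sample XPath are hypothetical, and the Turkish locale merely stands in for a system locale with non-English case rules.

```java
import java.util.Locale;

public class XPathCaseDemo {

    // Hypothetical helper (not from the patch): uppercases the whole
    // XPath string using the given locale's case-mapping rules.
    static String upperCaseXPath(String xpath, Locale locale) {
        return xpath.toUpperCase(locale);
    }

    public static void main(String[] args) {
        // Sample XPath with an element name, an attribute name and a literal.
        String xpath = "//div[@id='title']";

        // With an explicit English locale the result is predictable, but the
        // attribute name and the string literal are uppercased as well,
        // so the statement no longer matches lowercase attributes.
        System.out.println(upperCaseXPath(xpath, Locale.ENGLISH));

        // Under Turkish rules, 'i' maps to dotted capital I (U+0130),
        // so even the element name no longer equals "DIV".
        Locale turkish = Locale.forLanguageTag("tr-TR");
        System.out.println(upperCaseXPath(xpath, turkish));
    }
}
```

Passing {{Locale.ENGLISH}} (or {{Locale.ROOT}}) explicitly only addresses the locale sensitivity; the attribute names would still be uppercased unless the XPath is actually parsed.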