Albinscode created NUTCH-1870:
---------------------------------

             Summary: Generic xsl parser plugin
                 Key: NUTCH-1870
                 URL: https://issues.apache.org/jira/browse/NUTCH-1870
             Project: Nutch
          Issue Type: New Feature
          Components: indexer, parser
    Affects Versions: 1.9
            Reporter: Albinscode
             Fix For: 1.9


The aim of this plugin is to use XSLT to extract metadata from HTML DOM 
structures.

| Your Data | --> | parse-html plugin or Tika plugin | --> | DOM structure | --> | XSLT plugin |
                  
                  
The main advantages are:
- You do not have to write any Java code, only XSLT and configuration.
- It can process a DOM structure supplied as a DocumentFragment (see NekoHTML 
and TagSoup).
- It is compatible with HtmlParseFilter and can be plugged in like any other 
plugin (parse-js, parse-swf, etc.).
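
To illustrate the idea of extracting metadata from a DOM with XSLT alone, here 
is a minimal sketch that uses only the JDK's standard JAXP classes. It is not 
the plugin's code and does not use any Nutch API: the class name 
XsltMetadataSketch, the stylesheet, the "name=value" output format and the 
sample page are all assumptions made for this example. For simplicity it 
transforms a whole DOM Document built from well-formed markup, rather than a 
DocumentFragment produced by NekoHTML or TagSoup.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XsltMetadataSketch {

    // Hypothetical stylesheet: writes one "name=value" line per extracted field.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "title=<xsl:value-of select='normalize-space(/html/head/title)'/>"
      + "<xsl:text>&#10;</xsl:text>"
      + "author=<xsl:value-of select=\"/html/head/meta[@name='author']/@content\"/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Stand-in for the HTML parsing step: builds a DOM from well-formed markup.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(xml)));
    }

    // Applies the stylesheet to the DOM and returns the extracted metadata as text.
    static String extract(Document dom) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page, used only to exercise the sketch.
        String page =
            "<html><head><title> Example page </title>"
          + "<meta name='author' content='Jane Doe'/></head><body/></html>";
        // Prints two lines: "title=Example page" and "author=Jane Doe"
        System.out.println(extract(parse(page)));
    }
}
```

Changing which fields are extracted only requires editing the stylesheet; the 
Java side stays the same, which is the point of the first advantage above.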

This topic was discussed on the dev mailing list: 
http://www.mail-archive.com/dev%40nutch.apache.org/msg15257.html






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
