[ 
https://issues.apache.org/jira/browse/NUTCH-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205385#comment-14205385
 ] 

Sebastian Nagel commented on NUTCH-1870:
----------------------------------------

Hi [~Albinscode], simple and funny example :)!

I've added a patch which
* includes boilerplate to build, test, generate javadoc
* makes the tests run (but only from src/plugin/parse-xsl via "ant test")
* various minor changes
* javadoc
** added package.info in org.apache.nutch.parse.xsl
** auto-generated JAXB packages are suppressed. Or do we need javadocs for 
these classes?
* attribute "filterUrlsWithNoRule" belongs to the element "rules", right? -> 
changed in the sample

The plugin is working now! I'll continue testing with more complex transforms 
(to get the full power of XSL).

Meanwhile a few points which could require review or rework:
* load all configuration files from class path, e.g.
{code}
Reader reader = conf.getConfResourceAsReader(rulesFile);
{code}
That's important if Nutch is run via Hadoop: class and configuration files are 
wrapped into one single job file. There are no "real" files which can be loaded.
This also applies to running the unit tests: we cannot rely on them being 
executed from a specific working directory.
* reading config files on-demand and multiple times is not really efficient. 
It's better to read and parse all configuration files during setConf(). Sorry, 
maybe my comment before was not 100% clear at this point, but setConf() should 
be the best place:
** errors in configuration are caught early, and are less likely to be 
overlooked than if they happen somewhere in the middle of parsing a segment
** inside setConf() you do not have to take care of thread-safety
** setConf() is called only once
** parsing should be fast and there is a strict timeout (30 sec. by default)
* regarding thread-safety: the trade-off should be minimal. Making RulesManager 
a local variable seems too much and is in contradiction to the previous point 
(loading config files). Wouldn't it be sufficient to make only those objects 
thread-local which are unsafe and need to be used from filter()? E.g., 
{{javax.xml.transform.Transformer}} is definitely not thread-safe (we need to 
check other javax classes). But it should be possible to get a Transformer 
without reading the xsl file again every time.
* what about fields with multiple values? An expression can match multiple 
times, but it looks like only the first match is extracted.
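
To illustrate the setConf() and thread-safety points above, here is a minimal 
sketch using only the JDK. It is not taken from the patch: the class name 
{{CompiledStylesheet}} and its methods are hypothetical. The idea is to parse 
the stylesheet once into a {{javax.xml.transform.Templates}} object (which the 
JAXP javadoc specifies as thread-safe) and to keep only the Transformer 
thread-local, so the xsl file is never read again during filter().

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper (not part of the patch): one compiled stylesheet. */
class CompiledStylesheet {

  /** Compiled once; Templates may be shared between threads. */
  private final Templates templates;

  /** Transformer is not thread-safe, so every thread gets its own. */
  private final ThreadLocal<Transformer> transformer;

  /**
   * Reads and compiles the stylesheet exactly once. The constructor accepts
   * any Reader, so the caller decides where the stylesheet comes from.
   */
  CompiledStylesheet(Reader xsl) throws TransformerConfigurationException {
    final Templates compiled =
        TransformerFactory.newInstance().newTemplates(new StreamSource(xsl));
    this.templates = compiled;
    this.transformer = ThreadLocal.withInitial(() -> {
      try {
        // Cheap: derived from the compiled Templates, no file access.
        return compiled.newTransformer();
      } catch (TransformerConfigurationException e) {
        throw new IllegalStateException("Cannot create Transformer", e);
      }
    });
  }

  /** Applies the stylesheet to an XML string; safe to call concurrently. */
  String transform(String xml) throws TransformerException {
    StringWriter out = new StringWriter();
    transformer.get().transform(
        new StreamSource(new StringReader(xml)), new StreamResult(out));
    return out.toString();
  }
}
```

In the plugin, the assumption is that setConf() would build such an object for 
every stylesheet referenced in the rules file, passing the Reader returned by 
{{conf.getConfResourceAsReader(...)}}, and that filter() would only call 
transform(). A broken stylesheet then fails during setConf() instead of in the 
middle of parsing a segment.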

> Generic xsl parser plugin
> -------------------------
>
>                 Key: NUTCH-1870
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1870
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>    Affects Versions: 1.9
>            Reporter: Albinscode
>             Fix For: 1.10
>
>         Attachments: NUTCH-1870-trunk-v3.patch, nutch-site.xml, 
> xsl-parse-plugin.patch, xsl-parse-plugin2.patch
>
>
> The aim of this plugin is to use XSLT to extract metadata from HTML DOM 
> structures.
> | Your Data | --> | Parse-html plugin  or TIKA plugin | --> | DOM structure | 
> --> |XSLT plugin |
>                   
>                   
> The main advantages are:
> - You won't have to produce any java code, only XSLT and configuration
> - It can process DOM structure from DocumentFragment (@see NekoHtml and @see 
> TagSoup)
> - It is HtmlParseFilter plugin compatible and can be plugged as any other 
> plugin (parse-js, parse-swf, etc...)
> This topic has been discussed on 
> http://www.mail-archive.com/dev%40nutch.apache.org/msg15257.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
