[
https://issues.apache.org/jira/browse/NUTCH-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205385#comment-14205385
]
Sebastian Nagel commented on NUTCH-1870:
----------------------------------------
Hi [~Albinscode], simple and funny example :)!
I've added a patch which
* includes boilerplate to build, test, generate javadoc
* makes the tests run (but only from src/plugin/parse-xsl via "ant test")
* various minor changes
* javadoc
** added package.info in org.apache.nutch.parse.xsl
** auto-generated JAXB packages are suppressed. Or do we need javadocs for
these classes?
* attribute "filterUrlsWithNoRule" belongs to the element "rules", right? ->
changed in the sample
The plugin is working now! I'll continue testing with more complex transforms
(to get the full power of XSL).
Meanwhile a few points which could require review or rework:
* load all configuration files from class path, e.g.
{code}
Reader reader = conf.getConfResourceAsReader(rulesFile);
{code}
That's important if Nutch is run via Hadoop: class and configuration files are
wrapped into one single job file. There are no "real" files which can be loaded.
This also applies to running the unit tests: we cannot rely on them being
executed from a specific working directory.
* reading config files on-demand and multiple times is not really efficient.
It's better to read and parse all configuration files during setConf(). Sorry,
maybe my previous comment was not 100% clear on this point, but setConf() should
be the best place:
** errors in configuration are caught early, and are less likely to be
overlooked than if they happen somewhere in the middle of parsing a segment
** inside setConf() you do not need to take care of thread-safety
** setConf() is called only once
** parsing should be fast and there is a strict timeout (30 sec. by default)
* regarding thread-safety: the trade-off should be minimal. Making RulesManager
a local variable seems too much and contradicts the previous point
(loading config files). Wouldn't it be sufficient to make only those objects
thread-local which are unsafe and need to be used from filter()? E.g.,
{{javax.xml.transform.Transformer}} is definitely not thread-safe (we need to
check other javax classes). But it should be possible to get a Transformer
without reading the xsl file again every time.
* what about fields with multiple values? An expression can match multiple
times, but it looks like only the first match is extracted.
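To illustrate the thread-safety point, here is a minimal sketch, not taken from the patch. It assumes the stylesheet is compiled once into a {{javax.xml.transform.Templates}} object (which JAXP specifies as thread-safe) and that each thread then gets its own Transformer from it. The class name CompiledStylesheet and its methods are hypothetical; the constructor only stands in for the work that setConf() would do.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper (not part of the patch): compile once, transform per thread. */
final class CompiledStylesheet {

  // Compiled form of the stylesheet; safe to share between threads.
  private final Templates templates;

  // Transformer is not thread-safe, so every thread keeps its own instance.
  private final ThreadLocal<Transformer> perThread = new ThreadLocal<Transformer>();

  /** Reads and compiles the stylesheet once; a broken stylesheet fails here, early. */
  CompiledStylesheet(Reader xsl) {
    try {
      this.templates =
          TransformerFactory.newInstance().newTemplates(new StreamSource(xsl));
    } catch (TransformerException e) {
      throw new IllegalArgumentException("Cannot compile stylesheet", e);
    }
  }

  /** Returns the calling thread's Transformer, creating it without re-reading the xsl. */
  private Transformer transformer() throws TransformerException {
    Transformer t = perThread.get();
    if (t == null) {
      t = templates.newTransformer();
      perThread.set(t);
    }
    return t;
  }

  /** Applies the stylesheet to an XML string and returns the serialized result. */
  String apply(String xml) {
    StringWriter out = new StringWriter();
    try {
      transformer().transform(
          new StreamSource(new StringReader(xml)), new StreamResult(out));
    } catch (TransformerException e) {
      throw new IllegalStateException("Transformation failed", e);
    }
    return out.toString();
  }
}
```

With something along these lines, the RulesManager could stay a shared field: only the Transformer is thread-local, and the xsl file is parsed a single time.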
> Generic xsl parser plugin
> -------------------------
>
> Key: NUTCH-1870
> URL: https://issues.apache.org/jira/browse/NUTCH-1870
> Project: Nutch
> Issue Type: New Feature
> Components: indexer, parser
> Affects Versions: 1.9
> Reporter: Albinscode
> Fix For: 1.10
>
> Attachments: NUTCH-1870-trunk-v3.patch, nutch-site.xml,
> xsl-parse-plugin.patch, xsl-parse-plugin2.patch
>
>
> The aim of this plugin is to use XSLT to extract metadata from HTML DOM
> structures.
> | Your Data | --> | Parse-html plugin or TIKA plugin | --> | DOM structure |
> --> | XSLT plugin |
>
>
> The main advantages are:
> - You won't have to produce any Java code, only XSLT and configuration
> - It can process DOM structure from DocumentFragment (@see NekoHtml and @see
> TagSoup)
> - It is HtmlParseFilter plugin compatible and can be plugged as any other
> plugin (parse-js, parse-swf, etc...)
> This topic has been discussed on
> http://www.mail-archive.com/dev%40nutch.apache.org/msg15257.html