Hello, and thank you for this review.

You pointed out thread-safety problems that I was not aware of, and you are right: as an optimization (to avoid creating a new transformer each time) I overlooked this thread-safety issue. Your review is really accurate, and I especially appreciate the list of links you provided to illustrate your remarks.

I'll update the code to follow your recommendations and wait for your next comments 😂.

Thanks again,
Albin

On 9 Oct 2014 21:52, "Sebastian Nagel (JIRA)" <[email protected]> wrote:
>
> [
> https://issues.apache.org/jira/browse/NUTCH-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165632#comment-14165632
> ]
>
> Sebastian Nagel commented on NUTCH-1870:
> ----------------------------------------
>
> Thanks, [~Albinscode], for the patch. Looks nice, code is well formatted,
> ... I'll continue testing, but a few first comments:
> * could load resources in setConf(conf) and not do it on-demand in the
>   "filter" method:
> ** setConf() is called early, so failures in reading configuration
>    resources are reported soon
> ** filter() may be called concurrently because for every plugin only one
>    instance is held per extension point
> * thread-safety: the filter() method must be thread-safe, and so must be
>   all used object instances.
>   [Transformer|http://docs.oracle.com/javase/7/docs/api/javax/xml/transform/Transformer.html]
>   instances are not safe and may not be shared by threads. That's also true
>   for other DOM/XML related classes, cf.
>   [1|http://mail-archives.apache.org/mod_mbox/nutch-user/201301.mbox/%[email protected]%3E],
>   [2|http://mail-archives.apache.org/mod_mbox/nutch-user/201301.mbox/%[email protected]%3E],
>   or NUTCH-1596. Possible solutions are, e.g., to make these variables local
>   or [thread local|http://docs.oracle.com/javase/7/docs/api/java/lang/ThreadLocal.html].
>
> > there are some unit tests strongly related to sites I'm crawling
> * would be better to take sample pages where we are sure not to violate
>   any copyright
>
> > Generic xsl parser plugin
> > -------------------------
> >
> > Key: NUTCH-1870
> > URL: https://issues.apache.org/jira/browse/NUTCH-1870
> > Project: Nutch
> > Issue Type: New Feature
> > Components: indexer, parser
> > Affects Versions: 1.9
> > Reporter: Albinscode
> > Fix For: 1.10
> >
> > Attachments: xsl-parse-plugin.patch
> >
> >
> > The aim of this plugin is to use XSLT to extract metadata from HTML DOM
> > structures.
> > | Your Data | --> | Parse-html plugin or TIKA plugin | --> | DOM structure | --> | XSLT plugin |
> >
> >
> > The main advantage is that:
> > - You won't have to produce any java code, only XSLT and configuration
> > - It can process DOM structure from DocumentFragment (@see NekoHtml and @see TagSoup)
> > - It is HtmlParseFilter plugin compatible and can be plugged as any other plugin (parse-js, parse-swf, etc...)
> > This topic has been discussed on http://www.mail-archive.com/dev%40nutch.apache.org/msg15257.html
> >
> >
> >
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
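
For reference, the thread-safety point raised in the review can be sketched roughly as follows. This is only an illustration, not the code from the attached patch: the class name XslSketch and its methods are hypothetical, and it uses only the JDK (no Nutch or Hadoop classes). It assumes one way of keeping the optimization: compile the stylesheet once into a javax.xml.transform.Templates object, which JAXP specifies as safe to share between threads, and create the Transformer as a local variable on every call, as the review suggests.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper for illustration; not part of the NUTCH-1870 patch. */
class XslSketch {

  // Compiled stylesheet. Unlike Transformer, a Templates instance
  // may be shared by concurrently running threads.
  private final Templates templates;

  // Compile the stylesheet once, up front, so that a broken stylesheet
  // fails early (the same idea as loading resources in setConf()).
  XslSketch(String xsl) throws TransformerException {
    TransformerFactory factory = TransformerFactory.newInstance();
    this.templates = factory.newTemplates(new StreamSource(new StringReader(xsl)));
  }

  // Can be called concurrently: the Transformer is a local variable,
  // so no Transformer instance is ever shared between threads.
  String transform(String xml) throws TransformerException {
    Transformer transformer = templates.newTransformer();
    StringWriter out = new StringWriter();
    transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
    return out.toString();
  }
}
```

Creating a Transformer from an already compiled Templates object is usually much cheaper than parsing the stylesheet again, so most of the original optimization should be preserved. If that per-call cost still matters, a ThreadLocal<Transformer> initialised from the same Templates object would be the other option mentioned in the review.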

