[
https://issues.apache.org/jira/browse/NUTCH-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche resolved NUTCH-1068.
----------------------------------
Resolution: Won't Fix
Marking it as won't fix. We currently use the BRICS lib as a dependency, and
this patch would mean maintaining our own copy of the code, i.e. more to look
after: we wouldn't benefit from others' improvements, we would introduce our
own bugs, and at the very least we would have to keep it in sync with the
Lucene version.
Have any of the improvements made by the Lucene people made it into the BRICS
lib since? Alternatively, we could look at using the modified version from
Lucene as a dependency; either way, that would be better than maintaining our
own copy.
> Automaton performance improvements based on Lucene code base
> ------------------------------------------------------------
>
> Key: NUTCH-1068
> URL: https://issues.apache.org/jira/browse/NUTCH-1068
> Project: Nutch
> Issue Type: Improvement
> Reporter: Kirby Bohling
> Attachments: automaton.diff
>
>
> The Lucene team maintains a modified Automaton library cut down to precisely
> what they need. It can have significant performance enhancements.
> I am attempting to backport and shepherd a patch for the original Automaton
> library.
> The original Lucene code is here:
> http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/
> The Lucene code is likely slightly faster, as it includes several
> micro-optimizations I removed to avoid having to request re-license
> permission. I would definitely performance test using the Lucene RegEx vs.
> the patched code. The Lucene code also uses code points, not characters,
> which might make a difference for UTF-16 vs. UTF-32 in obscure cases. (I
> believe the Lucene code builds a UTF-32 clean DFA for accuracy and then
> translates it to a UTF-8 DFA for performance, but I'm not 100% sure. I don't
> need or use any of that code, and am currently only worried about ASCII
> DFAs.)
> When making heavy use of the NFA-to-DFA transformation, I see a 4x speedup.
> Regular expression execution likely gets a 1.5-2x speedup, from what I can
> tell. The Nutch backend uses this code in a couple of places, and the patch
> would likely bring performance benefits in those areas.
> I will attach my backported version for the Automaton 1.11-7 release. While
> I don't own any of the copyright, all of the code is licensed under either
> the BSD license or the ASF 2.0 license, so it is pretty obviously approved
> for ASF usage. I am not checking that the patch is usable, as I'm not the
> copyright holder. If that is an issue, I'll say "yes"; I just don't believe I
> have any legal standing to do so. I don't want to create licensing issues for
> the ASF.
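
The issue description notes that the Lucene code works on code points rather than characters, which matters only in "obscure cases". A minimal sketch of such a case follows, using only the JDK. It does not touch the BRICS or Lucene automaton classes, and the class and method names (CodePointDemo, charUnits, codePoints, singleSymbolByChar, singleSymbolByCodePoint) are hypothetical, chosen for illustration. It shows that one supplementary character occupies two Java chars, so a matcher that steps one char at a time sees two symbols where a code-point-based matcher sees one.

```java
// Sketch only: illustrates UTF-16 code units vs. Unicode code points.
// Not taken from the BRICS automaton library or from Lucene.
final class CodePointDemo {

    // Number of UTF-16 code units (Java chars) in the string.
    static int charUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // A "match exactly one symbol" check when each symbol is one char.
    static boolean singleSymbolByChar(String s) {
        return charUnits(s) == 1;
    }

    // The same check when each symbol is one code point.
    static boolean singleSymbolByCodePoint(String s) {
        return codePoints(s) == 1;
    }

    public static void main(String[] args) {
        String ascii = "A";
        // U+1D504 lies outside the Basic Multilingual Plane, so UTF-16
        // encodes it as a surrogate pair (two chars).
        String supplementary = new String(Character.toChars(0x1D504));

        System.out.println(charUnits(ascii) + " " + codePoints(ascii));
        System.out.println(charUnits(supplementary) + " " + codePoints(supplementary));
        System.out.println(singleSymbolByChar(supplementary));
        System.out.println(singleSymbolByCodePoint(supplementary));
    }
}
```

For ASCII-only input, which is the case the reporter says he cares about, both views agree, so the difference should not affect ASCII DFAs.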