[
https://issues.apache.org/jira/browse/NUTCH-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kirby Bohling updated NUTCH-1068:
---------------------------------
Attachment: automaton.diff
I am not the copyright holder, so I don't believe I can grant a license. This
is all based upon code used or written by the Lucene project, so I believe
it is eligible for inclusion in ASF projects.
> Automaton performance improvements based on Lucene code base
> ------------------------------------------------------------
>
> Key: NUTCH-1068
> URL: https://issues.apache.org/jira/browse/NUTCH-1068
> Project: Nutch
> Issue Type: Improvement
> Reporter: Kirby Bohling
> Attachments: automaton.diff
>
>
> The Lucene team maintains a modified Automaton library cut down to precisely
> what they need. It includes significant performance enhancements.
> I am attempting to backport and shepherd a patch for the original Automaton
> library.
> The original Lucene code is here:
> http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/
> The Lucene code is likely slightly faster, as it includes several
> micro-optimizations that I removed to avoid having to request re-licensing
> permission. I would definitely performance-test the Lucene RegEx against the
> patched code. The Lucene code also uses code points rather than characters,
> which might make a difference for UTF-16 vs. UTF-32 in obscure cases. (I
> believe the Lucene code builds a UTF-32 clean DFA for accuracy and then
> translates it to a UTF-8 DFA for performance, but I'm not 100% sure. I don't
> need or use any of that code, and am currently only worried about ASCII DFAs.)
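To illustrate the code point vs. character distinction mentioned above, here is a
minimal sketch using only the Java standard library. The class and method names
are hypothetical and do not come from Automaton or Lucene. It only shows that a
character outside the Basic Multilingual Plane occupies two UTF-16 code units
(Java chars) but is a single code point, which is the kind of case where a
char-based DFA and a code-point-based DFA could behave differently.

```java
// Illustrative only: CodePointDemo is a hypothetical name, not part of Automaton or Lucene.
final class CodePointDemo {
    /** Number of UTF-16 code units (Java chars) in s. */
    static int utf16Units(String s) {
        return s.length();
    }

    /** Number of Unicode code points in s. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String ascii = "nutch";
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual Plane.
        String supplementary = new String(Character.toChars(0x1D11E));

        // For ASCII input the two counts agree: prints "5 5".
        System.out.println(utf16Units(ascii) + " " + codePoints(ascii));
        // For a supplementary character they differ: prints "2 1".
        System.out.println(utf16Units(supplementary) + " " + codePoints(supplementary));
    }
}
```

For pure ASCII input the two views coincide, which is consistent with the
difference only mattering in obscure cases.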
> When making heavy use of the NFA-to-DFA transformation, I see a 4x speedup.
> From what I can tell, regular expression execution is likely 1.5-2x faster.
> The Nutch backend uses this code in a couple of places, so the patch would
> likely bring performance benefits in those areas.
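For readers unfamiliar with the NFA-to-DFA transformation referred to above, the
following is a minimal sketch of the textbook subset construction. It is not the
code from the patch, from Automaton, or from Lucene; every class and method name
here is hypothetical, and the NFA is assumed to have no epsilon transitions. It
is meant only to show where the work goes: each DFA state is a set of NFA
states, so the cost depends heavily on how those sets are built and looked up.

```java
import java.util.*;

// Illustrative only: a hypothetical, simplified model, not the Automaton or Lucene API.
final class SubsetConstruction {
    /** NFA without epsilon moves: trans.get(state).get(symbol) is the set of next states. */
    static final class Nfa {
        final List<Map<Character, Set<Integer>>> trans = new ArrayList<>();
        final Set<Integer> accept = new HashSet<>();

        int addState() {
            trans.add(new HashMap<>());
            return trans.size() - 1;
        }

        void addEdge(int from, char c, int to) {
            trans.get(from).computeIfAbsent(c, k -> new TreeSet<>()).add(to);
        }
    }

    /** DFA: trans.get(state).get(symbol) is the single next state; state 0 is the start. */
    static final class Dfa {
        final List<Map<Character, Integer>> trans = new ArrayList<>();
        final Set<Integer> accept = new HashSet<>();

        boolean run(String input) {
            int state = 0;
            for (char c : input.toCharArray()) {
                Integer next = trans.get(state).get(c);
                if (next == null) return false; // no transition: reject
                state = next;
            }
            return accept.contains(state);
        }
    }

    /** Subset construction: every reachable set of NFA states becomes one DFA state. */
    static Dfa determinize(Nfa nfa, int start) {
        Dfa dfa = new Dfa();
        Map<Set<Integer>, Integer> ids = new HashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();

        Set<Integer> init = new TreeSet<>(Collections.singleton(start));
        ids.put(init, 0);
        dfa.trans.add(new HashMap<>());
        work.add(init);

        while (!work.isEmpty()) {
            Set<Integer> current = work.poll();
            int id = ids.get(current);

            // A DFA state accepts if any of its NFA states accepts.
            for (int s : current) {
                if (nfa.accept.contains(s)) dfa.accept.add(id);
            }

            // Union the targets of all member states, grouped by input symbol.
            Map<Character, Set<Integer>> moves = new TreeMap<>();
            for (int s : current) {
                for (Map.Entry<Character, Set<Integer>> e : nfa.trans.get(s).entrySet()) {
                    moves.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
                }
            }

            for (Map.Entry<Character, Set<Integer>> e : moves.entrySet()) {
                Set<Integer> target = e.getValue();
                Integer targetId = ids.get(target);
                if (targetId == null) { // first time this set of NFA states is seen
                    targetId = dfa.trans.size();
                    ids.put(target, targetId);
                    dfa.trans.add(new HashMap<>());
                    work.add(target);
                }
                dfa.trans.get(id).put(e.getKey(), targetId);
            }
        }
        return dfa;
    }

    /** Example NFA for (a|b)*ab with start state 0. */
    static Nfa exampleNfa() {
        Nfa nfa = new Nfa();
        int s0 = nfa.addState(), s1 = nfa.addState(), s2 = nfa.addState();
        nfa.addEdge(s0, 'a', s0);
        nfa.addEdge(s0, 'b', s0);
        nfa.addEdge(s0, 'a', s1);
        nfa.addEdge(s1, 'b', s2);
        nfa.accept.add(s2);
        return nfa;
    }

    public static void main(String[] args) {
        Dfa dfa = determinize(exampleNfa(), 0);
        System.out.println(dfa.trans.size()); // prints 3
        System.out.println(dfa.run("aab"));   // prints true
        System.out.println(dfa.run("aba"));   // prints false
    }
}
```

The map lookup keyed by a whole set of NFA states is the hot spot in this naive
version, and it is the sort of step an optimized implementation would be
expected to speed up.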
> I will attach my backported version for the Automaton 1.11-7 release. While
> I don't own any of the copyright, all of the code is licensed under either
> the BSD license or the Apache License 2.0, so it is pretty obviously approved
> for ASF usage. I am not checking the option that marks the patch as usable,
> because I'm not the copyright holder. If that is an issue, I'll say "yes"; I
> just don't believe I have any legal standing to do so. I don't want to create
> licensing issues for the ASF.