[ https://issues.apache.org/jira/browse/NUTCH-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149214#comment-13149214 ]
Lewis John McGibbney commented on NUTCH-1068:
---------------------------------------------

Hi Kirby. I understand that this was a while ago now, but as no one has commented I thought we may as well keep something moving after our conversation on the dev lists. Can you explain how you propose to integrate this into the Nutch code? I am unsure where to start, as it is a GitHub patch. It's also a huge patch. The performance stuff you mention sounds appealing, but I really don't know enough just now, especially as I can't use this patch with trunk code.

Thank you

> Automaton performance improvements based on Lucene code base
> ------------------------------------------------------------
>
>                 Key: NUTCH-1068
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1068
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Kirby Bohling
>         Attachments: automaton.diff
>
>
> The Lucene team maintains a modified Automaton library cut down to precisely
> what they need. It can have significant performance enhancements.
> I am attempting to backport and shepherd a patch for the original Automaton
> library.
> The original Lucene code is here:
> http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/
> The Lucene code is likely slightly faster, as it includes several
> micro-optimizations I removed to avoid having to request re-license
> permission. I would definitely performance test using the Lucene RegEx vs.
> the patched code. The Lucene code also uses code points, not characters,
> which might make a difference for UTF-16 vs. UTF-32 in obscure cases (I
> believe the Lucene code builds a UTF-32 clean DFA for accuracy, and then
> translates it to a UTF-8 DFA for performance, but I'm not 100% sure. I don't
> need/use any of that code, and am currently only worried about ASCII DFAs).
> When making heavy use of the NFA-to-DFA transformation, I see a 4x speedup.
> It likely has a 1.5-2x speedup for regular expression execution from what I
> can tell. The Nutch backend uses this code in a couple of places, and it
> likely would lead to performance benefits for those areas.
> I will attach my backported version for the Automaton 1.11-7 release. While
> I don't own any of the copyright, all of the code is copyrighted under the
> BSD license, or the ASF 2.0 license. It is pretty obviously approved for ASF
> usage. I am not checking that the patch is usable as I'm not the copyright
> holder. If that is an issue, I'll say "yes", I just don't believe I have any
> legal standing to do so. I don't want to create licensing issues for the ASF.
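
For readers unfamiliar with the code point vs. character distinction raised in the issue description, here is a minimal Java sketch of why it can matter for inputs outside the Basic Multilingual Plane. It uses only standard JDK calls. The class and method names (CodePointDemo, utf16Units, codePoints) are hypothetical and are not part of Nutch, Lucene, or the Automaton library. It says nothing about how either library actually builds its DFAs.

```java
// Hypothetical demo class; not part of Nutch, Lucene, or the Automaton library.
class CodePointDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef) lies outside the Basic Multilingual Plane,
        // so UTF-16 encodes it as a surrogate pair of two chars.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(utf16Units(clef)); // prints 2
        System.out.println(codePoints(clef)); // prints 1

        // For plain ASCII input the two counts agree.
        System.out.println(utf16Units("nutch") + " " + codePoints("nutch")); // prints 5 5
    }
}
```

A matcher that steps over chars sees two symbols for the clef, while one that steps over code points sees one. For ASCII-only patterns and input, which is the case the reporter says he cares about, the two views coincide.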