https://issues.apache.org/jira/browse/NUTCH-1068
Issue created, patch attached. Once I hear back from the author about getting it included in the upstream library, I'll update the issue.

I'm really not able to pursue this directly, as I'm not much of a Nutch user at the moment. I've lurked on the list because there is some good info, and I previously used Nutch as part of an R&D project at work. I use Lucene and the Automaton library quite a bit, and found out about the Automaton library here. It's been a great find for us, so hopefully this is a way I can contribute back. Either way, the ASF likely already has better code that Nutch could just pick up.

I wish the Lucene guys would peel these utility parts out into a separate library. I have several places it'd be useful, where I really have no need for all of the core Lucene. (Also, I use a 3.x version in my project, and this code is only in the 4.x branch, so until that's released, I'll have to maintain it myself.)

Kirby

On Mon, Jul 25, 2011 at 3:35 AM, Julien Nioche <[email protected]> wrote:
> Hi Kirby,
>
> Thanks for sharing this. It is definitely relevant for Nutch and I am sure
> that there would be quite a few people interested in giving it a try.
> Let's hope that this patch gets into the original library or that the
> Lucene people ship it in a separate jar; in the meantime your patch would
> help comparing performances. Could you please open a new issue on JIRA and
> include the patch + description? It will be easier to comment and track its
> progress.
>
> Thanks a lot
>
> Julien
>
>
> On 25 July 2011 05:01, Kirby Bohling <[email protected]> wrote:
>
>> All,
>>
>> Not sure how much you guys care, but the Lucene folks (specifically
>> rmuir and mikemcand), made some fairly significant performance speed
>> ups to the Automaton library while working on the Lucene Fuzzy
>> matching optimizations for the 4.0 release.
>> I've backported them to
>> the Automaton library and am trying to get them integrated into the
>> mainline library (with permission from the Lucene devs). I haven't
>> heard back from the Automaton author, but I figured that enough folks
>> have made noise about how nice the performance boost of using Automaton
>> vs. RegEx is, that Nutch itself might want to integrate these types of
>> changes, or re-use the ones from Lucene.
>>
>> The best version of the code itself is here:
>>
>> http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/
>>
>> Nutch would likely only use 1/2-2/3 of those files (only the stuff
>> required to build RegExp).
>>
>> The patch I applied to the latest Automaton library is attached if
>> anybody wants to rebuild and test. In some mainline code that does a
>> _lot_ of NFA-to-DFA translation, it is a 4x speed up. For the actual
>> execution of the DFAs, I'm not sure how much faster it actually is (I
>> think 1.5-2.0x as fast). My patch doesn't include the UTF-32 fixes in
>> the Lucene version (the Lucene code also converts the UTF-32 to UTF-8
>> representation, and uses several Lucene internal implementations of
>> memory growth, sorting, etc.). It is unfortunate that the Lucene
>> version isn't broken out into a utility jar to be re-used. Lucene has
>> several really nice, high-performance, non-trivial but highly useful CS
>> data structure implementations.
>>
>> My patch itself applies to the latest Automaton library (1.11-7 as of
>> this writing). If it is better to use the original Automaton library.
>> One annoyance of the Automaton library is that you have to submit
>> personal info to get the source, but it is all BSD licensed. No
>> public repo of source.
>>
>> It might be worthwhile to port the plugins using the automaton
>> library to use the version from Lucene or one with the patch applied
>> and test the performance.
>>
>> Thanks,
>> Kirby
>>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>

