https://issues.apache.org/jira/browse/NUTCH-1068

Issue created, patch attached.  Once I hear back from the author about
getting it included in the upstream library, I'll update the issue.  I'm
really not able to pursue it directly, as I'm not much of a Nutch user at the
moment.  I've lurked on the list because there is some good info, and I
previously used Nutch as part of an R&D project at work.  I use Lucene and
the Automaton library quite a bit, and found out about the Automaton library
here.  It's been a great find for us, so hopefully this is a way I can
contribute back.  Either way, the ASF likely already has better code that
Nutch could just pick up.

I wish the Lucene guys would peel these utility parts out into a separate
library.  I have several places where it'd be useful and where I really have
no need for all of core Lucene.  Also, I use a 3.x version in my project, and
this code is only in the 4.x branch, so until that's released, I have to
maintain it myself.

Kirby


On Mon, Jul 25, 2011 at 3:35 AM, Julien Nioche <
[email protected]> wrote:

> Hi Kirby,
>
> Thanks for sharing this. It is definitely relevant for Nutch and I am sure
> that there would be quite a few people interested in giving it a try.
> Let's hope that this patch gets into the original library or that the
> Lucene people ship it in a separate jar. In the meantime, your patch would
> help compare performance. Could you please open a new issue on JIRA and
> include the patch + description? It will be easier to comment and track its
> progress.
>
> Thanks a lot
>
> Julien
>
>
> On 25 July 2011 05:01, Kirby Bohling <[email protected]> wrote:
>
>> All,
>>
>>   Not sure how much you guys care, but the Lucene folks (specifically
>> rmuir and mikemcand) made some fairly significant performance
>> improvements to the Automaton library while working on the Lucene fuzzy
>> matching optimizations for the 4.0 release.  I've backported them to
>> the Automaton library and am trying to get them integrated into the
>> mainline library (with permission from the Lucene devs).  I haven't
>> heard back from the Automaton author, but I figured that enough folks
>> have made noise about how nice the performance boost of using Automaton
>> vs. RegEx is that Nutch itself might want to integrate these types of
>> changes, or re-use the ones from Lucene.
>>
>>   The best version of the code itself is here:
>>
>>
>> http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/
>>
>> Nutch would likely only use 1/2-2/3 of those files (only the stuff
>> required to build RegExp).
>>
>> The patch I applied to the latest Automaton library is attached if
>> anybody wants to rebuild and test.  In some mainline code that does a
>> _lot_ of NFA-to-DFA translation, it is a 4x speed up.  For the actual
>> execution of the DFAs, I'm not sure how much faster it actually is (I
>> think 1.5-2.0x as fast).  My patch doesn't include the UTF-32 fixes in
>> the Lucene version (the Lucene code also converts the UTF-32 to a UTF-8
>> representation, and uses several Lucene internal implementations of
>> memory growth, sorting, etc.).  It is unfortunate that the Lucene
>> version isn't broken out into a utility jar to be re-used.  Lucene has
>> several really nice high performance non-trivial, but highly useful CS
>> data structure implementations.
>>
>> My patch itself applies to the latest Automaton library (1.11-7 as of
>> this writing), in case it is better to use the original Automaton library.
>> One annoyance of the Automaton library is that you have to submit
>> personal info to get the source, but it is all BSD licensed.  There is no
>> public repo of the source.
>>
>> It might be worthwhile to port the plugins using the automaton
>> library to use the version from Lucene or one with the patch applied
>> and test the performance.
>>
>> Thanks,
>>    Kirby
>>
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
