Re: Automaton improvements

Dawid Weiss Mon, 25 Jul 2011 01:52:56 -0700

I don't think this will make it into a separate library, Julien. It's a port
of brics, done specifically to fit Lucene's internal needs. If anything, I
would just make Nutch depend on Lucene -- this would provide more stable
updates.


Dawid

On Mon, Jul 25, 2011 at 10:35 AM, Julien Nioche <
[email protected]> wrote:

> Hi Kirby,
>
> Thanks for sharing this. It is definitely relevant for Nutch and I am sure
> that there would be quite a few people interested in giving it a try.
> Let's hope that this patch gets into the original library or that the
> Lucene people ship it in a separate jar. In the meantime, your patch would
> help compare performance. Could you please open a new issue on JIRA and
> include the patch + description? It will be easier to comment on and track
> its progress.
>
> Thanks a lot
>
> Julien
>
>
> On 25 July 2011 05:01, Kirby Bohling <[email protected]> wrote:
>
>> All,
>>
>>   Not sure how much you guys care, but the Lucene folks (specifically
>> rmuir and mikemcand) made some fairly significant speedups to the
>> Automaton library while working on the Lucene fuzzy matching
>> optimizations for the 4.0 release.  I've backported them to the
>> Automaton library and am trying to get them integrated into the
>> mainline library (with permission from the Lucene devs).  I haven't
>> heard back from the Automaton author, but I figured that enough folks
>> have made noise about how nice the performance boost of using Automaton
>> vs. RegEx is that Nutch itself might want to integrate these types of
>> changes, or re-use the ones from Lucene.
>>
>>   The best version of the code itself is here:
>>
>>
>> http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/
>>
>> Nutch would likely only use 1/2-2/3 of those files (only the stuff
>> required to build RegExp).
>>
>> The patch I applied to the latest Automaton library is attached if
>> anybody wants to rebuild and test.  In some mainline code that does a
>> _lot_ of NFA-to-DFA translation, it is a 4x speedup.  For the actual
>> execution of the DFAs, I'm not sure how much faster it actually is (I
>> think 1.5-2.0x as fast).  My patch doesn't include the UTF-32 fixes in
>> the Lucene version (the Lucene code also converts the UTF-32 to UTF-8
>> representation, and uses several Lucene-internal implementations of
>> memory growth, sorting, etc.).  It is unfortunate that the Lucene
>> version isn't broken out into a utility jar to be re-used.  Lucene has
>> several really nice, high-performance implementations of non-trivial
>> but highly useful CS data structures.
>>
>> My patch itself applies to the latest Automaton library (1.11-7 as of
>> this writing), in case it is better to use the original Automaton library.
>>  One annoyance of the Automaton library is that you have to submit
>> personal info to get the source, but it is all BSD licensed.  No
>> public repo of source.
>>
>> It might be worthwhile to port the plugins using the automaton
>> library to use the version from Lucene or one with the patch applied
>> and test the performance.
>>
>> Thanks,
>>    Kirby
>>
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
