Re: Automaton improvements

Julien Nioche Mon, 25 Jul 2011 02:07:43 -0700

Hi Dawid,

This was a bit of wishful thinking indeed :-) With a bit of luck, the
improvements will be added to brics but, as you pointed out, we can always
use the Lucene jar anyway.


BTW you are too modest: you should have pointed to the video of your talk in
Berlin, http://vimeo.com/26517310, which is both informative and entertaining.

Thanks

Julien

On 25 July 2011 09:51, Dawid Weiss <[email protected]> wrote:

>
> I don't think this will make it into a separate library, Julien. It's a
> port of brics and done specifically so that it fits Lucene's internal needs.
> If anything, I would just make Nutch require Lucene as a dependency -- this
> would provide more stable updates.
>
> Dawid
>
>
> On Mon, Jul 25, 2011 at 10:35 AM, Julien Nioche <
> [email protected]> wrote:
>
>> Hi Kirby,
>>
>> Thanks for sharing this. It is definitely relevant for Nutch and I am sure
>> that there would be quite a few people interested in giving it a try.
>> Let's hope that this patch gets into the original library or that the
>> Lucene people ship it in a separate jar. In the meantime, your patch would
>> help compare performance. Could you please open a new issue on JIRA and
>> include the patch + description? It will be easier to comment and track its
>> progress.
>>
>> Thanks a lot
>>
>> Julien
>>
>>
>> On 25 July 2011 05:01, Kirby Bohling <[email protected]> wrote:
>>
>>> All,
>>>
>>>   Not sure how much you guys care, but the Lucene folks (specifically
>>> rmuir and mikemcand) made some fairly significant performance
>>> improvements to the Automaton library while working on the Lucene fuzzy
>>> matching optimizations for the 4.0 release.  I've backported them to
>>> the Automaton library and am trying to get them integrated into the
>>> mainline library (with permission from the Lucene devs).  I haven't
>>> heard back from the Automaton author, but I figured that enough folks
>>> have made noise about how nice the performance boost of using Automaton
>>> vs. RegEx is that Nutch itself might want to integrate these types of
>>> changes, or re-use the ones from Lucene.
>>>
>>>   The best version of the code itself is here:
>>>
>>>
>>> http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/
>>>
>>> Nutch would likely only use 1/2-2/3 of those files (only the stuff
>>> required to build RegExp).
>>>
>>> The patch I applied to the latest Automaton library is attached if
>>> anybody wants to rebuild and test.  In some mainline code that does a
>>> _lot_ of NFA-to-DFA translation, it is a 4x speedup.  For the actual
>>> execution of the DFAs, I'm not sure how much faster it actually is (I
>>> think 1.5-2.0x as fast).  My patch doesn't include the UTF-32 fixes in
>>> the Lucene version (the Lucene code also converts the UTF-32 to a UTF-8
>>> representation, and uses several Lucene-internal implementations of
>>> memory growth, sorting, etc.).  It is unfortunate that the Lucene
>>> version isn't broken out into a utility jar to be re-used.  Lucene has
>>> several really nice, high-performance implementations of non-trivial but
>>> highly useful CS data structures.
>>>
>>> My patch itself applies to the latest Automaton library (1.11-7 as of
>>> this writing), in case it is better to use the original Automaton library.
>>>  One annoyance of the Automaton library is that you have to submit
>>> personal info to get the source, though it is all BSD licensed.  There is
>>> no public source repo.
>>>
>>> It might be worthwhile to port the plugins using the automaton
>>> library to use the version from Lucene or one with the patch applied
>>> and test the performance.
>>>
>>> Thanks,
>>>    Kirby
>>>
>>
>>
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>>
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
