Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

Michael McCandless Wed, 15 Feb 2017 02:26:48 -0800

We may be able to make DaciukMihovAutomatonBuilder's state registry
more ram efficient too ... I think it's essentially the same thing as
the FST.Builder's NodeHash, just minus the outputs that FSTs have vs
automata.


Mike McCandless

http://blog.mikemccandless.com


On Wed, Feb 15, 2017 at 5:14 AM, Dawid Weiss <[email protected]> wrote:
> You could try using morfologik's byte-based implementation:
>
> https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/FSABuilderTest.java
>
> I can't guarantee it'll be fast enough -- you need to sort those input
> sequences and even this may take a while. The construction of the
> automaton after that is fairly fast. What are the time limits you have
> with respect to input data sizes? Perhaps it's just unrealistic to
> assume everything is performed as part of a single request?
>
> Dawid
>
> On Wed, Feb 15, 2017 at 11:05 AM, Oliver Mannion <[email protected]> wrote:
>> Hi Mike,
>>
>> Thanks for the suggestion, I've tried Operations.run on a Automaton and
>> it's fast enough for my use case.
>>
>> However, the real problem I have is in building the Automaton
>> via DaciukMihovAutomatonBuilder. On my input string set it consumes quite a
>> bit of CPU, a lot of which seems to be GC activity, cleaning up about 800mb
>> of garbage produced during build (DaciukMihovAutomatonBuilder.States I
>> think). I'm using this in an app that also serves realtime requests, which
>> timeout during building of the Automaton. It's an inherited application
>> design and not the best (ideally I'd be building the Automaton offline in
>> another process) and I don't expect it's a use case considered for
>> DaciukMihovAutomatonBuilder. Unfortunately it means I've started looking
>> into other data structures for doing prefix matching. The dk.brics
>> Automaton has a similar performance profile, while the PatriciaTrie from
>> Apache Commons Collections seems to consume less CPU and produce less
>> garbage during build, although it has a less than ideal interface (it's a
>> Map).
>>
>> Any further suggestion though would be most welcome!
>>
>> Regards,
>>
>> Oliver
>>
>> On 14 February 2017 at 22:56, Michael McCandless <[email protected]>
>> wrote:
>>
>>> Wow, 2G heap, that's horrible!
>>>
>>> How much heap does the automaton itself take?
>>>
>>> You can use the automaton's step method to transition from a state
>>> given the next input character to another state (or -1 if that state
>>> doesn't accept that character); it will be slower than the 2 GB run
>>> automaton, but perhaps fast enough?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion <[email protected]>
>>> wrote:
>>> > Thanks Mike for getting back to me, sounds like I'm on the right track.
>>> >
>>> > I'm building the automaton from around 1.7million strings, and it ends up
>>> > with about 3.8million states and it turns out building a
>>> > CharacterRunAutomaton from that takes up about 2gig of heap (I was quite
>>> > suprised!), with negligible performance difference at run time. At your
>>> > suggestion I tried the ByteRunAutomaton and it was similar to the
>>> > CharacterRunAutomaton
>>> > in terms of heap and run time.  So for now I'm going to just stick to an
>>> > Automaton.
>>> >
>>> > On 14 February 2017 at 00:41, Michael McCandless <
>>> [email protected]>
>>> > wrote:
>>> >
>>> >> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <[email protected]>
>>> >> wrote:
>>> >>
>>> >> > I'd like to construct an Automaton to prefix match against a large
>>> set of
>>> >> > strings. I gather a RunAutomation is immutable, thread safe and faster
>>> >> than
>>> >> > Automaton.
>>> >>
>>> >> That's correct.
>>> >>
>>> >> > Are there any other differences between the three Automaton
>>> >> > classes, for example, in memory usage?
>>> >>
>>> >> CompiledAutomaton is just a thin wrapper to hold onto both the
>>> >> original Automaton and the RunAutomaton, plus some other corner-casey
>>> >> things that are likely not interesting for your usage.
>>> >>
>>> >> > Would the general approach for such a problem be to use
>>> >> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted
>>> list
>>> >> of
>>> >> > my string set, set all the states to accept (to enable prefix
>>> matching),
>>> >> > then pass the Automaton into the constructor of a
>>> CharacterRunAutomaton,
>>> >> > and use the run method on the CharacterRunAutomaton to match any
>>> queries?
>>> >>
>>> >> That sounds right.
>>> >>
>>> >> You could also try doing everything in UTF8 space instead, and use
>>> >> ByteRunAutomaton: it will likely be faster since it will do faster
>>> >> lookups on each transition.  It should still be safe to set all states
>>> >> as accept, even though some of those states will be inside a single
>>> >> Unicode character, as long as the strings you test against are whole
>>> >> UTF-8 sequences?
>>> >>
>>> >> > As it seems like I'm building up the Automaton at least three times,
>>> and
>>> >> > keeping a reference to the Automaton in the CharacterRunAutomaton, is
>>> >> this
>>> >> > the most memory efficient way of building such an Automaton?
>>> >>
>>> >> Yeah, it is.  The RunAutomaton will likely require substantial heap in
>>> >> your case, probably more than the original automaton.
>>> >>
>>> >> I suppose you don't actually need to keep the Automaton around once
>>> >> the RunAutomaton is built, but Lucene doesn't make this possible
>>> >> today, since the RunAutomaton holds onto the Automaton...
>>> >>
>>> >> > Thanks in advance,
>>> >>
>>> >> You're welcome!
>>> >>
>>> >> Mike McCandless
>>> >>
>>> >> http://blog.mikemccandless.com
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: [email protected]
>>> >> For additional commands, e-mail: [email protected]
>>> >>
>>> >>
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

Reply via email to