Yep, true. I just wonder whether it's worth complicating the code...
Could be easier to build an FST<Void> and then recreate a RunAutomaton
from that directly... :)

Dawid

On Wed, Feb 15, 2017 at 11:26 AM, Michael McCandless
<luc...@mikemccandless.com> wrote:
> We may be able to make DaciukMihovAutomatonBuilder's state registry
> more ram efficient too ... I think it's essentially the same thing as
> the FST.Builder's NodeHash, just minus the outputs that FSTs have vs
> automata.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Feb 15, 2017 at 5:14 AM, Dawid Weiss <dawid.we...@gmail.com> wrote:
>> You could try using morfologik's byte-based implementation:
>>
>> https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/FSABuilderTest.java
>>
>> I can't guarantee it'll be fast enough -- you need to sort those input
>> sequences and even this may take a while. The construction of the
>> automaton after that is fairly fast. What are the time limits you have
>> with respect to input data sizes? Perhaps it's just unrealistic to
>> assume everything is performed as part of a single request?
>>
>> Dawid
>>
>> On Wed, Feb 15, 2017 at 11:05 AM, Oliver Mannion <o.mann...@gmail.com> wrote:
>>> Hi Mike,
>>>
>>> Thanks for the suggestion, I've tried Operations.run on a Automaton and
>>> it's fast enough for my use case.
>>>
>>> However, the real problem I have is in building the Automaton
>>> via DaciukMihovAutomatonBuilder. On my input string set it consumes quite a
>>> bit of CPU, a lot of which seems to be GC activity, cleaning up about 800mb
>>> of garbage produced during build (DaciukMihovAutomatonBuilder.States I
>>> think). I'm using this in an app that also serves realtime requests, which
>>> timeout during building of the Automaton. It's an inherited application
>>> design and not the best (ideally I'd be building the Automaton offline in
>>> another process) and I don't expect it's a use case considered for
>>> DaciukMihovAutomatonBuilder. Unfortunately it means I've started looking
>>> into other data structures for doing prefix matching. The dk.brics
>>> Automaton has a similar performance profile, while the PatriciaTrie from
>>> Apache Commons Collections seems to consume less CPU and produce less
>>> garbage during build, although it has a less than ideal interface (it's a
>>> Map).
>>>
>>> Any further suggestion though would be most welcome!
>>>
>>> Regards,
>>>
>>> Oliver
>>>
>>> On 14 February 2017 at 22:56, Michael McCandless <luc...@mikemccandless.com>
>>> wrote:
>>>
>>>> Wow, 2G heap, that's horrible!
>>>>
>>>> How much heap does the automaton itself take?
>>>>
>>>> You can use the automaton's step method to transition from a state
>>>> given the next input character to another state (or -1 if that state
>>>> doesn't accept that character); it will be slower than the 2 GB run
>>>> automaton, but perhaps fast enough?
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>>
>>>> On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion <o.mann...@gmail.com>
>>>> wrote:
>>>> > Thanks Mike for getting back to me, sounds like I'm on the right track.
>>>> >
>>>> > I'm building the automaton from around 1.7million strings, and it ends up
>>>> > with about 3.8million states and it turns out building a
>>>> > CharacterRunAutomaton from that takes up about 2gig of heap (I was quite
>>>> > suprised!), with negligible performance difference at run time. At your
>>>> > suggestion I tried the ByteRunAutomaton and it was similar to the
>>>> > CharacterRunAutomaton
>>>> > in terms of heap and run time.  So for now I'm going to just stick to an
>>>> > Automaton.
>>>> >
>>>> > On 14 February 2017 at 00:41, Michael McCandless <
>>>> luc...@mikemccandless.com>
>>>> > wrote:
>>>> >
>>>> >> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <o.mann...@gmail.com>
>>>> >> wrote:
>>>> >>
>>>> >> > I'd like to construct an Automaton to prefix match against a large
>>>> set of
>>>> >> > strings. I gather a RunAutomation is immutable, thread safe and faster
>>>> >> than
>>>> >> > Automaton.
>>>> >>
>>>> >> That's correct.
>>>> >>
>>>> >> > Are there any other differences between the three Automaton
>>>> >> > classes, for example, in memory usage?
>>>> >>
>>>> >> CompiledAutomaton is just a thin wrapper to hold onto both the
>>>> >> original Automaton and the RunAutomaton, plus some other corner-casey
>>>> >> things that are likely not interesting for your usage.
>>>> >>
>>>> >> > Would the general approach for such a problem be to use
>>>> >> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted
>>>> list
>>>> >> of
>>>> >> > my string set, set all the states to accept (to enable prefix
>>>> matching),
>>>> >> > then pass the Automaton into the constructor of a
>>>> CharacterRunAutomaton,
>>>> >> > and use the run method on the CharacterRunAutomaton to match any
>>>> queries?
>>>> >>
>>>> >> That sounds right.
>>>> >>
>>>> >> You could also try doing everything in UTF8 space instead, and use
>>>> >> ByteRunAutomaton: it will likely be faster since it will do faster
>>>> >> lookups on each transition.  It should still be safe to set all states
>>>> >> as accept, even though some of those states will be inside a single
>>>> >> Unicode character, as long as the strings you test against are whole
>>>> >> UTF-8 sequences?
>>>> >>
>>>> >> > As it seems like I'm building up the Automaton at least three times,
>>>> and
>>>> >> > keeping a reference to the Automaton in the CharacterRunAutomaton, is
>>>> >> this
>>>> >> > the most memory efficient way of building such an Automaton?
>>>> >>
>>>> >> Yeah, it is.  The RunAutomaton will likely require substantial heap in
>>>> >> your case, probably more than the original automaton.
>>>> >>
>>>> >> I suppose you don't actually need to keep the Automaton around once
>>>> >> the RunAutomaton is built, but Lucene doesn't make this possible
>>>> >> today, since the RunAutomaton holds onto the Automaton...
>>>> >>
>>>> >> > Thanks in advance,
>>>> >>
>>>> >> You're welcome!
>>>> >>
>>>> >> Mike McCandless
>>>> >>
>>>> >> http://blog.mikemccandless.com
>>>> >>
>>>> >> ---------------------------------------------------------------------
>>>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>> >>
>>>> >>
>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to