Yep, true. I just wonder whether it's worth complicating the code... Could be easier to build an FST<Void> and then recreate a RunAutomaton from that directly... :)
Dawid On Wed, Feb 15, 2017 at 11:26 AM, Michael McCandless <luc...@mikemccandless.com> wrote: > We may be able to make DaciukMihovAutomatonBuilder's state registry > more ram efficient too ... I think it's essentially the same thing as > the FST.Builder's NodeHash, just minus the outputs that FSTs have vs > automata. > > Mike McCandless > > http://blog.mikemccandless.com > > > On Wed, Feb 15, 2017 at 5:14 AM, Dawid Weiss <dawid.we...@gmail.com> wrote: >> You could try using morfologik's byte-based implementation: >> >> https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/FSABuilderTest.java >> >> I can't guarantee it'll be fast enough -- you need to sort those input >> sequences and even this may take a while. The construction of the >> automaton after that is fairly fast. What are the time limits you have >> with respect to input data sizes? Perhaps it's just unrealistic to >> assume everything is performed as part of a single request? >> >> Dawid >> >> On Wed, Feb 15, 2017 at 11:05 AM, Oliver Mannion <o.mann...@gmail.com> wrote: >>> Hi Mike, >>> >>> Thanks for the suggestion, I've tried Operations.run on a Automaton and >>> it's fast enough for my use case. >>> >>> However, the real problem I have is in building the Automaton >>> via DaciukMihovAutomatonBuilder. On my input string set it consumes quite a >>> bit of CPU, a lot of which seems to be GC activity, cleaning up about 800mb >>> of garbage produced during build (DaciukMihovAutomatonBuilder.States I >>> think). I'm using this in an app that also serves realtime requests, which >>> timeout during building of the Automaton. It's an inherited application >>> design and not the best (ideally I'd be building the Automaton offline in >>> another process) and I don't expect it's a use case considered for >>> DaciukMihovAutomatonBuilder. Unfortunately it means I've started looking >>> into other data structures for doing prefix matching. The dk.brics >>> Automaton has a similar performance profile, while the PatriciaTrie from >>> Apache Commons Collections seems to consume less CPU and produce less >>> garbage during build, although it has a less than ideal interface (it's a >>> Map). >>> >>> Any further suggestion though would be most welcome! >>> >>> Regards, >>> >>> Oliver >>> >>> On 14 February 2017 at 22:56, Michael McCandless <luc...@mikemccandless.com> >>> wrote: >>> >>>> Wow, 2G heap, that's horrible! >>>> >>>> How much heap does the automaton itself take? >>>> >>>> You can use the automaton's step method to transition from a state >>>> given the next input character to another state (or -1 if that state >>>> doesn't accept that character); it will be slower than the 2 GB run >>>> automaton, but perhaps fast enough? >>>> >>>> Mike McCandless >>>> >>>> http://blog.mikemccandless.com >>>> >>>> >>>> On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion <o.mann...@gmail.com> >>>> wrote: >>>> > Thanks Mike for getting back to me, sounds like I'm on the right track. >>>> > >>>> > I'm building the automaton from around 1.7million strings, and it ends up >>>> > with about 3.8million states and it turns out building a >>>> > CharacterRunAutomaton from that takes up about 2gig of heap (I was quite >>>> > suprised!), with negligible performance difference at run time. At your >>>> > suggestion I tried the ByteRunAutomaton and it was similar to the >>>> > CharacterRunAutomaton >>>> > in terms of heap and run time. So for now I'm going to just stick to an >>>> > Automaton. >>>> > >>>> > On 14 February 2017 at 00:41, Michael McCandless < >>>> luc...@mikemccandless.com> >>>> > wrote: >>>> > >>>> >> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <o.mann...@gmail.com> >>>> >> wrote: >>>> >> >>>> >> > I'd like to construct an Automaton to prefix match against a large >>>> set of >>>> >> > strings. I gather a RunAutomation is immutable, thread safe and faster >>>> >> than >>>> >> > Automaton. >>>> >> >>>> >> That's correct. >>>> >> >>>> >> > Are there any other differences between the three Automaton >>>> >> > classes, for example, in memory usage? >>>> >> >>>> >> CompiledAutomaton is just a thin wrapper to hold onto both the >>>> >> original Automaton and the RunAutomaton, plus some other corner-casey >>>> >> things that are likely not interesting for your usage. >>>> >> >>>> >> > Would the general approach for such a problem be to use >>>> >> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted >>>> list >>>> >> of >>>> >> > my string set, set all the states to accept (to enable prefix >>>> matching), >>>> >> > then pass the Automaton into the constructor of a >>>> CharacterRunAutomaton, >>>> >> > and use the run method on the CharacterRunAutomaton to match any >>>> queries? >>>> >> >>>> >> That sounds right. >>>> >> >>>> >> You could also try doing everything in UTF8 space instead, and use >>>> >> ByteRunAutomaton: it will likely be faster since it will do faster >>>> >> lookups on each transition. It should still be safe to set all states >>>> >> as accept, even though some of those states will be inside a single >>>> >> Unicode character, as long as the strings you test against are whole >>>> >> UTF-8 sequences? >>>> >> >>>> >> > As it seems like I'm building up the Automaton at least three times, >>>> and >>>> >> > keeping a reference to the Automaton in the CharacterRunAutomaton, is >>>> >> this >>>> >> > the most memory efficient way of building such an Automaton? >>>> >> >>>> >> Yeah, it is. The RunAutomaton will likely require substantial heap in >>>> >> your case, probably more than the original automaton. >>>> >> >>>> >> I suppose you don't actually need to keep the Automaton around once >>>> >> the RunAutomaton is built, but Lucene doesn't make this possible >>>> >> today, since the RunAutomaton holds onto the Automaton... >>>> >> >>>> >> > Thanks in advance, >>>> >> >>>> >> You're welcome! >>>> >> >>>> >> Mike McCandless >>>> >> >>>> >> http://blog.mikemccandless.com >>>> >> >>>> >> --------------------------------------------------------------------- >>>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> >> >>>> >> >>>> --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org