We may be able to make DaciukMihovAutomatonBuilder's state registry more ram efficient too ... I think it's essentially the same thing as the FST.Builder's NodeHash, just minus the outputs that FSTs have vs automata.
Mike McCandless http://blog.mikemccandless.com On Wed, Feb 15, 2017 at 5:14 AM, Dawid Weiss <dawid.we...@gmail.com> wrote: > You could try using morfologik's byte-based implementation: > > https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/FSABuilderTest.java > > I can't guarantee it'll be fast enough -- you need to sort those input > sequences and even this may take a while. The construction of the > automaton after that is fairly fast. What are the time limits you have > with respect to input data sizes? Perhaps it's just unrealistic to > assume everything is performed as part of a single request? > > Dawid > > On Wed, Feb 15, 2017 at 11:05 AM, Oliver Mannion <o.mann...@gmail.com> wrote: >> Hi Mike, >> >> Thanks for the suggestion, I've tried Operations.run on a Automaton and >> it's fast enough for my use case. >> >> However, the real problem I have is in building the Automaton >> via DaciukMihovAutomatonBuilder. On my input string set it consumes quite a >> bit of CPU, a lot of which seems to be GC activity, cleaning up about 800mb >> of garbage produced during build (DaciukMihovAutomatonBuilder.States I >> think). I'm using this in an app that also serves realtime requests, which >> timeout during building of the Automaton. It's an inherited application >> design and not the best (ideally I'd be building the Automaton offline in >> another process) and I don't expect it's a use case considered for >> DaciukMihovAutomatonBuilder. Unfortunately it means I've started looking >> into other data structures for doing prefix matching. The dk.brics >> Automaton has a similar performance profile, while the PatriciaTrie from >> Apache Commons Collections seems to consume less CPU and produce less >> garbage during build, although it has a less than ideal interface (it's a >> Map). >> >> Any further suggestion though would be most welcome! >> >> Regards, >> >> Oliver >> >> On 14 February 2017 at 22:56, Michael McCandless <luc...@mikemccandless.com> >> wrote: >> >>> Wow, 2G heap, that's horrible! >>> >>> How much heap does the automaton itself take? >>> >>> You can use the automaton's step method to transition from a state >>> given the next input character to another state (or -1 if that state >>> doesn't accept that character); it will be slower than the 2 GB run >>> automaton, but perhaps fast enough? >>> >>> Mike McCandless >>> >>> http://blog.mikemccandless.com >>> >>> >>> On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion <o.mann...@gmail.com> >>> wrote: >>> > Thanks Mike for getting back to me, sounds like I'm on the right track. >>> > >>> > I'm building the automaton from around 1.7million strings, and it ends up >>> > with about 3.8million states and it turns out building a >>> > CharacterRunAutomaton from that takes up about 2gig of heap (I was quite >>> > suprised!), with negligible performance difference at run time. At your >>> > suggestion I tried the ByteRunAutomaton and it was similar to the >>> > CharacterRunAutomaton >>> > in terms of heap and run time. So for now I'm going to just stick to an >>> > Automaton. >>> > >>> > On 14 February 2017 at 00:41, Michael McCandless < >>> luc...@mikemccandless.com> >>> > wrote: >>> > >>> >> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <o.mann...@gmail.com> >>> >> wrote: >>> >> >>> >> > I'd like to construct an Automaton to prefix match against a large >>> set of >>> >> > strings. I gather a RunAutomation is immutable, thread safe and faster >>> >> than >>> >> > Automaton. >>> >> >>> >> That's correct. >>> >> >>> >> > Are there any other differences between the three Automaton >>> >> > classes, for example, in memory usage? >>> >> >>> >> CompiledAutomaton is just a thin wrapper to hold onto both the >>> >> original Automaton and the RunAutomaton, plus some other corner-casey >>> >> things that are likely not interesting for your usage. >>> >> >>> >> > Would the general approach for such a problem be to use >>> >> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted >>> list >>> >> of >>> >> > my string set, set all the states to accept (to enable prefix >>> matching), >>> >> > then pass the Automaton into the constructor of a >>> CharacterRunAutomaton, >>> >> > and use the run method on the CharacterRunAutomaton to match any >>> queries? >>> >> >>> >> That sounds right. >>> >> >>> >> You could also try doing everything in UTF8 space instead, and use >>> >> ByteRunAutomaton: it will likely be faster since it will do faster >>> >> lookups on each transition. It should still be safe to set all states >>> >> as accept, even though some of those states will be inside a single >>> >> Unicode character, as long as the strings you test against are whole >>> >> UTF-8 sequences? >>> >> >>> >> > As it seems like I'm building up the Automaton at least three times, >>> and >>> >> > keeping a reference to the Automaton in the CharacterRunAutomaton, is >>> >> this >>> >> > the most memory efficient way of building such an Automaton? >>> >> >>> >> Yeah, it is. The RunAutomaton will likely require substantial heap in >>> >> your case, probably more than the original automaton. >>> >> >>> >> I suppose you don't actually need to keep the Automaton around once >>> >> the RunAutomaton is built, but Lucene doesn't make this possible >>> >> today, since the RunAutomaton holds onto the Automaton... >>> >> >>> >> > Thanks in advance, >>> >> >>> >> You're welcome! >>> >> >>> >> Mike McCandless >>> >> >>> >> http://blog.mikemccandless.com >>> >> >>> >> --------------------------------------------------------------------- >>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >> >>> >> >>> --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org