Actually, that's a great idea to try (Oliver). It would be a relatively simple conversion... maybe Lucene could add some sugar on top, e.g. to convert an FST<Void> to an automaton. Hmm, maybe it even exists somewhere already...
But even the FST Builder's NodeHash can be non-trivial in its heap usage, but hopefully less than DaciukMihovAutomatonBuilder. (And yes I do love how simple DaciukMihovAutomatonBuilder is). Mike McCandless http://blog.mikemccandless.com On Wed, Feb 15, 2017 at 5:39 AM, Dawid Weiss <dawid.we...@gmail.com> wrote: > Yep, true. I just wonder whether it's worth complicating the code... > Could be easier to build an FST<Void> and then recreate a RunAutomaton > from that directly... :) > > Dawid > > On Wed, Feb 15, 2017 at 11:26 AM, Michael McCandless > <luc...@mikemccandless.com> wrote: >> We may be able to make DaciukMihovAutomatonBuilder's state registry >> more ram efficient too ... I think it's essentially the same thing as >> the FST.Builder's NodeHash, just minus the outputs that FSTs have vs >> automata. >> >> Mike McCandless >> >> http://blog.mikemccandless.com >> >> >> On Wed, Feb 15, 2017 at 5:14 AM, Dawid Weiss <dawid.we...@gmail.com> wrote: >>> You could try using morfologik's byte-based implementation: >>> >>> https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/FSABuilderTest.java >>> >>> I can't guarantee it'll be fast enough -- you need to sort those input >>> sequences and even this may take a while. The construction of the >>> automaton after that is fairly fast. What are the time limits you have >>> with respect to input data sizes? Perhaps it's just unrealistic to >>> assume everything is performed as part of a single request? >>> >>> Dawid >>> >>> On Wed, Feb 15, 2017 at 11:05 AM, Oliver Mannion <o.mann...@gmail.com> >>> wrote: >>>> Hi Mike, >>>> >>>> Thanks for the suggestion, I've tried Operations.run on a Automaton and >>>> it's fast enough for my use case. >>>> >>>> However, the real problem I have is in building the Automaton >>>> via DaciukMihovAutomatonBuilder. On my input string set it consumes quite a >>>> bit of CPU, a lot of which seems to be GC activity, cleaning up about 800mb >>>> of garbage produced during build (DaciukMihovAutomatonBuilder.States I >>>> think). I'm using this in an app that also serves realtime requests, which >>>> timeout during building of the Automaton. It's an inherited application >>>> design and not the best (ideally I'd be building the Automaton offline in >>>> another process) and I don't expect it's a use case considered for >>>> DaciukMihovAutomatonBuilder. Unfortunately it means I've started looking >>>> into other data structures for doing prefix matching. The dk.brics >>>> Automaton has a similar performance profile, while the PatriciaTrie from >>>> Apache Commons Collections seems to consume less CPU and produce less >>>> garbage during build, although it has a less than ideal interface (it's a >>>> Map). >>>> >>>> Any further suggestion though would be most welcome! >>>> >>>> Regards, >>>> >>>> Oliver >>>> >>>> On 14 February 2017 at 22:56, Michael McCandless >>>> <luc...@mikemccandless.com> >>>> wrote: >>>> >>>>> Wow, 2G heap, that's horrible! >>>>> >>>>> How much heap does the automaton itself take? >>>>> >>>>> You can use the automaton's step method to transition from a state >>>>> given the next input character to another state (or -1 if that state >>>>> doesn't accept that character); it will be slower than the 2 GB run >>>>> automaton, but perhaps fast enough? >>>>> >>>>> Mike McCandless >>>>> >>>>> http://blog.mikemccandless.com >>>>> >>>>> >>>>> On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion <o.mann...@gmail.com> >>>>> wrote: >>>>> > Thanks Mike for getting back to me, sounds like I'm on the right track. >>>>> > >>>>> > I'm building the automaton from around 1.7million strings, and it ends >>>>> > up >>>>> > with about 3.8million states and it turns out building a >>>>> > CharacterRunAutomaton from that takes up about 2gig of heap (I was quite >>>>> > suprised!), with negligible performance difference at run time. At your >>>>> > suggestion I tried the ByteRunAutomaton and it was similar to the >>>>> > CharacterRunAutomaton >>>>> > in terms of heap and run time. So for now I'm going to just stick to an >>>>> > Automaton. >>>>> > >>>>> > On 14 February 2017 at 00:41, Michael McCandless < >>>>> luc...@mikemccandless.com> >>>>> > wrote: >>>>> > >>>>> >> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <o.mann...@gmail.com> >>>>> >> wrote: >>>>> >> >>>>> >> > I'd like to construct an Automaton to prefix match against a large >>>>> set of >>>>> >> > strings. I gather a RunAutomation is immutable, thread safe and >>>>> >> > faster >>>>> >> than >>>>> >> > Automaton. >>>>> >> >>>>> >> That's correct. >>>>> >> >>>>> >> > Are there any other differences between the three Automaton >>>>> >> > classes, for example, in memory usage? >>>>> >> >>>>> >> CompiledAutomaton is just a thin wrapper to hold onto both the >>>>> >> original Automaton and the RunAutomaton, plus some other corner-casey >>>>> >> things that are likely not interesting for your usage. >>>>> >> >>>>> >> > Would the general approach for such a problem be to use >>>>> >> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted >>>>> list >>>>> >> of >>>>> >> > my string set, set all the states to accept (to enable prefix >>>>> matching), >>>>> >> > then pass the Automaton into the constructor of a >>>>> CharacterRunAutomaton, >>>>> >> > and use the run method on the CharacterRunAutomaton to match any >>>>> queries? >>>>> >> >>>>> >> That sounds right. >>>>> >> >>>>> >> You could also try doing everything in UTF8 space instead, and use >>>>> >> ByteRunAutomaton: it will likely be faster since it will do faster >>>>> >> lookups on each transition. It should still be safe to set all states >>>>> >> as accept, even though some of those states will be inside a single >>>>> >> Unicode character, as long as the strings you test against are whole >>>>> >> UTF-8 sequences? >>>>> >> >>>>> >> > As it seems like I'm building up the Automaton at least three times, >>>>> and >>>>> >> > keeping a reference to the Automaton in the CharacterRunAutomaton, is >>>>> >> this >>>>> >> > the most memory efficient way of building such an Automaton? >>>>> >> >>>>> >> Yeah, it is. The RunAutomaton will likely require substantial heap in >>>>> >> your case, probably more than the original automaton. >>>>> >> >>>>> >> I suppose you don't actually need to keep the Automaton around once >>>>> >> the RunAutomaton is built, but Lucene doesn't make this possible >>>>> >> today, since the RunAutomaton holds onto the Automaton... >>>>> >> >>>>> >> > Thanks in advance, >>>>> >> >>>>> >> You're welcome! >>>>> >> >>>>> >> Mike McCandless >>>>> >> >>>>> >> http://blog.mikemccandless.com >>>>> >> >>>>> >> --------------------------------------------------------------------- >>>>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>>> >> >>>>> >> >>>>> --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org