Hi David & Mike, Thanks for the suggestions!
Regarding my requirements for building the FSA, it happens in a background process, not as part of a request, although the app is still serving requests at the same time. So the time taken to build isn't really the issue, it's just that it shouldn't starve the other requests of CPU. I've tried building an FST<Object> with NoOutputs, and morfologik's FSA, and both are much better in terms of build time, CPU and RAM usage compared to Automaton. They are similar to, and in some dimensions even better than, the PatriciaTrie. In particular building an FST with doShareSuffix = false is the fastest of any option, and morfologik's fsa has the lowest heap usage of all (both in terms of garbage and the final size of the FSA). Either will meet my requirements. I quite like the API provided by morfologik (no need to work with BytesRef) and it supports both prefix (MatchResult.SEQUENCE_IS_A_PREFIX) and exact matching (MatchResult.EXACT_MATCH). Given the efficiencies and functionality similarity of the FST vs Automation in Lucene, I'm kind of curious as to why you would ever use Automaton? Thanks very much for your help! Oliver On 15 February 2017 at 21:49, Michael McCandless <luc...@mikemccandless.com> wrote: > Actually, that's a great idea to try (Oliver). It would be a > relatively simple conversion... maybe Lucene could add some sugar on > top, e.g. to convert an FST<Void> to an automaton. Hmm, maybe it even > exists somewhere already... > > But even the FST Builder's NodeHash can be non-trivial in its heap > usage, but hopefully less than DaciukMihovAutomatonBuilder. > > (And yes I do love how simple DaciukMihovAutomatonBuilder is). > > Mike McCandless > > http://blog.mikemccandless.com > > > On Wed, Feb 15, 2017 at 5:39 AM, Dawid Weiss <dawid.we...@gmail.com> > wrote: > > Yep, true. I just wonder whether it's worth complicating the code... > > Could be easier to build an FST<Void> and then recreate a RunAutomaton > > from that directly... :) > > > > Dawid > > > > On Wed, Feb 15, 2017 at 11:26 AM, Michael McCandless > > <luc...@mikemccandless.com> wrote: > >> We may be able to make DaciukMihovAutomatonBuilder's state registry > >> more ram efficient too ... I think it's essentially the same thing as > >> the FST.Builder's NodeHash, just minus the outputs that FSTs have vs > >> automata. > >> > >> Mike McCandless > >> > >> http://blog.mikemccandless.com > >> > >> > >> On Wed, Feb 15, 2017 at 5:14 AM, Dawid Weiss <dawid.we...@gmail.com> > wrote: > >>> You could try using morfologik's byte-based implementation: > >>> > >>> https://github.com/morfologik/morfologik-stemming/blob/ > master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/ > FSABuilderTest.java > >>> > >>> I can't guarantee it'll be fast enough -- you need to sort those input > >>> sequences and even this may take a while. The construction of the > >>> automaton after that is fairly fast. What are the time limits you have > >>> with respect to input data sizes? Perhaps it's just unrealistic to > >>> assume everything is performed as part of a single request? > >>> > >>> Dawid > >>> > >>> On Wed, Feb 15, 2017 at 11:05 AM, Oliver Mannion <o.mann...@gmail.com> > wrote: > >>>> Hi Mike, > >>>> > >>>> Thanks for the suggestion, I've tried Operations.run on a Automaton > and > >>>> it's fast enough for my use case. > >>>> > >>>> However, the real problem I have is in building the Automaton > >>>> via DaciukMihovAutomatonBuilder. On my input string set it consumes > quite a > >>>> bit of CPU, a lot of which seems to be GC activity, cleaning up about > 800mb > >>>> of garbage produced during build (DaciukMihovAutomatonBuilder.States > I > >>>> think). I'm using this in an app that also serves realtime requests, > which > >>>> timeout during building of the Automaton. It's an inherited > application > >>>> design and not the best (ideally I'd be building the Automaton > offline in > >>>> another process) and I don't expect it's a use case considered for > >>>> DaciukMihovAutomatonBuilder. Unfortunately it means I've started > looking > >>>> into other data structures for doing prefix matching. The dk.brics > >>>> Automaton has a similar performance profile, while the PatriciaTrie > from > >>>> Apache Commons Collections seems to consume less CPU and produce less > >>>> garbage during build, although it has a less than ideal interface > (it's a > >>>> Map). > >>>> > >>>> Any further suggestion though would be most welcome! > >>>> > >>>> Regards, > >>>> > >>>> Oliver > >>>> > >>>> On 14 February 2017 at 22:56, Michael McCandless < > luc...@mikemccandless.com> > >>>> wrote: > >>>> > >>>>> Wow, 2G heap, that's horrible! > >>>>> > >>>>> How much heap does the automaton itself take? > >>>>> > >>>>> You can use the automaton's step method to transition from a state > >>>>> given the next input character to another state (or -1 if that state > >>>>> doesn't accept that character); it will be slower than the 2 GB run > >>>>> automaton, but perhaps fast enough? > >>>>> > >>>>> Mike McCandless > >>>>> > >>>>> http://blog.mikemccandless.com > >>>>> > >>>>> > >>>>> On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion <o.mann...@gmail.com > > > >>>>> wrote: > >>>>> > Thanks Mike for getting back to me, sounds like I'm on the right > track. > >>>>> > > >>>>> > I'm building the automaton from around 1.7million strings, and it > ends up > >>>>> > with about 3.8million states and it turns out building a > >>>>> > CharacterRunAutomaton from that takes up about 2gig of heap (I was > quite > >>>>> > suprised!), with negligible performance difference at run time. At > your > >>>>> > suggestion I tried the ByteRunAutomaton and it was similar to the > >>>>> > CharacterRunAutomaton > >>>>> > in terms of heap and run time. So for now I'm going to just stick > to an > >>>>> > Automaton. > >>>>> > > >>>>> > On 14 February 2017 at 00:41, Michael McCandless < > >>>>> luc...@mikemccandless.com> > >>>>> > wrote: > >>>>> > > >>>>> >> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion < > o.mann...@gmail.com> > >>>>> >> wrote: > >>>>> >> > >>>>> >> > I'd like to construct an Automaton to prefix match against a > large > >>>>> set of > >>>>> >> > strings. I gather a RunAutomation is immutable, thread safe and > faster > >>>>> >> than > >>>>> >> > Automaton. > >>>>> >> > >>>>> >> That's correct. > >>>>> >> > >>>>> >> > Are there any other differences between the three Automaton > >>>>> >> > classes, for example, in memory usage? > >>>>> >> > >>>>> >> CompiledAutomaton is just a thin wrapper to hold onto both the > >>>>> >> original Automaton and the RunAutomaton, plus some other > corner-casey > >>>>> >> things that are likely not interesting for your usage. > >>>>> >> > >>>>> >> > Would the general approach for such a problem be to use > >>>>> >> > DaciukMihovAutomatonBuilder to create an Automaton from the > sorted > >>>>> list > >>>>> >> of > >>>>> >> > my string set, set all the states to accept (to enable prefix > >>>>> matching), > >>>>> >> > then pass the Automaton into the constructor of a > >>>>> CharacterRunAutomaton, > >>>>> >> > and use the run method on the CharacterRunAutomaton to match any > >>>>> queries? > >>>>> >> > >>>>> >> That sounds right. > >>>>> >> > >>>>> >> You could also try doing everything in UTF8 space instead, and use > >>>>> >> ByteRunAutomaton: it will likely be faster since it will do faster > >>>>> >> lookups on each transition. It should still be safe to set all > states > >>>>> >> as accept, even though some of those states will be inside a > single > >>>>> >> Unicode character, as long as the strings you test against are > whole > >>>>> >> UTF-8 sequences? > >>>>> >> > >>>>> >> > As it seems like I'm building up the Automaton at least three > times, > >>>>> and > >>>>> >> > keeping a reference to the Automaton in the > CharacterRunAutomaton, is > >>>>> >> this > >>>>> >> > the most memory efficient way of building such an Automaton? > >>>>> >> > >>>>> >> Yeah, it is. The RunAutomaton will likely require substantial > heap in > >>>>> >> your case, probably more than the original automaton. > >>>>> >> > >>>>> >> I suppose you don't actually need to keep the Automaton around > once > >>>>> >> the RunAutomaton is built, but Lucene doesn't make this possible > >>>>> >> today, since the RunAutomaton holds onto the Automaton... > >>>>> >> > >>>>> >> > Thanks in advance, > >>>>> >> > >>>>> >> You're welcome! > >>>>> >> > >>>>> >> Mike McCandless > >>>>> >> > >>>>> >> http://blog.mikemccandless.com > >>>>> >> > >>>>> >> ------------------------------------------------------------ > --------- > >>>>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > >>>>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org > >>>>> >> > >>>>> >> > >>>>> >