[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994934#comment-13994934 ]
Christian Ziech commented on LUCENE-5584: ----------------------------------------- Optimizing our code for intersecting an automaton with an FST (inspired by org.apache.lucene.search.suggest.analyzing.FSTUtil#intersectPrefixPaths) I came across the following locations that create objects that actually could do without: - the "scratchArc" is created for every node in the automaton - for every state in the Automaton an iterator is created implicitly when iterating over the Transitions of the state - outputs.add() creates a new outputs value object for every state of the automaton if the corresponding FST state had an output - for every transition visited a new IntsRef instance is created - for every FST node read a new outputs value object is created All except the last allocation location was fixed easily: - we keep the scratch arcs in a Stack and hence only create one per level of the automaton (about 10-15 levels for us) - we iterate over the states using an int index in the transitions array - we replaced outputs add by our own method that just appends the outputs of the FST Arc to a single outputs value per intersect call and then upon exiting the recursion just removes it again - same goes for the input IntsRef - we have one instance that is just modified as we traverse the automaton/FST For the last allocation location we now have gone with a special Outputs implementation that uses a rather ugly construct to always return the very same outputs instance for the iterate case per Thread. Thinking about the problem again I came to think of another (easier) solution to that problem. If the outputs of the FST wouldn't actually be a field of the FST itself but if they would be under control of the caller of the FST read*Arc methods just like the BytesReader is, we wouldn't have the problem (maybe instead of the BytesReader). That way we could just create a new Outputs instance for each of our intersection runs and wouldn't need to resort that construct which attaches a state to something that is not meant to have a state. > Allow FST read method to also recycle the output value when traversing FST > -------------------------------------------------------------------------- > > Key: LUCENE-5584 > URL: https://issues.apache.org/jira/browse/LUCENE-5584 > Project: Lucene - Core > Issue Type: Improvement > Components: core/FSTs > Affects Versions: 4.7.1 > Reporter: Christian Ziech > Attachments: fst-itersect-benchmark.tgz > > > The FST class heavily reuses Arc instances when traversing the FST. The > output of an Arc however is not reused. This can especially be important when > traversing large portions of a FST and using the ByteSequenceOutputs and > CharSequenceOutputs. Those classes create a new byte[] or char[] for every > node read (which has an output). > In our use case we intersect a lucene Automaton with a FST<BytesRef> much > like it is done in > org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths() and > since the Automaton and the FST are both rather large tens or even hundreds > of thousands of temporary byte array objects are created. > One possible solution to the problem would be to change the > org.apache.lucene.util.fst.Outputs class to have two additional methods (if > you don't want to change the existing methods for compatibility): > {code} > /** Decode an output value previously written with {@link > * #write(Object, DataOutput)} reusing the object passed in if possible */ > public abstract T read(DataInput in, T reuse) throws IOException; > /** Decode an output value previously written with {@link > * #writeFinalOutput(Object, DataOutput)}. By default this > * just calls {@link #read(DataInput)}. This tries to reuse the object > * passed in if possible */ > public T readFinalOutput(DataInput in, T reuse) throws IOException { > return read(in, reuse); > } > {code} > The new methods could then be used in the FST in the readNextRealArc() method > passing in the output of the reused Arc. For most inputs they could even just > invoke the original read(in) method. > If you should decide to make that change I'd be happy to supply a patch > and/or tests for the feature. -- This message was sent by Atlassian JIRA (v6.2#6252) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org