[ 
https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994934#comment-13994934
 ] 

Christian Ziech commented on LUCENE-5584:
-----------------------------------------

Optimizing our code for intersecting an automaton with an FST (inspired by 
org.apache.lucene.search.suggest.analyzing.FSTUtil#intersectPrefixPaths) I came 
across the following locations that create objects that actually could do 
without:
- the "scratchArc" is created for every node in the automaton
- for every state in the Automaton an iterator is created implicitly when 
iterating over the Transitions of the state
- outputs.add() creates a new outputs value object for every state of the 
automaton if the corresponding FST state had an output
- for every transition visited a new IntsRef instance is created
- for every FST node read a new outputs value object is created

All except the last allocation location was fixed easily:
- we keep the scratch arcs in a Stack and hence only create one per level of 
the automaton (about 10-15 levels for us)
- we iterate over the states using an int index in the transitions array
- we replaced outputs add by our own method that just appends the outputs of 
the FST Arc to a single outputs value per intersect call and then upon exiting 
the recursion just removes it again
- same goes for the input IntsRef - we have one instance that is just modified 
as we traverse the automaton/FST

For the last allocation location we now have gone with a special Outputs 
implementation that uses a rather ugly construct to always return the very same 
outputs instance for the iterate case per Thread. Thinking about the problem 
again I came to think of another (easier) solution to that problem. If the 
outputs of the FST wouldn't actually be a field of the FST itself but if they 
would be under control of the caller of the FST read*Arc methods just like the 
BytesReader is, we wouldn't have the problem (maybe instead of the 
BytesReader). That way we could just create a new Outputs instance for each of 
our intersection runs and wouldn't need to resort that construct which attaches 
a state to something that is not meant to have a state.

> Allow FST read method to also recycle the output value when traversing FST
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-5584
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5584
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>    Affects Versions: 4.7.1
>            Reporter: Christian Ziech
>         Attachments: fst-itersect-benchmark.tgz
>
>
> The FST class heavily reuses Arc instances when traversing the FST. The 
> output of an Arc however is not reused. This can especially be important when 
> traversing large portions of a FST and using the ByteSequenceOutputs and 
> CharSequenceOutputs. Those classes create a new byte[] or char[] for every 
> node read (which has an output).
> In our use case we intersect a lucene Automaton with a FST<BytesRef> much 
> like it is done in 
> org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths() and 
> since the Automaton and the FST are both rather large tens or even hundreds 
> of thousands of temporary byte array objects are created.
> One possible solution to the problem would be to change the 
> org.apache.lucene.util.fst.Outputs class to have two additional methods (if 
> you don't want to change the existing methods for compatibility):
> {code}
>   /** Decode an output value previously written with {@link
>    *  #write(Object, DataOutput)} reusing the object passed in if possible */
>   public abstract T read(DataInput in, T reuse) throws IOException;
>   /** Decode an output value previously written with {@link
>    *  #writeFinalOutput(Object, DataOutput)}.  By default this
>    *  just calls {@link #read(DataInput)}. This tries to  reuse the object   
>    *  passed in if possible */
>   public T readFinalOutput(DataInput in, T reuse) throws IOException {
>     return read(in, reuse);
>   }
> {code}
> The new methods could then be used in the FST in the readNextRealArc() method 
> passing in the output of the reused Arc. For most inputs they could even just 
> invoke the original read(in) method.
> If you should decide to make that change I'd be happy to supply a patch 
> and/or tests for the feature.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to