[ 
https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964384#comment-13964384
 ] 

Michael McCandless commented on LUCENE-5584:
--------------------------------------------

It would be nice to allow reuse of outputs, for types that are re-usable (e.g. 
BytesRef, not Long).

The methods wouldn't need to be abstract, right?  They could by default fall 
back to the non-reuse method (i.e. ignore the incoming reuse parameter).
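
A minimal sketch of that default-fallback idea, not a patch: SketchOutputs, 
SketchLongOutputs, SketchBytes and SketchBytesOutputs are hypothetical 
stand-ins (not Lucene classes), java.io.DataInputStream replaces Lucene's 
DataInput so the example is self-contained, and the int-length-prefixed 
encoding is invented for illustration.

```java
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical stand-in for org.apache.lucene.util.fst.Outputs.
abstract class SketchOutputs<T> {
  /** Existing-style read: always produces a fresh value. */
  abstract T read(DataInputStream in) throws IOException;

  /** Reuse variant: not abstract; by default it ignores the reuse parameter. */
  T read(DataInputStream in, T reuse) throws IOException {
    return read(in);
  }
}

// Long is immutable, so this type simply inherits the default fallback.
class SketchLongOutputs extends SketchOutputs<Long> {
  @Override
  Long read(DataInputStream in) throws IOException {
    return in.readLong();
  }
}

// Hypothetical mutable byte holder, playing the role of a BytesRef.
class SketchBytes {
  byte[] bytes = new byte[0];
  int length;
}

// A re-usable type overrides the reuse variant to avoid allocating.
class SketchBytesOutputs extends SketchOutputs<SketchBytes> {
  @Override
  SketchBytes read(DataInputStream in) throws IOException {
    return read(in, new SketchBytes());
  }

  @Override
  SketchBytes read(DataInputStream in, SketchBytes reuse) throws IOException {
    if (reuse == null) {
      reuse = new SketchBytes();
    }
    int len = in.readInt(); // assumed encoding: int length, then the bytes
    if (reuse.bytes.length < len) {
      reuse.bytes = new byte[len]; // grow only when the old array is too small
    }
    in.readFully(reuse.bytes, 0, len);
    reuse.length = len;
    return reuse;
  }
}
```

Only the re-usable type needs any new code here; existing Outputs 
subclasses would keep compiling and behaving as before.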

Using this API might be a bit tricky, e.g. Util.get accumulates the output as 
it goes, and it needs both the current output, and the new one it just read 
from the arc, to be available simultaneously so that it can call Outputs.add.  
I wonder if we could do the re-use there, e.g. allow add to modify and return 
one of its incoming arguments?
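
One way such an add could look, as a rough sketch only: MutableBytes and 
InPlaceAddSketch are hypothetical names (not part of Lucene), and the 
contract shown (append onto the first argument, leave the second alone) is 
just one possible choice.

```java
import java.util.Arrays;

// Hypothetical growable byte output; not a Lucene class.
final class MutableBytes {
  byte[] bytes = new byte[8];
  int length;

  static MutableBytes of(byte... data) {
    MutableBytes b = new MutableBytes();
    b.append(data, data.length);
    return b;
  }

  void append(byte[] src, int len) {
    if (length + len > bytes.length) {
      // Grow geometrically so that repeated appends stay cheap on average.
      bytes = Arrays.copyOf(bytes, Math.max(bytes.length * 2, length + len));
    }
    System.arraycopy(src, 0, bytes, length, len);
    length += len;
  }
}

final class InPlaceAddSketch {
  /**
   * Appends inc onto prefix and returns prefix itself. inc is left
   * untouched, so the arc that owns it can keep reusing it.
   */
  static MutableBytes add(MutableBytes prefix, MutableBytes inc) {
    prefix.append(inc.bytes, inc.length);
    return prefix;
  }
}
```

With this contract a Util.get-style loop would keep one accumulator per 
lookup and pass each arc's output as the second argument, so both values 
stay available while add runs.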

But one thing to remember here: this garbage is very short-lived and modern 
JVMs are usually very fast at collecting such garbage.  Also, if you have an 
FST that has longish byte[] outputs, this may still be very slow even if we 
enable re-use, because accumulating the outputs is at heart O(N^2): it copies 
the entire byte[] each time it needs to append a bit more onto the end.  (It's 
like concatenating Strings instead of using a StringBuilder.)  So if that's 
the root cause of the slowness you are seeing, re-use alone won't fix it, 
unless we can do something with add and e.g. have a ByteArrayBuilder sort of 
output?
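
To make the copying cost concrete, here is a small counting sketch. 
AppendCostSketch is a hypothetical class, not Lucene code; it only tallies 
how many bytes each strategy copies, under the assumption that every step 
appends a fixed-size chunk.

```java
import java.util.Arrays;

final class AppendCostSketch {
  /** Copy-on-append: every step allocates a new array and copies everything. */
  static long copyOnAppend(int steps, int chunk) {
    byte[] acc = new byte[0];
    byte[] piece = new byte[chunk];
    long copied = 0;
    for (int i = 0; i < steps; i++) {
      byte[] next = new byte[acc.length + piece.length];
      System.arraycopy(acc, 0, next, 0, acc.length);
      System.arraycopy(piece, 0, next, acc.length, piece.length);
      copied += next.length; // old contents plus the new chunk
      acc = next;
    }
    return copied;
  }

  /** Builder-style: one buffer that doubles when full. */
  static long growableAppend(int steps, int chunk) {
    byte[] buf = new byte[16];
    int length = 0;
    byte[] piece = new byte[chunk];
    long copied = 0;
    for (int i = 0; i < steps; i++) {
      if (length + piece.length > buf.length) {
        buf = Arrays.copyOf(buf, Math.max(buf.length * 2, length + piece.length));
        copied += length; // bytes moved by the resize
      }
      System.arraycopy(piece, 0, buf, length, piece.length);
      copied += piece.length;
      length += piece.length;
    }
    return copied;
  }
}
```

The first total grows quadratically with the number of steps, while the 
second stays linear, which is the String versus StringBuilder difference 
described above.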

> Allow FST read method to also recycle the output value when traversing FST
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-5584
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5584
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>    Affects Versions: 4.7.1
>            Reporter: Christian Ziech
>
> The FST class heavily reuses Arc instances when traversing the FST. The 
> output of an Arc however is not reused. This can especially be important when 
> traversing large portions of an FST and using the ByteSequenceOutputs and 
> CharSequenceOutputs. Those classes create a new byte[] or char[] for every 
> node read (which has an output).
> In our use case we intersect a Lucene Automaton with an FST<BytesRef> much 
> like it is done in 
> org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths() and 
> since the Automaton and the FST are both rather large, tens or even hundreds 
> of thousands of temporary byte array objects are created.
> One possible solution to the problem would be to change the 
> org.apache.lucene.util.fst.Outputs class to have two additional methods (if 
> you don't want to change the existing methods for compatibility):
> {code}
>   /** Decode an output value previously written with {@link
>    *  #write(Object, DataOutput)} reusing the object passed in if possible */
>   public abstract T read(DataInput in, T reuse) throws IOException;
>   /** Decode an output value previously written with {@link
>    *  #writeFinalOutput(Object, DataOutput)}.  By default this
>    *  just calls {@link #read(DataInput, Object)}. This tries to reuse the
>    *  object passed in if possible */
>   public T readFinalOutput(DataInput in, T reuse) throws IOException {
>     return read(in, reuse);
>   }
> {code}
> The new methods could then be used in the FST in the readNextRealArc() method 
> passing in the output of the reused Arc. For most inputs they could even just 
> invoke the original read(in) method.
> If you should decide to make that change I'd be happy to supply a patch 
> and/or tests for the feature.


