[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13965215#comment-13965215 ]

Christian Ziech commented on LUCENE-5584:
-----------------------------------------

Thanks for the very quick and helpful replies. It seems I owe you some more 
hard, concrete information on our use case, what exactly we do, and our 
environment.
About the environment - the tests were run with
{quote}
java version "1.7.0_45"
OpenJDK Runtime Environment (rhel-2.4.3.4.el6_5-x86_64 u45-b15)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)
{quote}
on CentOS 6.5. Our VM options don't enable the TLAB right now, but I'm 
definitely considering using it for other reasons. Currently we are running 
with the following (GC-relevant) arguments: -Xmx6g -XX:MaxNewSize=700m 
-XX:+UseConcMarkSweepGC -XX:MaxDirectMemorySize=35g. 

I'm not so much worried about the get performance, although that could be 
improved as well. We are using Lucene's LevenshteinAutomata class to generate 
a couple of Levenshtein automata with edit distance 1 or 2 (one for each 
search term), build their union, and intersect it with our FST using a 
modified version of 
org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths() that 
uses a callback method to push every matched entry instead of returning the 
whole list of paths (also for efficiency reasons: we don't actually need the 
byte arrays, but we want to parse them into a value object, so reusing the 
output byte array is fine for us).
Our FST has about 500M entries, and each entry has a value of approx. 10-20 
bytes. For a random query with 4 terms (and hence a union of 4 Levenshtein 
automata) that produces ~2M visited nodes with output (hence 2M temporary 
byte[] instances) and a total of ~7.5M for the temporary byte arrays (plus 
the per-instance overhead). In that experiment about 10k terms were matched 
in the FST. Those numbers already take into account that we use our own add 
implementation, which always writes into the same BytesRef instance when 
adding outputs.
The overall impact on the GC and also on the execution speed of the method 
was rather significant in total. I can try to dig up numbers for that, but 
they would be rather application-specific.
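
To make the push-style traversal concrete, here is a minimal, self-contained sketch of the pattern (not the Lucene implementation): a TreeMap stands in for the FST and a Predicate for the union of Levenshtein automata, and every hit is pushed to a callback instead of being collected into a list, so the output bytes can be parsed in place and no per-match objects have to outlive the call. All names here (PushIntersect, HitConsumer) are illustrative.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

// Sketch of the push-style intersection described above. A TreeMap stands
// in for the FST and a Predicate for the automaton union; both are
// placeholders, not Lucene API.
final class PushIntersect {

    interface HitConsumer {
        // 'output' is only valid for the duration of the call; the consumer
        // must parse or copy it eagerly if it needs to keep it.
        void onHit(String term, byte[] output);
    }

    static int intersect(Map<String, byte[]> fst,
                         Predicate<String> automaton,
                         HitConsumer sink) {
        int hits = 0;
        for (Map.Entry<String, byte[]> e : fst.entrySet()) {
            if (automaton.test(e.getKey())) {
                sink.onHit(e.getKey(), e.getValue()); // push, don't collect
                hits++;
            }
        }
        return hits;
    }
}
```

A caller would parse each pushed output into its value object inside the callback, e.g. `PushIntersect.intersect(fst, t -> t.startsWith("ca"), (term, out) -> parse(out));`, which is the essence of why no list of paths (and no per-path byte[]) needs to be retained.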

Does this help and answer all the questions so far?

Btw: experimenting a little with the change, I noticed that things may be 
slightly more complicated, since the output of a node is often overwritten 
with "NO_OUTPUT" from the Outputs. The read method would therefore need to 
recycle the current output as well if possible, which may have interesting 
side effects, but hopefully that should be manageable.
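
The NO_OUTPUT wrinkle can be sketched as follows. This is a hedged, illustrative sketch of what a recycling read(in, reuse) might look like, not Lucene code: 'Output' stands in for BytesRef, the length-prefixed encoding is made up, and the key point is that because the Outputs share a single NO_OUTPUT sentinel, the recycling path must never hand that sentinel out as a scratch buffer.

```java
// Minimal sketch of a recycling read(in, reuse) with a NO_OUTPUT guard.
// 'Output' and the length-prefixed encoding are illustrative stand-ins
// for BytesRef and the real FST output encoding.
final class RecyclingOutputs {

    static final class Output {
        byte[] bytes;
        int length;
        Output(int capacity) { bytes = new byte[capacity]; }
    }

    // Shared empty sentinel, analogous to Outputs#getNoOutput(); it must
    // stay immutable because many arcs point at this one instance.
    static final Output NO_OUTPUT = new Output(0);

    // Decode a length-prefixed byte sequence from in[pos...], recycling
    // 'reuse' unless it is null or the shared sentinel.
    static Output read(byte[] in, int pos, Output reuse) {
        int len = in[pos] & 0xFF;
        if (len == 0) {
            return NO_OUTPUT;             // never allocate for "no output"
        }
        Output dest = (reuse == null || reuse == NO_OUTPUT)
                ? new Output(len)         // sentinel may not be overwritten
                : reuse;                  // safe to reuse the caller's scratch
        if (dest.bytes.length < len) {
            dest.bytes = new byte[len];   // grow the recycled buffer
        }
        System.arraycopy(in, pos + 1, dest.bytes, 0, len);
        dest.length = len;
        return dest;
    }
}
```

The identity check `reuse == NO_OUTPUT` is what keeps "recycle the current output" from corrupting the shared sentinel when an arc's output was previously overwritten with it.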

> Allow FST read method to also recycle the output value when traversing FST
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-5584
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5584
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>    Affects Versions: 4.7.1
>            Reporter: Christian Ziech
>
> The FST class heavily reuses Arc instances when traversing the FST. The 
> output of an Arc however is not reused. This can especially be important when 
> traversing large portions of a FST and using the ByteSequenceOutputs and 
> CharSequenceOutputs. Those classes create a new byte[] or char[] for every 
> node read (which has an output).
> In our use case we intersect a Lucene Automaton with an FST<BytesRef> much 
> like it is done in 
> org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths(), 
> and since the Automaton and the FST are both rather large, tens or even 
> hundreds of thousands of temporary byte array objects are created.
> One possible solution to the problem would be to change the 
> org.apache.lucene.util.fst.Outputs class to have two additional methods (if 
> you don't want to change the existing methods for compatibility):
> {code}
>   /** Decode an output value previously written with {@link
>    *  #write(Object, DataOutput)} reusing the object passed in if possible */
>   public abstract T read(DataInput in, T reuse) throws IOException;
>   /** Decode an output value previously written with {@link
>    *  #writeFinalOutput(Object, DataOutput)}, reusing the object
>    *  passed in if possible.  By default this just calls
>    *  {@link #read(DataInput, Object)}. */
>   public T readFinalOutput(DataInput in, T reuse) throws IOException {
>     return read(in, reuse);
>   }
> {code}
> The new methods could then be used in the FST in the readNextRealArc() method 
> passing in the output of the reused Arc. For most inputs they could even just 
> invoke the original read(in) method.
> If you should decide to make that change I'd be happy to supply a patch 
> and/or tests for the feature.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
