[jira] [Comment Edited] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST

Karl Wright (JIRA) Tue, 15 Apr 2014 06:12:59 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969523#comment-13969523
 ]


Karl Wright edited comment on LUCENE-5584 at 4/15/14 1:12 PM:
--------------------------------------------------------------

Just to answer Robert's comment clearly...

{quote}
But this is the right thing to do. You can compress it however you want, you 
can move it to disk (since it's like "stored fields" for your top-N), you can do 
all kinds of things with it.
{quote}

Our requirement is to track complex arc information (currently kept as values) 
that corresponds to the text "keys".  The problem is how to get the same 
common-prefix compression that we get out of the box with BytesRef instances 
as values while still being able to reassemble the complex information 
correctly, whether it is stored directly as values or in an array indexed by a 
Long.  With an FST<BytesRef>, Lucene stores the common prefix of all child node 
values with the parent node, which allows complete reconstruction of the value 
sequence.  With an FST<Long>, however, Lucene stores the Math.min of the child 
node values with the parent node; that minimum is not unique and therefore 
does not permit the complex information to be reconstructed, unless we are 
missing something.
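
To make the two sharing rules above concrete, here is a minimal standalone 
sketch.  It does not use any Lucene classes: SharedOutputSketch and its helper 
methods are hypothetical names invented for illustration, and the values are 
made up.  It only shows what gets factored out to the parent in each case (a 
shared byte prefix versus a numeric minimum), not how Lucene implements it.

```java
import java.util.Arrays;

/** Standalone illustration only; names are hypothetical, not Lucene API. */
public class SharedOutputSketch {

    /** Longest common prefix of two byte sequences (the BytesRef-style shared part). */
    public static byte[] commonPrefix(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        int i = 0;
        while (i < n && a[i] == b[i]) {
            i++;
        }
        return Arrays.copyOf(a, i);
    }

    /** What stays with the child once the shared prefix has moved to the parent. */
    public static byte[] suffix(byte[] value, int prefixLength) {
        return Arrays.copyOfRange(value, prefixLength, value.length);
    }

    /** Reassembles a value by appending the child's remainder to the parent's prefix. */
    public static byte[] concat(byte[] prefix, byte[] rest) {
        byte[] out = Arrays.copyOf(prefix, prefix.length + rest.length);
        System.arraycopy(rest, 0, out, prefix.length, rest.length);
        return out;
    }

    public static void main(String[] args) {
        // Byte-sequence values: the parent keeps the shared prefix.
        byte[] v1 = {1, 2, 3, 4};
        byte[] v2 = {1, 2, 9};
        byte[] shared = commonPrefix(v1, v2);
        System.out.println("shared prefix: " + Arrays.toString(shared));
        System.out.println("child 1 rest:  " + Arrays.toString(suffix(v1, shared.length)));
        System.out.println("child 2 rest:  " + Arrays.toString(suffix(v2, shared.length)));

        // Long values: the parent keeps the minimum, each child keeps a difference.
        long a = 40L;
        long b = 25L;
        long sharedLong = Math.min(a, b);
        System.out.println("shared min:    " + sharedLong);
        System.out.println("child 1 rest:  " + (a - sharedLong));
        System.out.println("child 2 rest:  " + (b - sharedLong));
    }
}
```

In the byte case the shared part is itself a piece of the stored payload; in 
the long case it is only a number, so any payload kept in a side array indexed 
by that number gets no prefix sharing.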

What we have right now is a workaround that cannot use common-prefix 
compression.  In our use case this costs us some 2GB of memory and a 
performance loss on the order of 3-5%.  If you have a proposal for using 
FST<Long> in a way that meets our constraints and still allows common-prefix 
compression, please let us know what it is.

> Allow FST read method to also recycle the output value when traversing FST
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-5584
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5584
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>    Affects Versions: 4.7.1
>            Reporter: Christian Ziech
>         Attachments: fst-itersect-benchmark.tgz
>
>
> The FST class heavily reuses Arc instances when traversing the FST. The 
> output of an Arc, however, is not reused. This matters especially when 
> traversing large portions of an FST with ByteSequenceOutputs or 
> CharSequenceOutputs. Those classes create a new byte[] or char[] for every 
> node read that has an output.
> In our use case we intersect a Lucene Automaton with an FST<BytesRef>, much 
> as is done in 
> org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths(), and 
> since the Automaton and the FST are both rather large, tens or even hundreds 
> of thousands of temporary byte array objects are created.
> One possible solution would be to add two methods to the 
> org.apache.lucene.util.fst.Outputs class (if 
> you don't want to change the existing methods for compatibility):
> {code}
>   /** Decode an output value previously written with {@link
>    *  #write(Object, DataOutput)}, reusing the object passed in if possible. */
>   public abstract T read(DataInput in, T reuse) throws IOException;
>   /** Decode an output value previously written with {@link
>    *  #writeFinalOutput(Object, DataOutput)}.  By default this
>    *  just calls {@link #read(DataInput, Object)}, which tries to reuse
>    *  the object passed in if possible. */
>   public T readFinalOutput(DataInput in, T reuse) throws IOException {
>     return read(in, reuse);
>   }
> {code}
> The new methods could then be used in the FST's readNextRealArc() method, 
> passing in the output of the reused Arc. For most Outputs implementations 
> they could even just invoke the original read(in) method.
> If you decide to make that change, I'd be happy to supply a patch 
> and/or tests for the feature.
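
The read(in, reuse) idea from the description can be sketched in isolation as 
follows.  This is not a patch and not Lucene code: ReusableReadSketch and its 
Buffer class are hypothetical stand-ins for a byte-sequence Outputs 
implementation and for BytesRef, java.io streams replace Lucene's 
DataInput/DataOutput, and a fixed-width int length prefix is assumed for 
simplicity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Standalone illustration only; names are hypothetical, not Lucene API. */
public class ReusableReadSketch {

    /** Hypothetical stand-in for BytesRef: a backing array plus a valid length. */
    public static final class Buffer {
        public byte[] bytes = new byte[0];
        public int length;
    }

    /** Writes a length-prefixed byte sequence (assumed encoding, for the sketch). */
    public static void write(byte[] value, DataOutputStream out) throws IOException {
        out.writeInt(value.length);
        out.write(value);
    }

    /** Decodes one value, reusing the passed-in buffer when possible. */
    public static Buffer read(DataInputStream in, Buffer reuse) throws IOException {
        int len = in.readInt();
        Buffer result = (reuse != null) ? reuse : new Buffer();
        if (result.bytes.length < len) {
            result.bytes = new byte[len]; // allocate only when the old array is too small
        }
        in.readFully(result.bytes, 0, len);
        result.length = len;
        return result;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(raw);
        write(new byte[] {1, 2, 3}, out);
        write(new byte[] {4, 5}, out);

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw.toByteArray()));
        Buffer first = read(in, null);   // no buffer yet, so one is allocated
        byte[] backing = first.bytes;
        Buffer second = read(in, first); // fits in the existing array, nothing new allocated
        System.out.println("same buffer reused: " + (second == first && second.bytes == backing));
    }
}
```

One consequence of this design is that a reused value is overwritten by the 
next read, so a caller that needs to keep an output beyond the current arc 
would have to copy it first.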


