[
https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969523#comment-13969523
]
Karl Wright edited comment on LUCENE-5584 at 4/15/14 1:12 PM:
--------------------------------------------------------------
Just to answer Robert's comment clearly...
{quote}
But this is the right thing to do. you can compress it however you want, you
can move it to disk (since its like "stored fields" for your top-N), you can do
all kinds of things with it.
{quote}
Our requirement is to be able to track complex arc information (kept now as
values) that corresponds to the text "keys". The problem is how to achieve the
same common-prefix compression as we get out of the box using BytesRef
instances as values, but also still meet our requirement that we be able to
properly assemble the complex information, whether stored directly as values,
or stored in an array indexed by a Long. With an FST<BytesRef>, Lucene stores
the common prefix of all child node values with the parent node, which allows
for complete reconstruction of the value sequence. But with an FST<Long>,
Lucene stores the Math.min of the child node values with the parent node, which
cannot be unique and thus does not permit the complex information to be
reconstructed, unless we are missing something.
What we have right now is a workaround, which does not use common-prefix
compression because it can't. This costs us some 2GB of memory in our use
case, and performance loss on the order of 3-5%. If you have a proposal to use
FST<Long> in a manner that meets our constraints and also allows common-prefix
compression, please let us know what that may be.
was (Author: [email protected]):
Just to answer Robert's comment clearly...
{quote}
But this is the right thing to do. you can compress it however you want, you
can move it to disk (since its like "stored fields" for your top-N), you can do
all kinds of things with it.
{quote}
Our requirement is to be able to track complex information (kept now as values)
that corresponds to the text "keys". The problem is how to achieve the same
common-prefix compression as we get out of the box using BytesRef instances as
values, but also still meet our requirement that we be able to properly
assemble the complex information, whether stored directly as values, or stored
in an array indexed by a Long. With an FST<BytesRef>, Lucene stores the common
prefix of all child node values with the parent node, which allows for complete
reconstruction of the value sequence. But with an FST<Long>, Lucene stores the
Math.min of the child node values with the parent node, which cannot be unique
and thus does not permit the complex information to be reconstructed, unless we
are missing something.
What we have right now is a workaround, which does not use common-prefix
compression because it can't. This costs us some 2GB of memory in our use
case, and performance loss on the order of 3-5%. If you have a proposal to use
FST<Long> in a manner that meets our constraints and also allows common-prefix
compression, please let us know what that may be.
> Allow FST read method to also recycle the output value when traversing FST
> --------------------------------------------------------------------------
>
> Key: LUCENE-5584
> URL: https://issues.apache.org/jira/browse/LUCENE-5584
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/FSTs
> Affects Versions: 4.7.1
> Reporter: Christian Ziech
> Attachments: fst-itersect-benchmark.tgz
>
>
> The FST class heavily reuses Arc instances when traversing the FST. The
> output of an Arc however is not reused. This can especially be important when
> traversing large portions of a FST and using the ByteSequenceOutputs and
> CharSequenceOutputs. Those classes create a new byte[] or char[] for every
> node read (which has an output).
> In our use case we intersect a lucene Automaton with a FST<BytesRef> much
> like it is done in
> org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths() and
> since the Automaton and the FST are both rather large tens or even hundreds
> of thousands of temporary byte array objects are created.
> One possible solution to the problem would be to change the
> org.apache.lucene.util.fst.Outputs class to have two additional methods (if
> you don't want to change the existing methods for compatibility):
> {code}
> /** Decode an output value previously written with {@link
> * #write(Object, DataOutput)} reusing the object passed in if possible */
> public abstract T read(DataInput in, T reuse) throws IOException;
> /** Decode an output value previously written with {@link
> * #writeFinalOutput(Object, DataOutput)}. By default this
> * just calls {@link #read(DataInput)}. This tries to reuse the object
> * passed in if possible */
> public T readFinalOutput(DataInput in, T reuse) throws IOException {
> return read(in, reuse);
> }
> {code}
> The new methods could then be used in the FST in the readNextRealArc() method
> passing in the output of the reused Arc. For most inputs they could even just
> invoke the original read(in) method.
> If you should decide to make that change I'd be happy to supply a patch
> and/or tests for the feature.
--
This message was sent by Atlassian JIRA
(v6.2#6252)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]