[jira] [Comment Edited] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13969523#comment-13969523 ] Karl Wright edited comment on LUCENE-5584 at 4/15/14 1:12 PM: -- Just to answer Robert's comment clearly... {quote} But this is the right thing to do. you can compress it however you want, you can move it to disk (since its like stored fields for your top-N), you can do all kinds of things with it. {quote} Our requirement is to be able to track complex arc information (kept now as values) that corresponds to the text keys. The problem is how to achieve the same common-prefix compression as we get out of the box using BytesRef instances as values, but also still meet our requirement that we be able to properly assemble the complex information, whether stored directly as values, or stored in an array indexed by a Long. With an FSTBytesRef, Lucene stores the common prefix of all child node values with the parent node, which allows for complete reconstruction of the value sequence. But with an FSTLong, Lucene stores the Math.min of the child node values with the parent node, which cannot be unique and thus does not permit the complex information to be reconstructed, unless we are missing something. What we have right now is a workaround, which does not use common-prefix compression because it can't. This costs us some 2GB of memory in our use case, and performance loss on the order of 3-5%. If you have a proposal to use FSTLong in a manner that meets our constraints and also allows common-prefix compression, please let us know what that may be. was (Author: kwri...@metacarta.com): Just to answer Robert's comment clearly... {quote} But this is the right thing to do. you can compress it however you want, you can move it to disk (since its like stored fields for your top-N), you can do all kinds of things with it. {quote} Our requirement is to be able to track complex information (kept now as values) that corresponds to the text keys. The problem is how to achieve the same common-prefix compression as we get out of the box using BytesRef instances as values, but also still meet our requirement that we be able to properly assemble the complex information, whether stored directly as values, or stored in an array indexed by a Long. With an FSTBytesRef, Lucene stores the common prefix of all child node values with the parent node, which allows for complete reconstruction of the value sequence. But with an FSTLong, Lucene stores the Math.min of the child node values with the parent node, which cannot be unique and thus does not permit the complex information to be reconstructed, unless we are missing something. What we have right now is a workaround, which does not use common-prefix compression because it can't. This costs us some 2GB of memory in our use case, and performance loss on the order of 3-5%. If you have a proposal to use FSTLong in a manner that meets our constraints and also allows common-prefix compression, please let us know what that may be. Allow FST read method to also recycle the output value when traversing FST -- Key: LUCENE-5584 URL: https://issues.apache.org/jira/browse/LUCENE-5584 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Affects Versions: 4.7.1 Reporter: Christian Ziech Attachments: fst-itersect-benchmark.tgz The FST class heavily reuses Arc instances when traversing the FST. The output of an Arc however is not reused. This can especially be important when traversing large portions of a FST and using the ByteSequenceOutputs and CharSequenceOutputs. Those classes create a new byte[] or char[] for every node read (which has an output). In our use case we intersect a lucene Automaton with a FSTBytesRef much like it is done in org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths() and since the Automaton and the FST are both rather large tens or even hundreds of thousands of temporary byte array objects are created. One possible solution to the problem would be to change the org.apache.lucene.util.fst.Outputs class to have two additional methods (if you don't want to change the existing methods for compatibility): {code} /** Decode an output value previously written with {@link * #write(Object, DataOutput)} reusing the object passed in if possible */ public abstract T read(DataInput in, T reuse) throws IOException; /** Decode an output value previously written with {@link * #writeFinalOutput(Object, DataOutput)}. By default this * just calls {@link #read(DataInput)}. This tries to reuse the object * passed in if possible */ public T
[jira] [Comment Edited] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13967752#comment-13967752 ] Christian Ziech edited comment on LUCENE-5584 at 4/13/14 7:04 AM: -- {quote} If this is the case, then I am not sure you are using the correct datastructure: it seems to me that a byte sequence output is not appropriate. Since you do not care about the intermediate outputs, but have a complicated intersection with the FST, why not use a numeric output, pointing to the result data somewhere else? {quote} That is what we do right now. This however has the downside that we loose the prefix compression capability of the FST for the FST values which is significant in our case. The single FST with the values attached was roughly 1.2G large and now with the referenced byte arrays (we load them into a DirectByteBuffer) we spend about 2.5G for the values alone. Of course we could try to implement the same prefix compression as the FST does on our own and fill a byte array while traversing the FST but that feels like copying something that is already almost there. If we could just get the extension points I mentioned into Lucene without actually changing the actual behavior of (most or any) of lucenes code that would be a huge help. Edit: Also with numeric outputs we still suffer from quite a few unwanted long references that are created temporarily by the VM just as the byte arrays were before. This problem is far less severe and actually manageable though. was (Author: christianz): {quote} If this is the case, then I am not sure you are using the correct datastructure: it seems to me that a byte sequence output is not appropriate. Since you do not care about the intermediate outputs, but have a complicated intersection with the FST, why not use a numeric output, pointing to the result data somewhere else? {quote} That is what we do right now. This however has the downside that we loose the prefix compression capability of the FST for the FST values which is significant in our case. The single FST with the values attached was roughly 1.2G large and now with the referenced byte arrays (we load them into a DirectByteBuffer) we spend about 2.5G for the values alone. Of course we could try to implement the same prefix compression as the FST does on our own and fill a byte array while traversing the FST but that feels like copying something that is already almost there. If we could just get the extension points I mentioned into Lucene without actually changing the actual behavior of (most or any) of lucenes code that would be a huge help. Allow FST read method to also recycle the output value when traversing FST -- Key: LUCENE-5584 URL: https://issues.apache.org/jira/browse/LUCENE-5584 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Affects Versions: 4.7.1 Reporter: Christian Ziech The FST class heavily reuses Arc instances when traversing the FST. The output of an Arc however is not reused. This can especially be important when traversing large portions of a FST and using the ByteSequenceOutputs and CharSequenceOutputs. Those classes create a new byte[] or char[] for every node read (which has an output). In our use case we intersect a lucene Automaton with a FSTBytesRef much like it is done in org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths() and since the Automaton and the FST are both rather large tens or even hundreds of thousands of temporary byte array objects are created. One possible solution to the problem would be to change the org.apache.lucene.util.fst.Outputs class to have two additional methods (if you don't want to change the existing methods for compatibility): {code} /** Decode an output value previously written with {@link * #write(Object, DataOutput)} reusing the object passed in if possible */ public abstract T read(DataInput in, T reuse) throws IOException; /** Decode an output value previously written with {@link * #writeFinalOutput(Object, DataOutput)}. By default this * just calls {@link #read(DataInput)}. This tries to reuse the object * passed in if possible */ public T readFinalOutput(DataInput in, T reuse) throws IOException { return read(in, reuse); } {code} The new methods could then be used in the FST in the readNextRealArc() method passing in the output of the reused Arc. For most inputs they could even just invoke the original read(in) method. If you should decide to make that change I'd be happy to supply a patch and/or tests for the feature. -- This message was sent by Atlassian JIRA (v6.2#6252)