[jira] [Comment Edited] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST

2014-04-15 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13969523#comment-13969523
 ] 

Karl Wright edited comment on LUCENE-5584 at 4/15/14 1:12 PM:
--

Just to answer Robert's comment clearly...

{quote}
But this is the right thing to do. you can compress it however you want, you 
can move it to disk (since its like stored fields for your top-N), you can do 
all kinds of things with it.
{quote}

Our requirement is to be able to track complex arc information (kept now as 
values) that corresponds to the text keys.  The problem is how to achieve the 
same common-prefix compression as we get out of the box using BytesRef 
instances as values, but also still meet our requirement that we be able to 
properly assemble the complex information, whether stored directly as values, 
or stored in an array indexed by a Long.  With an FSTBytesRef, Lucene stores 
the common prefix of all child node values with the parent node, which allows 
for complete reconstruction of the value sequence.  But with an FSTLong, 
Lucene stores the Math.min of the child node values with the parent node, which 
cannot be unique and thus does not permit the complex information to be 
reconstructed, unless we are missing something.

What we have right now is a workaround, which does not use common-prefix 
compression because it can't.  This costs us some 2GB of memory in our use 
case, and performance loss on the order of 3-5%.  If you have a proposal to use 
FSTLong in a manner that meets our constraints and also allows common-prefix 
compression, please let us know what that may be.





was (Author: kwri...@metacarta.com):
Just to answer Robert's comment clearly...

{quote}
But this is the right thing to do. you can compress it however you want, you 
can move it to disk (since its like stored fields for your top-N), you can do 
all kinds of things with it.
{quote}

Our requirement is to be able to track complex information (kept now as values) 
that corresponds to the text keys.  The problem is how to achieve the same 
common-prefix compression as we get out of the box using BytesRef instances as 
values, but also still meet our requirement that we be able to properly 
assemble the complex information, whether stored directly as values, or stored 
in an array indexed by a Long.  With an FSTBytesRef, Lucene stores the common 
prefix of all child node values with the parent node, which allows for complete 
reconstruction of the value sequence.  But with an FSTLong, Lucene stores the 
Math.min of the child node values with the parent node, which cannot be unique 
and thus does not permit the complex information to be reconstructed, unless we 
are missing something.

What we have right now is a workaround, which does not use common-prefix 
compression because it can't.  This costs us some 2GB of memory in our use 
case, and performance loss on the order of 3-5%.  If you have a proposal to use 
FSTLong in a manner that meets our constraints and also allows common-prefix 
compression, please let us know what that may be.




 Allow FST read method to also recycle the output value when traversing FST
 --

 Key: LUCENE-5584
 URL: https://issues.apache.org/jira/browse/LUCENE-5584
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Affects Versions: 4.7.1
Reporter: Christian Ziech
 Attachments: fst-itersect-benchmark.tgz


 The FST class heavily reuses Arc instances when traversing the FST. The 
 output of an Arc however is not reused. This can especially be important when 
 traversing large portions of a FST and using the ByteSequenceOutputs and 
 CharSequenceOutputs. Those classes create a new byte[] or char[] for every 
 node read (which has an output).
 In our use case we intersect a lucene Automaton with a FSTBytesRef much 
 like it is done in 
 org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths() and 
 since the Automaton and the FST are both rather large tens or even hundreds 
 of thousands of temporary byte array objects are created.
 One possible solution to the problem would be to change the 
 org.apache.lucene.util.fst.Outputs class to have two additional methods (if 
 you don't want to change the existing methods for compatibility):
 {code}
   /** Decode an output value previously written with {@link
*  #write(Object, DataOutput)} reusing the object passed in if possible */
   public abstract T read(DataInput in, T reuse) throws IOException;
   /** Decode an output value previously written with {@link
*  #writeFinalOutput(Object, DataOutput)}.  By default this
*  just calls {@link #read(DataInput)}. This tries to  reuse the object   
*  passed in if possible */
   public T 

[jira] [Comment Edited] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST

2014-04-13 Thread Christian Ziech (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13967752#comment-13967752
 ] 

Christian Ziech edited comment on LUCENE-5584 at 4/13/14 7:04 AM:
--

{quote}
If this is the case, then I am not sure you are using the correct 
datastructure: it seems to me that a byte sequence output is not appropriate. 
Since you do not care about the intermediate outputs, but have a complicated 
intersection with the FST, why not use a numeric output, pointing to the result 
data somewhere else?
{quote}

That is what we do right now. This however has the downside that we loose the 
prefix compression capability of the FST for the FST values which is 
significant in our case. The single FST with the values attached was roughly 
1.2G large and now with the referenced byte arrays (we load them into a 
DirectByteBuffer) we spend about 2.5G for the values alone. Of course we could 
try to implement the same prefix compression as the FST does on our own and 
fill a byte array while traversing the FST but that feels like copying 
something that is already almost there. If we could just get the extension 
points I mentioned into Lucene without actually changing the actual behavior of 
(most or any) of lucenes code that would be a huge help.

Edit: Also with numeric outputs we still suffer from quite a few unwanted long 
references that are created temporarily by the VM just as the byte arrays were 
before. This problem is far less severe and actually manageable though.



was (Author: christianz):
{quote}
If this is the case, then I am not sure you are using the correct 
datastructure: it seems to me that a byte sequence output is not appropriate. 
Since you do not care about the intermediate outputs, but have a complicated 
intersection with the FST, why not use a numeric output, pointing to the result 
data somewhere else?
{quote}

That is what we do right now. This however has the downside that we loose the 
prefix compression capability of the FST for the FST values which is 
significant in our case. The single FST with the values attached was roughly 
1.2G large and now with the referenced byte arrays (we load them into a 
DirectByteBuffer) we spend about 2.5G for the values alone. Of course we could 
try to implement the same prefix compression as the FST does on our own and 
fill a byte array while traversing the FST but that feels like copying 
something that is already almost there. If we could just get the extension 
points I mentioned into Lucene without actually changing the actual behavior of 
(most or any) of lucenes code that would be a huge help.


 Allow FST read method to also recycle the output value when traversing FST
 --

 Key: LUCENE-5584
 URL: https://issues.apache.org/jira/browse/LUCENE-5584
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Affects Versions: 4.7.1
Reporter: Christian Ziech

 The FST class heavily reuses Arc instances when traversing the FST. The 
 output of an Arc however is not reused. This can especially be important when 
 traversing large portions of a FST and using the ByteSequenceOutputs and 
 CharSequenceOutputs. Those classes create a new byte[] or char[] for every 
 node read (which has an output).
 In our use case we intersect a lucene Automaton with a FSTBytesRef much 
 like it is done in 
 org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths() and 
 since the Automaton and the FST are both rather large tens or even hundreds 
 of thousands of temporary byte array objects are created.
 One possible solution to the problem would be to change the 
 org.apache.lucene.util.fst.Outputs class to have two additional methods (if 
 you don't want to change the existing methods for compatibility):
 {code}
   /** Decode an output value previously written with {@link
*  #write(Object, DataOutput)} reusing the object passed in if possible */
   public abstract T read(DataInput in, T reuse) throws IOException;
   /** Decode an output value previously written with {@link
*  #writeFinalOutput(Object, DataOutput)}.  By default this
*  just calls {@link #read(DataInput)}. This tries to  reuse the object   
*  passed in if possible */
   public T readFinalOutput(DataInput in, T reuse) throws IOException {
 return read(in, reuse);
   }
 {code}
 The new methods could then be used in the FST in the readNextRealArc() method 
 passing in the output of the reused Arc. For most inputs they could even just 
 invoke the original read(in) method.
 If you should decide to make that change I'd be happy to supply a patch 
 and/or tests for the feature.



--
This message was sent by Atlassian JIRA
(v6.2#6252)