[jira] Issue Comment Edited: (LUCENE-2886) Adaptive Frame Of Reference

2011-02-04 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990509#comment-12990509
 ] 

Renaud Delbru edited comment on LUCENE-2886 at 2/4/11 10:42 AM:


Hi Michael, Robert,
Great to hear that the code is useful; I am looking forward to seeing some 
benchmarks.
I think the VarIntBlock approach is a good idea. Concerning the two unused 
frame codes, it would not cost much to add them. They might be useful for 
the frequency inverted lists. However, I am not sure they would be used that 
much. In our experiments, we had a version of AFOR allowing frames of 8, 
16 and 32 integers with allOnes and allZeros. The gain was minimal, on the 
order of 0.x% index size reduction, because these cases occurred very 
rarely. Still, this is better than nothing. However, in the case of 
simple64, we are not talking about small frames (up to 32 integers), but about 
frames of 120 to 240 integers. Therefore, I expect the probability of 
encountering 120 or 240 consecutive ones to drop. Maybe we can use the codes 
for more clever configurations such as
- interleaved sequences of 1-bit and 2-bit integers
- interleaved sequences of 2-bit and 3-bit integers
or something like this.
The best approach would be to run some tests to see which new configurations 
make sense, e.g. how many times an allOnes config or another config is 
selected, and then choose which ones to add. But this can be a tedious task 
with only a limited benefit.
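
As a rough illustration of the kind of test described above, the following sketch counts how many fixed-size frames of a value list consist only of ones (or only of zeros). The class name FrameStats, the method countUniformFrames and the toy input are assumptions made for this example; they are not part of the attached patches. The sketch also aligns frames at fixed offsets, which is a simplification, since AFOR chooses its frame partitioning adaptively.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of the attached patches. */
final class FrameStats {

    /**
     * Counts non-overlapping, fixed-offset frames of frameSize consecutive
     * values that all equal target (1 for allOnes, 0 for allZeros).
     * A trailing partial frame is ignored.
     */
    static int countUniformFrames(int[] values, int frameSize, int target) {
        if (frameSize <= 0) {
            throw new IllegalArgumentException("frameSize must be positive");
        }
        int count = 0;
        for (int start = 0; start + frameSize <= values.length; start += frameSize) {
            boolean uniform = true;
            for (int i = start; i < start + frameSize; i++) {
                if (values[i] != target) {
                    uniform = false;
                    break;
                }
            }
            if (uniform) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Toy list standing in for a frequency inverted list (assumed data).
        int[] freqs = new int[64];
        Arrays.fill(freqs, 1);
        freqs[40] = 3; // a single larger value breaks up the run of ones

        for (int frameSize : new int[] {8, 16, 32}) {
            System.out.println("frame size " + frameSize + ": allOnes frames = "
                + countUniformFrames(freqs, frameSize, 1));
        }
    }
}
```

Running the same count with larger frame sizes (for example 120 or 240) on real frequency lists would show how quickly such uniform frames become rare.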

 Adaptive Frame Of Reference 
 

 Key: LUCENE-2886
 URL: https://issues.apache.org/jira/browse/LUCENE-2886
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Codecs
Reporter: Renaud Delbru
 Fix For: 4.0

 Attachments: LUCENE-2886_simple64.patch, 
 LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz


 We could test the implementation of the Adaptive Frame Of Reference [1] on 
 the lucene-4.0 branch.
 I am providing the source code of its implementation. Some work needs to be 
 done, as this implementation works against the old lucene-1458 branch. 
 I will attach a tarball containing a running version (with tests) of the AFOR 
 implementation, as well as the implementations of PFOR and of Simple64 
 (a Simple-family codec working on 64-bit words) that were used in the 
 experiments in [1].
 [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2886) Adaptive Frame Of Reference

2011-02-04 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990538#comment-12990538
 ] 

Renaud Delbru edited comment on LUCENE-2886 at 2/4/11 12:05 PM:


Just an additional comment on semi-structured data indexing. AFOR-2 and AFOR-3 
(AFOR-3 refers to AFOR-2 with a special code for allOnes frames) were able to 
beat Rice on two datasets, and S-64 on one (but they were very close to Rice on 
the others):

DBpedia dataset (structured version of Wikipedia):

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.246|0.043|0.141|0.065|0.180|0.816|
|AFOR-2|0.229|0.039|0.132|0.059|0.167|0.758|
|AFOR-3|0.229|0.031|0.131|0.054|0.159|0.736|
|FOR|0.315|0.061|0.170|0.117|0.216|1.049|
|PFOR|0.317|0.044|0.155|0.070|0.205|0.946|
|Rice|0.240|0.029|0.115|0.057|0.152|0.708|
|S-64|0.249|0.041|0.133|0.062|0.171|0.791|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Geonames Dataset: 

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.129|0.023|0.058|0.025|0.025|0.318|
|AFOR-2|0.123|0.023|0.057|0.024|0.024|0.307|
|AFOR-3|0.114|0.006|0.056|0.016|0.008|0.256|
|FOR|0.150|0.021|0.065|0.025|0.023|0.349|
|PFOR|0.154|0.019|0.057|0.022|0.023|0.332|
|Rice|0.133|0.019|0.063|0.029|0.021|0.327|
|S-64|0.147|0.021|0.058|0.023|0.023|0.329|
|VByte|0.216|0.142|0.143|0.143|0.143|0.929|

Sindice Dataset: a very heterogeneous dataset containing hundreds of thousands 
of web datasets

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|2.578|0.395|0.942|0.665|1.014|6.537|
|AFOR-2|2.361|0.380|0.908|0.619|0.906|6.082|
|AFOR-3|2.297|0.176|0.876|0.530|0.722|5.475|
|FOR|3.506|0.506|1.121|0.916|1.440|8.611|
|PFOR|3.221|0.374|1.153|0.795|1.227|7.924|
|Rice|2.721|0.314|0.958|0.714|0.941|6.605|
|S-64|2.581|0.370|0.917|0.621|0.908|6.313|
|VByte|3.287|2.106|2.411|2.430|2.488|15.132|

Here, Ent refers to the entity id (similar to a doc id); Att and Val are 
structural node ids.
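
To make the comparison easier to read, the relative difference between two methods can be computed from the Total column of the tables above. The sketch below does this for AFOR-3 against Rice; the class name SizeComparison and the method reductionPercent are assumptions made for this example, and the numbers are copied from the tables as given (the tables do not state a unit).

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; reads values from the Total column. */
final class SizeComparison {

    /**
     * Relative size reduction of candidate versus baseline, in percent.
     * A negative result means the candidate is larger than the baseline.
     */
    static double reductionPercent(double baseline, double candidate) {
        return (baseline - candidate) / baseline * 100.0;
    }

    public static void main(String[] args) {
        String[] datasets = {"DBpedia", "Geonames", "Sindice"};
        // Total column for Rice and AFOR-3, in the order of the datasets above.
        double[] rice = {0.708, 0.327, 6.605};
        double[] afor3 = {0.736, 0.256, 5.475};

        for (int i = 0; i < datasets.length; i++) {
            System.out.printf(Locale.ROOT, "%s: AFOR-3 vs Rice = %.1f%%%n",
                datasets[i], reductionPercent(rice[i], afor3[i]));
        }
    }
}
```

A positive percentage means AFOR-3 produced the smaller total on that dataset, a negative one means Rice did.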
