[jira] Issue Comment Edited: (LUCENE-2886) Adaptive Frame Of Reference
[ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12990509#comment-12990509 ]

Renaud Delbru edited comment on LUCENE-2886 at 2/4/11 10:42 AM:

Hi Michael, Robert,

great to hear that the code is useful, looking forward to seeing some benchmarks.

I think the VarIntBlock approach is a good idea. Concerning the two unused frame codes, it will not cost too much to add them. This might be useful for the frequency inverted lists. However, I am not sure they will be used that much. In our experiments, we had a version of AFOR allowing frames of size 8, 16 and 32 integers with allOnes and allZeros. The gain was minimal, on the order of 0.x% index size reduction, because these cases occurred very rarely. But this is still better than nothing.

However, in the case of simple64, we are not talking about small frames (up to 32 integers), but about frames of 120 to 240 integers. Therefore, I expect the probability of encountering 120 or 240 consecutive ones to drop.

Maybe we can use them for more clever configurations such as
- interleaved sequences of 1-bit and 2-bit integers
- interleaved sequences of 2-bit and 3-bit integers
or something like this.

The best would be to run some tests to see which new configurations make sense, e.g. how many times an allOnes config is selected, or other configs, and then choose which ones to add. But this can be a tedious task with only a limited benefit.

Adaptive Frame Of Reference

Key: LUCENE-2886
URL: https://issues.apache.org/jira/browse/LUCENE-2886
Project: Lucene - Java
Issue Type: New Feature
Components: Codecs
Reporter: Renaud Delbru
Fix For: 4.0
Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz

We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch. I am providing the source code of its implementation. Some work needs to be done, as this implementation works against the old lucene-1458 branch. I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (a Simple-family codec working on 64-bit words) that were used in the experiments in [1].

[1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

--
This message is automatically generated by JIRA.
- For more information on JIRA, see: http://www.atlassian.com/software/jira
- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
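The test suggested in the comment above (how many times an allOnes or allZeros config would be selected) could be sketched roughly as follows. This is only an illustration, not code from the attached patches: the class name `FrameConfigStats` and the method `countUniformFrames` are hypothetical, and the sketch assumes fixed, aligned, non-overlapping frames, whereas AFOR chooses frame sizes adaptively.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the LUCENE-2886 patches.
final class FrameConfigStats {

    /**
     * Counts aligned, non-overlapping frames of frameSize values in which
     * every value equals target (target = 1 for allOnes, 0 for allZeros).
     * A trailing partial frame is ignored.
     */
    public static int countUniformFrames(int[] values, int frameSize, int target) {
        int count = 0;
        for (int start = 0; start + frameSize <= values.length; start += frameSize) {
            boolean uniform = true;
            for (int i = start; i < start + frameSize; i++) {
                if (values[i] != target) {
                    uniform = false;
                    break;
                }
            }
            if (uniform) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up frequency list: 64 ones with a single larger value.
        int[] freqs = new int[64];
        Arrays.fill(freqs, 1);
        freqs[40] = 3;
        for (int size : new int[] {8, 16, 32}) {
            System.out.println("frame size " + size + ": "
                + countUniformFrames(freqs, size, 1) + " allOnes frames");
        }
    }
}
```

With this made-up input, a single non-one value spoils one frame of 8 but also one whole frame of 32, which is the same effect that makes 120 or 240 consecutive ones unlikely in the simple64 case.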
[jira] Issue Comment Edited: (LUCENE-2886) Adaptive Frame Of Reference
[ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12990538#comment-12990538 ]

Renaud Delbru edited comment on LUCENE-2886 at 2/4/11 12:05 PM:

Just an additional comment on semi-structured data indexing. AFOR-2 and AFOR-3 (AFOR-3 refers to AFOR-2 with a special code for allOnes frames) were able to beat Rice on two datasets, and S-64 on one (but it was very close to Rice on the others):

DBpedia dataset (structured version of Wikipedia):
||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.246|0.043|0.141|0.065|0.180|0.816|
|AFOR-2|0.229|0.039|0.132|0.059|0.167|0.758|
|AFOR-3|0.229|0.031|0.131|0.054|0.159|0.736|
|FOR|0.315|0.061|0.170|0.117|0.216|1.049|
|PFOR|0.317|0.044|0.155|0.070|0.205|0.946|
|Rice|0.240|0.029|0.115|0.057|0.152|0.708|
|S-64|0.249|0.041|0.133|0.062|0.171|0.791|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Geonames dataset:
||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.129|0.023|0.058|0.025|0.025|0.318|
|AFOR-2|0.123|0.023|0.057|0.024|0.024|0.307|
|AFOR-3|0.114|0.006|0.056|0.016|0.008|0.256|
|FOR|0.150|0.021|0.065|0.025|0.023|0.349|
|PFOR|0.154|0.019|0.057|0.022|0.023|0.332|
|Rice|0.133|0.019|0.063|0.029|0.021|0.327|
|S-64|0.147|0.021|0.058|0.023|0.023|0.329|
|VByte|0.216|0.142|0.143|0.143|0.143|0.929|

Sindice dataset (very heterogeneous dataset containing hundreds of thousands of web datasets):
||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|2.578|0.395|0.942|0.665|1.014|6.537|
|AFOR-2|2.361|0.380|0.908|0.619|0.906|6.082|
|AFOR-3|2.297|0.176|0.876|0.530|0.722|5.475|
|FOR|3.506|0.506|1.121|0.916|1.440|8.611|
|PFOR|3.221|0.374|1.153|0.795|1.227|7.924|
|Rice|2.721|0.314|0.958|0.714|0.941|6.605|
|S-64|2.581|0.370|0.917|0.621|0.908|6.313|
|VByte|3.287|2.106|2.411|2.430|2.488|15.132|

Here, Ent refers to the entity id (similar to the doc id); Att and Val are structural node ids.
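In all three tables the Frq column shows the largest relative drop from AFOR-2 to AFOR-3, which would be consistent with frequency lists containing many runs of ones. A rough sketch of why a dedicated allOnes frame code helps is given below. It is not the actual AFOR encoder: the class `AllOnesFrameCost` and its methods are hypothetical, and the cost model (one header byte per frame followed by bit-packed values at the width of the largest value) is a simplifying assumption.

```java
// Hypothetical cost model, not the encoder from lucene-afor.tar.gz.
final class AllOnesFrameCost {

    /** Bit width needed for the largest value in the frame (at least 1). Assumes non-negative values. */
    public static int bitsRequired(int[] frame) {
        int or = 0;
        for (int v : frame) {
            or |= v; // the OR of all values has the same highest set bit as the maximum
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(or));
    }

    /** Assumed cost without a special code: 1 header byte + bit-packed payload, rounded up to bytes. */
    public static int plainBytes(int[] frame) {
        return 1 + (frame.length * bitsRequired(frame) + 7) / 8;
    }

    /** Assumed cost when an allOnes frame code exists: the header byte alone if every value is 1. */
    public static int bytesWithAllOnesCode(int[] frame) {
        for (int v : frame) {
            if (v != 1) {
                return plainBytes(frame);
            }
        }
        return 1;
    }
}
```

Under these assumptions a frame of 32 ones shrinks from 5 bytes to 1, while any frame containing a value other than 1 costs the same as before, so the overall gain depends entirely on how often such frames occur.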