[ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990538#comment-12990538 ]
Renaud Delbru edited comment on LUCENE-2886 at 2/4/11 12:05 PM:
----------------------------------------------------------------

Just an additional comment on semi-structured data indexing. AFOR-2 and AFOR-3 (AFOR-3 refers to AFOR-2 with special code for allOnes frames) were able to beat Rice on two datasets, and S-64 on one (but they were very close to Rice on the others):

DBpedia dataset (structured version of Wikipedia):
||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.246|0.043|0.141|0.065|0.180|0.816|
|AFOR-2|0.229|0.039|0.132|0.059|0.167|0.758|
|AFOR-3|0.229|0.031|0.131|0.054|0.159|0.736|
|FOR|0.315|0.061|0.170|0.117|0.216|1.049|
|PFOR|0.317|0.044|0.155|0.070|0.205|0.946|
|Rice|0.240|0.029|0.115|0.057|0.152|0.708|
|S-64|0.249|0.041|0.133|0.062|0.171|0.791|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Geonames dataset:
||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.129|0.023|0.058|0.025|0.025|0.318|
|AFOR-2|0.123|0.023|0.057|0.024|0.024|0.307|
|AFOR-3|0.114|0.006|0.056|0.016|0.008|0.256|
|FOR|0.150|0.021|0.065|0.025|0.023|0.349|
|PFOR|0.154|0.019|0.057|0.022|0.023|0.332|
|Rice|0.133|0.019|0.063|0.029|0.021|0.327|
|S-64|0.147|0.021|0.058|0.023|0.023|0.329|
|VByte|0.216|0.142|0.143|0.143|0.143|0.929|

Sindice dataset (a very heterogeneous dataset containing hundreds of thousands of web datasets):
||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|2.578|0.395|0.942|0.665|1.014|6.537|
|AFOR-2|2.361|0.380|0.908|0.619|0.906|6.082|
|AFOR-3|2.297|0.176|0.876|0.530|0.722|5.475|
|FOR|3.506|0.506|1.121|0.916|1.440|8.611|
|PFOR|3.221|0.374|1.153|0.795|1.227|7.924|
|Rice|2.721|0.314|0.958|0.714|0.941|6.605|
|S-64|2.581|0.370|0.917|0.621|0.908|6.313|
|VByte|3.287|2.106|2.411|2.430|2.488|15.132|

Here, Ent refers to the entity id (similar to a doc id); Att and Val are structural node ids.

was (Author: renaud.delbru):
Just an additional comment on semi-structured data indexing.
AFOR-2 and AFOR-3 (AFOR-3 refers to AFOR-2 with special code for allOnes frames) were able to beat Rice on two datasets, and S-64 on one (but they were very close to Rice on the others):

DBpedia dataset (structured version of Wikipedia):
||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.246|0.043|0.141|0.065|0.180|0.816|
|AFOR-2|0.229|0.039|0.132|0.059|0.167|0.758|
|AFOR-3|0.229|0.031|0.131|0.054|0.159|0.736|
|FOR|0.315|0.061|0.170|0.117|0.216|1.049|
|PFOR|0.317|0.044|0.155|0.070|0.205|0.946|
|Rice|0.240|0.029|0.115|0.057|0.152|0.708|
|S-64|0.249|0.041|0.133|0.062|0.171|0.791|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Geonames dataset:
||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.129|0.023|0.058|0.025|0.025|0.318|
|AFOR-2|0.123|0.023|0.057|0.024|0.024|0.307|
|AFOR-3|0.114|0.006|0.056|0.016|0.008|0.256|
|FOR|0.150|0.021|0.065|0.025|0.023|0.349|
|PFOR|0.154|0.019|0.057|0.022|0.023|0.332|
|Rice|0.133|0.019|0.063|0.029|0.021|0.327|
|S-64|0.147|0.021|0.058|0.023|0.023|0.329|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Sindice dataset (a very heterogeneous dataset containing hundreds of thousands of web datasets):
||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|2.578|0.395|0.942|0.665|1.014|6.537|
|AFOR-2|2.361|0.380|0.908|0.619|0.906|6.082|
|AFOR-3|2.297|0.176|0.876|0.530|0.722|5.475|
|FOR|3.506|0.506|1.121|0.916|1.440|8.611|
|PFOR|3.221|0.374|1.153|0.795|1.227|7.924|
|Rice|2.721|0.314|0.958|0.714|0.941|6.605|
|S-64|2.581|0.370|0.917|0.621|0.908|6.313|
|VByte|3.287|2.106|2.411|2.430|2.488|15.132|

Here, Ent refers to the entity id (similar to a doc id); Att and Val are structural node ids.
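To put the Total columns in perspective, the relative difference between AFOR-3 and Rice can be worked out directly from the figures quoted in the tables. The following is only a small arithmetic sketch: the class and method names are hypothetical, and the only inputs are the AFOR-3 and Rice totals copied from the tables.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class IndexSizeComparison {

    /** Percentage by which candidate is smaller than baseline (negative if it is larger). */
    static double percentSmaller(double candidate, double baseline) {
        return (baseline - candidate) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Total column, {AFOR-3, Rice}, copied from the tables in the comment.
        Map<String, double[]> totals = new LinkedHashMap<>();
        totals.put("DBpedia", new double[] {0.736, 0.708});
        totals.put("Geonames", new double[] {0.256, 0.327});
        totals.put("Sindice", new double[] {5.475, 6.605});

        for (Map.Entry<String, double[]> e : totals.entrySet()) {
            double afor3 = e.getValue()[0];
            double rice = e.getValue()[1];
            System.out.printf("%s: size reduction of AFOR-3 vs Rice = %.1f%%%n",
                    e.getKey(), percentSmaller(afor3, rice));
        }
    }
}
```

With these inputs, AFOR-3 comes out roughly 4% larger than Rice on DBpedia, and roughly 22% and 17% smaller on Geonames and Sindice respectively.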
> Adaptive Frame Of Reference
> ----------------------------
>
> Key: LUCENE-2886
> URL: https://issues.apache.org/jira/browse/LUCENE-2886
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Codecs
> Reporter: Renaud Delbru
> Fix For: 4.0
>
> Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation works on the old lucene-1458 branch.
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (a Simple-family codec working on 64-bit words) that were used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf
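
For readers unfamiliar with the idea behind the "allOnes frames" mentioned in the comment, here is a minimal sketch of why such a special case can save space in a frame-of-reference style encoder. It is not the code from the attached tarball: the class and method names are hypothetical, per-frame header costs are ignored, and it assumes a frame is packed using the bit width of its largest value.

```java
final class FrameBits {

    /** Number of bits needed to store a non-negative value; 0 still takes one bit here. */
    static int bitsRequired(int v) {
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(v));
    }

    /** Payload bits for one frame when every value uses the width of the largest value. */
    static int packedBits(int[] frame) {
        int width = 1;
        for (int v : frame) {
            width = Math.max(width, bitsRequired(v));
        }
        return frame.length * width;
    }

    /**
     * Same as packedBits, except that a frame consisting only of 1s carries no
     * payload at all (assumption: the frame header alone identifies it).
     */
    static int packedBitsWithAllOnes(int[] frame) {
        for (int v : frame) {
            if (v != 1) {
                return packedBits(frame);
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        int[] ones = {1, 1, 1, 1, 1, 1, 1, 1};
        int[] mixed = {3, 1, 7, 2};
        System.out.println(packedBits(ones) + " vs " + packedBitsWithAllOnes(ones));
        System.out.println(packedBits(mixed) + " vs " + packedBitsWithAllOnes(mixed));
    }
}
```

Under these assumptions, a frame of eight 1s costs 8 payload bits with plain packing and none with the special case, while a mixed frame is unaffected. Streams with long runs of 1s would benefit most from such a case, which may be why the Frq and Pos columns shrink the most for AFOR-3 in the tables.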