[ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-4609:
---------------------------------------

    Attachment: LUCENE-4609.patch

Patch, w/ a "custom" (not using our PackedInts APIs) packed ints encoder/decoder. It only uses as many bytes as are necessary, and packs bpv & "leftoverBits" into a single byte header.

I tested on first 1M Wikipedia docs ... and performance is much worse than current default in trunk... admittedly it's not quite fair (trunk has specialized vInt/dGap decoder, but patch leaves dGap separate from packed int decode), and admittedly this decoder will be slower than the optimized oal.util.PackedInts ... but perf is so far off that I find it hard to believe PackedInts can match vInt even after optimizing.

Trunk gets these results:

{noformat}
    Task    QPS base    StdDev    QPS comp    StdDev    Pct diff
PKLookup      203.77    (1.8%)      202.25    (1.8%)    -0.7% ( -4% -  2%)
HighTerm       20.43    (1.8%)       20.53    (0.8%)     0.5% ( -2% -  3%)
 MedTerm       33.12    (1.7%)       33.30    (0.9%)     0.5% ( -2% -  3%)
 LowTerm       87.55    (3.0%)       88.59    (2.5%)     1.2% ( -4% -  6%)
{noformat}

Patch gets this:

{noformat}
    Task    QPS base    StdDev    QPS comp    StdDev    Pct diff
HighTerm       10.82    (3.6%)       10.69    (4.4%)    -1.2% ( -8% -  7%)
 MedTerm       19.33    (3.2%)       19.10    (4.0%)    -1.2% ( -8% -  6%)
 LowTerm       67.75    (2.8%)       67.11    (3.0%)    -0.9% ( -6% -  5%)
PKLookup      196.49    (1.0%)      196.24    (1.9%)    -0.1% ( -3% -  2%)
{noformat}

(NOTE: base/comp are the same in each run, so ignore the differences w/in each run (it's noise) and compare absolute across the two runs ... ie HighTerm gets ~20.43 QPS with trunk but ~10.82 with patch).

Also: trunk took ~63 MB for the DV files while patch took ~84 MB. Net/net I think postings compress better with PackedInts than facet ords (at least for these 9 facet fields I'm using in Wikipedia)...
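For readers without the patch at hand, here is a rough sketch of what a per-document packed encoder with a one-byte header might look like. It is not the code in LUCENE-4609.patch. The class and method names are hypothetical, and the header layout (bpv - 1 in the high five bits, leftoverBits in the low three) is an assumption; the message above only says that both values share a single byte. As described above, dGap encoding is a separate step from the bit packing.

```java
import java.util.Arrays;

// Hypothetical sketch, not the patch: dGap + fixed-width bit packing per doc.
final class PackedOrdsSketch {

  /** Encodes sorted, non-negative ordinals as dGaps packed at bpv bits each. */
  static byte[] encode(int[] sortedOrds) {
    int n = sortedOrds.length;
    if (n == 0) return new byte[0];
    int[] gaps = new int[n];
    int prev = 0, orAll = 0;
    for (int i = 0; i < n; i++) {
      gaps[i] = sortedOrds[i] - prev;   // dGap step, kept apart from packing
      prev = sortedOrds[i];
      orAll |= gaps[i];                 // OR has the same bit width as the max gap
    }
    int bpv = Math.max(1, 32 - Integer.numberOfLeadingZeros(orAll));
    long totalBits = (long) bpv * n;
    int numBytes = (int) ((totalBits + 7) / 8);           // only as many bytes as needed
    int leftoverBits = (int) (numBytes * 8L - totalBits); // unused bits in the last byte, 0..7
    byte[] out = new byte[1 + numBytes];
    out[0] = (byte) (((bpv - 1) << 3) | leftoverBits);    // assumed header layout
    int bitPos = 0;
    for (int gap : gaps) {
      for (int b = bpv - 1; b >= 0; b--) {                // most significant bit first
        if (((gap >>> b) & 1) != 0) {
          out[1 + (bitPos >>> 3)] |= (byte) (0x80 >>> (bitPos & 7));
        }
        bitPos++;
      }
    }
    return out;
  }

  /** Reverses encode(): reads the header, unpacks the gaps, and sums them back into ordinals. */
  static int[] decode(byte[] buf) {
    if (buf.length == 0) return new int[0];
    int header = buf[0] & 0xFF;
    int bpv = (header >>> 3) + 1;
    int leftoverBits = header & 7;
    // leftoverBits is what lets the decoder recover the value count when bpv < 8.
    int count = ((buf.length - 1) * 8 - leftoverBits) / bpv;
    int[] ords = new int[count];
    int bitPos = 0, prev = 0;
    for (int i = 0; i < count; i++) {
      int gap = 0;
      for (int b = 0; b < bpv; b++) {
        int bit = (buf[1 + (bitPos >>> 3)] >>> (7 - (bitPos & 7))) & 1;
        gap = (gap << 1) | bit;
        bitPos++;
      }
      prev += gap;
      ords[i] = prev;
    }
    return ords;
  }

  public static void main(String[] args) {
    int[] ords = {3, 7, 8, 20, 21};
    System.out.println(Arrays.toString(decode(encode(ords))));
  }
}
```

The bit-at-a-time loops are written for clarity, not speed; a decoder like this would be expected to lose to a tuned vInt loop, which is in line with the caveat above about the custom decoder being slower than the optimized oal.util.PackedInts.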
> Write a PackedIntsEncoder/Decoder for facets
> --------------------------------------------
>
>                 Key: LUCENE-4609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4609
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/facet
>            Reporter: Shai Erera
>            Priority: Minor
>         Attachments: LUCENE-4609.patch, LUCENE-4609.patch
>
>
> Today the facets API lets you write IntEncoder/Decoder to encode/decode the
> category ordinals. We have several such encoders, including VInt (default),
> and block encoders.
> It would be interesting to implement and benchmark a
> PackedIntsEncoder/Decoder, with potentially two variants: (1) receives
> bitsPerValue up front, when you e.g. know that you have a small taxonomy and
> the max value you can see and (2) one that decides for each doc on the
> optimal bitsPerValue, writes it as a header in the byte[] or something.
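For comparison with the VInt default mentioned in the issue description, a minimal sketch of dGap plus variable-length int encoding follows. The class and method names are hypothetical, and the byte order (low seven bits first, with the high bit as a continuation flag) is an assumption; the facet module's actual encoder may lay out bytes differently.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch of a dGap + VInt baseline, not the facet module's encoder.
final class VIntDGapSketch {

  /** Encodes sorted, non-negative ordinals as dGaps, 7 payload bits per byte. */
  static byte[] encode(int[] sortedOrds) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int ord : sortedOrds) {
      int gap = ord - prev;
      prev = ord;
      while ((gap & ~0x7F) != 0) {
        out.write((gap & 0x7F) | 0x80); // continuation bit set: more bytes follow
        gap >>>= 7;
      }
      out.write(gap);                   // final byte of this value, high bit clear
    }
    return out.toByteArray();
  }

  /** Decodes the gaps and accumulates them back into ordinals in one pass. */
  static int[] decode(byte[] buf) {
    int[] tmp = new int[buf.length];    // each value takes at least one byte
    int count = 0, prev = 0, pos = 0;
    while (pos < buf.length) {
      int b = buf[pos++];
      int gap = b & 0x7F;
      for (int shift = 7; (b & 0x80) != 0; shift += 7) {
        b = buf[pos++];
        gap |= (b & 0x7F) << shift;
      }
      prev += gap;
      tmp[count++] = prev;
    }
    return Arrays.copyOf(tmp, count);
  }

  public static void main(String[] args) {
    int[] ords = {5, 305};
    System.out.println(Arrays.toString(decode(encode(ords))));
  }
}
```

Each value here gets its own byte count, so one large gap does not widen every other value in the document the way a single per-document bitsPerValue does.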