[jira] Commented: (LUCENE-1410) PFOR implementation

Paul Elschot (JIRA) Fri, 03 Oct 2008 13:24:36 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636730#action_12636730
 ]


Paul Elschot commented on LUCENE-1410:
--------------------------------------

The number of bits is not really informative, it would be better to have a 
distribution of the result of getNumFrameBits(). Then we can see for which nrs 
of bits loop unrolling might be tried.

As for decompression speed, please remember that the patching code (that 
decodes the exceptions into the output) has not yet been optimized at all.

The lucene .frq file contains the docids as deltas and the frequencies but with 
a special encoding in case the frequency is one. I'd rather try compressing the 
real delta docids and the real frequencies than the encoded versions.

The .prx file should be useable as it is, and from the results reported in the 
articles I would expect a real performance advantage for PFor for that.
Michael, could you post a distribution of the number of frame bits for the .prx 
file? I'd like to know a reasonable maximum for unrolling the corresponding 
loops.

>: maybe I'm doing something wrong
I don't think so, the code is still young. Try running TestPFor and have a look 
at the output near the end for the case of 6 frame bits. That should show the 
unrolled decoding speed for the case without exceptions. Then compare that to 
the cases with lower and higher nrs of frame bits.

>: Stepping back a bit: I wonder what %tg of the time in a "typical"
Lucene search is spent decoding vInts? That would bound how much
improvement we could ever expect from this optimization during
searching.

A: there is also the influence of the reduction of data to be fetched (via 
various caches) from the index. As reported in the articles, PFor strikes a 
nice optimum between decompression speed and fetching speed from (fast) disks.

>: I was thinking local CPU's native asm.
A: I'd try a C version first. Iirc gcc has a decent optimizer for bit ops, but 
it's been a while for me that I used C.

For the record, to show the variation in decompression speeds for different 
numbers of frame bits without exceptions, here is my current output from 
TestPFor:
{noformat}
test901PerfFor1Decompress starting
Compressed 128 bytes into 8, ratio 0.0625
test901PerfFor1Decompress 0 numFrameBits 1 decompressed 104857600 in 1021 
msecs, 102700 ints/msec, (25 iters).
test901PerfFor1Decompress 1 numFrameBits 1 decompressed 171966464 in 1020 
msecs, 168594 ints/msec, (41 iters).
test901PerfFor1Decompress 2 numFrameBits 1 decompressed 171966464 in 1012 
msecs, 169927 ints/msec, (41 iters).

test902PerfFor2Decompress starting
Compressed 128 bytes into 12, ratio 0.09375
test902PerfFor2Decompress 0 numFrameBits 2 decompressed 130023424 in 1017 
msecs, 127849 ints/msec, (31 iters).
test902PerfFor2Decompress 1 numFrameBits 2 decompressed 155189248 in 1000 
msecs, 155189 ints/msec, (37 iters).
test902PerfFor2Decompress 2 numFrameBits 2 decompressed 159383552 in 1022 
msecs, 155952 ints/msec, (38 iters).

test903PerfFor3Decompress starting
Compressed 128 bytes into 16, ratio 0.125
test903PerfFor3Decompress 0 numFrameBits 3 decompressed 188743680 in 1016 
msecs, 185771 ints/msec, (45 iters).
test903PerfFor3Decompress 1 numFrameBits 3 decompressed 205520896 in 1018 
msecs, 201886 ints/msec, (49 iters).
test903PerfFor3Decompress 2 numFrameBits 3 decompressed 201326592 in 1026 
msecs, 196224 ints/msec, (48 iters).

test904PerfFor4Decompress starting
Compressed 128 bytes into 20, ratio 0.15625
test904PerfFor4Decompress 0 numFrameBits 4 decompressed 146800640 in 1014 
msecs, 144773 ints/msec, (35 iters).
test904PerfFor4Decompress 1 numFrameBits 4 decompressed 159383552 in 1007 
msecs, 158275 ints/msec, (38 iters).
test904PerfFor4Decompress 2 numFrameBits 4 decompressed 159383552 in 1011 
msecs, 157649 ints/msec, (38 iters).

test905PerfFor5Decompress starting
Compressed 128 bytes into 24, ratio 0.1875
test905PerfFor5Decompress 0 numFrameBits 5 decompressed 159383552 in 1010 
msecs, 157805 ints/msec, (38 iters).
test905PerfFor5Decompress 1 numFrameBits 5 decompressed 176160768 in 1009 
msecs, 174589 ints/msec, (42 iters).
test905PerfFor5Decompress 2 numFrameBits 5 decompressed 176160768 in 1004 
msecs, 175458 ints/msec, (42 iters).

test906PerfFor6Decompress starting
Compressed 128 bytes into 28, ratio 0.21875
test906PerfFor6Decompress 0 numFrameBits 6 decompressed 117440512 in 1001 
msecs, 117323 ints/msec, (28 iters).
test906PerfFor6Decompress 1 numFrameBits 6 decompressed 125829120 in 1002 
msecs, 125577 ints/msec, (30 iters).
test906PerfFor6Decompress 2 numFrameBits 6 decompressed 125829120 in 1004 
msecs, 125327 ints/msec, (30 iters).

test907PerfFor7Decompress starting
Compressed 128 bytes into 32, ratio 0.25
test907PerfFor7Decompress 0 numFrameBits 7 decompressed 125829120 in 1016 
msecs, 123847 ints/msec, (30 iters).
test907PerfFor7Decompress 1 numFrameBits 7 decompressed 150994944 in 1015 
msecs, 148763 ints/msec, (36 iters).
test907PerfFor7Decompress 2 numFrameBits 7 decompressed 150994944 in 1012 
msecs, 149204 ints/msec, (36 iters).

test908PerfFor8Decompress starting
Compressed 124 bytes into 35, ratio 0.28225806
test908PerfFor8Decompress 0 numFrameBits 8 decompressed 85327872 in 1015 msecs, 
84066 ints/msec, (21 iters).
test908PerfFor8Decompress 1 numFrameBits 8 decompressed 121896960 in 1021 
msecs, 119389 ints/msec, (30 iters).
test908PerfFor8Decompress 2 numFrameBits 8 decompressed 121896960 in 1022 
msecs, 119272 ints/msec, (30 iters).

test909PerfFor9Decompress starting
Compressed 124 bytes into 39, ratio 0.31451613
test909PerfFor9Decompress 0 numFrameBits 9 decompressed 52822016 in 1016 msecs, 
51990 ints/msec, (13 iters).
test909PerfFor9Decompress 1 numFrameBits 9 decompressed 56885248 in 1028 msecs, 
55335 ints/msec, (14 iters).
test909PerfFor9Decompress 2 numFrameBits 9 decompressed 56885248 in 1029 msecs, 
55282 ints/msec, (14 iters).
{noformat}
The loop for 9 bits is not unrolled, unrolling makes it a tiny little bit 
(55->54 Mint/sec) slower.

> PFOR implementation
> -------------------
>
>                 Key: LUCENE-1410
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1410
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Other
>            Reporter: Paul Elschot
>            Priority: Minor
>         Attachments: LUCENE-1410b.patch, TestPFor2.java, TestPFor2.java
>
>   Original Estimate: 21840h
>  Remaining Estimate: 21840h
>
> Implementation of Patched Frame of Reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Commented: (LUCENE-1410) PFOR implementation

Reply via email to