[
https://issues.apache.org/jira/browse/LUCENE-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Muir updated LUCENE-2816:
--------------------------------
Attachment: LUCENE-2816.patch
Here's the most important benchmark: speeding up MultiMMapIndexInput's readByte()/readBytes() in general:
MultiMMapIndexInput readByte(s) improvements [trunk, Standard codec]
||Query||QPS trunk||QPS patch||Pct diff||
|spanFirst(unit, 5)|12.72|12.85|{color:green}1.0%{color}|
|+nebraska +state|137.47|139.33|{color:green}1.3%{color}|
|spanNear([unit, state], 10, true)|2.90|2.94|{color:green}1.4%{color}|
|"unit state"|5.88|5.99|{color:green}1.8%{color}|
|unit~2.0|7.06|7.20|{color:green}2.0%{color}|
|+unit +state|8.68|8.87|{color:green}2.2%{color}|
|unit state|8.00|8.23|{color:green}2.9%{color}|
|unit~1.0|7.19|7.41|{color:green}3.0%{color}|
|unit*|22.66|23.41|{color:green}3.3%{color}|
|uni*|12.54|13.12|{color:green}4.6%{color}|
|united~1.0|10.61|11.12|{color:green}4.8%{color}|
|united~2.0|2.52|2.65|{color:green}5.1%{color}|
|state|28.72|30.23|{color:green}5.3%{color}|
|un*d|44.84|48.06|{color:green}7.2%{color}|
|u*d|13.17|14.51|{color:green}10.2%{color}|
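For a sense of what the patch changes here, a minimal sketch of the buffer-switching idea (the class and field names below are made up for illustration; this is not the patch itself). Instead of doing our own bounds check on every call, we let ByteBuffer.get() do its built-in check and only switch to the next mapped buffer when it underflows, which is rare since each buffer spans up to Integer.MAX_VALUE bytes:
{code:java}
import java.io.IOException;
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;

// Sketch only: class/field names are illustrative, not the actual patch.
class MultiBufferInput {
  private final ByteBuffer[] buffers; // one mapped buffer per <= 2GB chunk of the file
  private ByteBuffer curBuf;
  private int curBufIndex;

  MultiBufferInput(ByteBuffer[] buffers) {
    this.buffers = buffers;
    this.curBufIndex = 0;
    this.curBuf = buffers[0];
  }

  public byte readByte() throws IOException {
    try {
      return curBuf.get();            // common case: rely on ByteBuffer's own bounds check
    } catch (BufferUnderflowException e) {
      curBufIndex++;                  // rare case: hop to the next mapped buffer
      if (curBufIndex >= buffers.length) {
        throw new IOException("read past EOF");
      }
      curBuf = buffers[curBufIndex];
      curBuf.position(0);
      return curBuf.get();
    }
  }
}
{code}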
In the bulk postings branch, I've been experimenting with various techniques for FOR/PFOR, and one thing I tried was simply decoding with readInt() from the DataInput. So I adapted For/PFOR to take a DataInput and work on it directly, instead of reading into a byte[], wrapping it with a ByteBuffer, and working on an IntBuffer view.
But when I did this, I found that MMap was slow for readInt(), etc. So, in this patch, we implement these primitives with ByteBuffer.getInt() and friends. This isn't very important since Lucene doesn't use these primitives much, and it's mostly theoretical, but I still think things like readInt(), readShort(), and readLong() should be fast... for example, just earlier today someone posted an alternative PFOR implementation on LUCENE-1410 that uses DataInput.readInt().
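To make the two decode styles concrete, here's a rough sketch (not the For/PFOR code from the branch; it assumes org.apache.lucene.store.DataInput, ignores the actual bit packing, and just shows the I/O pattern being compared):
{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import org.apache.lucene.store.DataInput;

// Illustrative sketch only: whole ints instead of packed values, to show the I/O pattern.
class DecodeSketch {

  // Old style: read a block into a byte[], wrap it, and decode from an IntBuffer view.
  static void decodeViaIntBufferView(DataInput in, int[] dest) throws IOException {
    byte[] bytes = new byte[dest.length * 4];
    in.readBytes(bytes, 0, bytes.length);
    IntBuffer view = ByteBuffer.wrap(bytes).asIntBuffer();
    view.get(dest);
  }

  // New style: decode directly from the DataInput -- only worthwhile if readInt() is fast.
  static void decodeViaReadInt(DataInput in, int[] dest) throws IOException {
    for (int i = 0; i < dest.length; i++) {
      dest[i] = in.readInt();
    }
  }
}
{code}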
MMapIndexInput readInt() improvements [bulkpostings, FrameOfRefDataInput codec]
||Query||QPS branch||QPS patch||Pct diff||
|spanFirst(unit, 5)|12.14|11.99|{color:red}-1.2%{color}|
|united~1.0|11.32|11.33|{color:green}0.1%{color}|
|united~2.0|2.51|2.56|{color:green}2.1%{color}|
|unit~1.0|6.98|7.19|{color:green}3.0%{color}|
|unit~2.0|6.88|7.11|{color:green}3.3%{color}|
|spanNear([unit, state], 10, true)|2.81|2.96|{color:green}5.2%{color}|
|unit state|8.04|8.59|{color:green}6.8%{color}|
|+unit +state|10.97|12.12|{color:green}10.5%{color}|
|unit*|26.67|29.80|{color:green}11.7%{color}|
|"unit state"|5.59|6.27|{color:green}12.3%{color}|
|uni*|15.10|17.51|{color:green}15.9%{color}|
|state|33.20|38.72|{color:green}16.6%{color}|
|+nebraska +state|59.17|71.45|{color:green}20.8%{color}|
|un*d|35.98|47.14|{color:green}31.0%{color}|
|u*d|9.48|12.46|{color:green}31.4%{color}|
Here's the same DataInput.readInt() benchmark, but with MultiMMapIndexInput:
MultiMMapIndexInput readInt() improvements [bulkpostings, FrameOfRefDataInput codec]
||Query||QPS branch||QPS patch||Pct diff||
|united~2.0|2.43|2.54|{color:green}4.3%{color}|
|united~1.0|10.78|11.39|{color:green}5.7%{color}|
|unit~1.0|6.81|7.21|{color:green}5.8%{color}|
|unit~2.0|6.62|7.05|{color:green}6.5%{color}|
|spanNear([unit, state], 10, true)|2.77|2.96|{color:green}6.6%{color}|
|unit state|7.85|8.53|{color:green}8.7%{color}|
|spanFirst(unit, 5)|10.50|11.71|{color:green}11.5%{color}|
|+unit +state|10.26|11.94|{color:green}16.3%{color}|
|"unit state"|5.39|6.31|{color:green}17.0%{color}|
|state|31.95|39.17|{color:green}22.6%{color}|
|unit*|24.39|31.02|{color:green}27.2%{color}|
|+nebraska +state|54.68|71.98|{color:green}31.6%{color}|
|u*d|9.53|12.62|{color:green}32.5%{color}|
|uni*|13.72|18.23|{color:green}32.9%{color}|
|un*d|35.87|48.19|{color:green}34.3%{color}|
Just to be sure, I also ran this last one on sparc64 (big-endian).
MultiMMapIndexInput readInt() improvements [bulkpostings, FrameOfRefDataInput codec, sparc64]
||Query||QPS branch||QPS patch||Pct diff||
|united~2.0|2.23|2.26|{color:green}1.5%{color}|
|unit~2.0|6.37|6.47|{color:green}1.6%{color}|
|united~1.0|11.33|11.59|{color:green}2.3%{color}|
|unit~1.0|9.68|10.05|{color:green}3.7%{color}|
|spanNear([unit, state], 10, true)|15.60|17.54|{color:green}12.5%{color}|
|unit*|127.14|144.08|{color:green}13.3%{color}|
|unit state|44.93|51.30|{color:green}14.2%{color}|
|spanFirst(unit, 5)|58.42|68.37|{color:green}17.0%{color}|
|uni*|56.66|67.53|{color:green}19.2%{color}|
|+nebraska +state|215.62|262.99|{color:green}22.0%{color}|
|+unit +state|63.18|77.86|{color:green}23.2%{color}|
|"unit state"|32.24|40.05|{color:green}24.2%{color}|
|u*d|29.13|36.69|{color:green}26.0%{color}|
|state|145.99|188.33|{color:green}29.0%{color}|
|un*d|65.27|87.20|{color:green}33.6%{color}|
I think some of these benchmarks also suggest that MultiMMapIndexInput might now be essentially as fast as MMapIndexInput... but let's not go there yet and keep them separate for now.
> MMapDirectory speedups
> ----------------------
>
> Key: LUCENE-2816
> URL: https://issues.apache.org/jira/browse/LUCENE-2816
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Store
> Affects Versions: 3.1, 4.0
> Reporter: Robert Muir
> Assignee: Robert Muir
> Attachments: LUCENE-2816.patch
>
>
> MMapDirectory has some performance problems:
> # When the file is larger than Integer.MAX_VALUE, we use MultiMMapIndexInput, which does a lot of unnecessary bounds checks for its buffer switching, etc. Instead, like MMapIndexInput, it should rely upon the contract of these operations in ByteBuffer (which always does a bounds check and throws BufferUnderflowException). Our 'buffer' is so large (Integer.MAX_VALUE) that this rarely happens, and doing our own bounds checks just slows things down.
> # readInt()/readLong()/readShort() are slow and should just defer to ByteBuffer.getInt(), etc. This isn't very important since we don't use these much, but I think there's no reason users (e.g. codec writers) should have to readBytes() + wrap as a ByteBuffer + get an IntBuffer view when readInt() can be almost as fast...
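As a rough illustration of item 2, a sketch (with made-up class/field names, not the committed code) of what "defer to ByteBuffer" looks like for the single-buffer case:
{code:java}
import java.io.IOException;
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;

// Sketch only: instead of assembling an int from four readByte() calls,
// defer to ByteBuffer.getInt()/getLong(), which the JVM optimizes well.
class SingleBufferInput {
  private final ByteBuffer curBuf; // the single mapped buffer

  SingleBufferInput(ByteBuffer buf) {
    this.curBuf = buf;
  }

  public int readInt() throws IOException {
    try {
      return curBuf.getInt();
    } catch (BufferUnderflowException e) {
      throw new IOException("read past EOF");
    }
  }

  public long readLong() throws IOException {
    try {
      return curBuf.getLong();
    } catch (BufferUnderflowException e) {
      throw new IOException("read past EOF");
    }
  }
}
{code}
ByteBuffer's default byte order is big-endian, matching how Lucene's DataOutput.writeInt() lays out bytes, so deferring to getInt() should return the same values as the manual per-byte reads; the sparc64 run above is a sanity check of exactly that.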