[
https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704246#action_12704246
]
Patrick Eger commented on LUCENE-1607:
--------------------------------------
As a quick comment on the implementation, i notice that it is possible (with
reasonable probability) for hash collisions to result in re-interning a pair of
strings multiple times. For example, a codepath that traverses across 32 unique
string datapoints (say, in an inner loop somewhere), would have a minimum 3%
probability of colliding and re-interning 2 strings every time through the
loop. Because of the birthday paradox, it becomes likely to have such a
situation (50% probability with ~150 unique values).
These are the probabilities of collision, assuming random distribution and
perfect hashing. In real life the distribution will not be so random
(string.hashCode() & MASK) so these will be "best case".
2 datapoints: collision prob = 0.006104%
4 datapoints: collision prob = 0.036617%
8 datapoints: collision prob = 0.170779%
16 datapoints: collision prob = 0.729976%
32 datapoints: collision prob = 2.983863%
64 datapoints: collision prob = 11.591861%
128 datapoints: collision prob = 39.188158%
256 datapoints: collision prob = 86.501947%
Practically this may or may not matter. My thought is that some sort of fast
LRU structure would be better, but of course creating something like this
without locking may be tricky. Another idea might be to support some form of
limited hash-chaining or probing in the table, which would mitigate the damage
of a collision significantly.
for reference, python code for calculating birthday/hash collisions,
shamelessly stolen:
--------------------
def bp(n, d):
v = 1.0
for i in range(n):
v = v * (1 - float(i)/d)
return 1 - v
for n in [2, 4, 8, 16, 32, 64, 128, 256]:
print "%i datapoints: collision prob = %f%%" % (n, bp(n, 16*1024)*100)
> String.intern() faster alternative
> ----------------------------------
>
> Key: LUCENE-1607
> URL: https://issues.apache.org/jira/browse/LUCENE-1607
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Earwin Burrfoot
> Fix For: 2.9
>
> Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch,
> LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can
> be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned
> strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]