[ 
https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704246#action_12704246
 ] 

Patrick Eger commented on LUCENE-1607:
--------------------------------------

As a quick comment on the implementation, i notice that it is possible (with 
reasonable probability) for hash collisions to result in re-interning a pair of 
strings multiple times. For example, a codepath that traverses across 32 unique 
string datapoints (say, in an inner loop somewhere), would have a minimum 3% 
probability of colliding and re-interning 2 strings every time through the 
loop. Because of the birthday paradox, it becomes likely to have such a 
situation (50% probability with ~150 unique values).

These are the probabilities of collision, assuming random distribution and 
perfect hashing. In real life the distribution will not be so random 
(string.hashCode() & MASK) so these will be "best case".
2 datapoints: collision prob = 0.006104%
4 datapoints: collision prob = 0.036617%
8 datapoints: collision prob = 0.170779%
16 datapoints: collision prob = 0.729976%
32 datapoints: collision prob = 2.983863%
64 datapoints: collision prob = 11.591861%
128 datapoints: collision prob = 39.188158%
256 datapoints: collision prob = 86.501947%


Practically this may or may not matter. My thought is that some sort of fast 
LRU structure would be better, but of course creating something like this 
without locking may be tricky. Another idea might be to support some form of 
limited hash-chaining or probing in the table, which would mitigate the damage 
of a collision significantly.


for reference, python code for calculating birthday/hash collisions, 
shamelessly stolen:
--------------------
def bp(n, d):
        v = 1.0
        for i in range(n):
                v = v * (1 - float(i)/d)
        return 1 - v

for n in [2, 4, 8, 16, 32, 64, 128, 256]:
        print "%i datapoints: collision prob = %f%%" % (n, bp(n, 16*1024)*100)

> String.intern() faster alternative
> ----------------------------------
>
>                 Key: LUCENE-1607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1607
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Earwin Burrfoot
>             Fix For: 2.9
>
>         Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, 
> LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can 
> be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned 
> strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Reply via email to