[ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704246#action_12704246 ]
Patrick Eger commented on LUCENE-1607: -------------------------------------- As a quick comment on the implementation, i notice that it is possible (with reasonable probability) for hash collisions to result in re-interning a pair of strings multiple times. For example, a codepath that traverses across 32 unique string datapoints (say, in an inner loop somewhere), would have a minimum 3% probability of colliding and re-interning 2 strings every time through the loop. Because of the birthday paradox, it becomes likely to have such a situation (50% probability with ~150 unique values). These are the probabilities of collision, assuming random distribution and perfect hashing. In real life the distribution will not be so random (string.hashCode() & MASK) so these will be "best case". 2 datapoints: collision prob = 0.006104% 4 datapoints: collision prob = 0.036617% 8 datapoints: collision prob = 0.170779% 16 datapoints: collision prob = 0.729976% 32 datapoints: collision prob = 2.983863% 64 datapoints: collision prob = 11.591861% 128 datapoints: collision prob = 39.188158% 256 datapoints: collision prob = 86.501947% Practically this may or may not matter. My thought is that some sort of fast LRU structure would be better, but of course creating something like this without locking may be tricky. Another idea might be to support some form of limited hash-chaining or probing in the table, which would mitigate the damage of a collision significantly. for reference, python code for calculating birthday/hash collisions, shamelessly stolen: -------------------- def bp(n, d): v = 1.0 for i in range(n): v = v * (1 - float(i)/d) return 1 - v for n in [2, 4, 8, 16, 32, 64, 128, 256]: print "%i datapoints: collision prob = %f%%" % (n, bp(n, 16*1024)*100) > String.intern() faster alternative > ---------------------------------- > > Key: LUCENE-1607 > URL: https://issues.apache.org/jira/browse/LUCENE-1607 > Project: Lucene - Java > Issue Type: Improvement > Reporter: Earwin Burrfoot > Fix For: 2.9 > > Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, > LUCENE-1607.patch, LUCENE-1607.patch > > > By using our own interned string pool on top of default, String.intern() can > be greatly optimized. > On my setup (java 6) this alternative runs ~15.8x faster for already interned > strings, and ~2.2x faster for 'new String(interned)' > For java 5 and 4 speedup is lower, but still considerable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org