singhpk234 opened a new pull request, #7128:
URL: https://github.com/apache/iceberg/pull/7128
### About the change
Presently we use "%08x" to format the hash, which produces an 8-digit hex
number padded with leading zeros. In practice the first character is
restricted to [0-7] (see the distribution below), so the prefixes are skewed.
And since we rely on a hex number, the character set is in any case limited
to [0-9a-f].
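To illustrate the restriction described above, here is a minimal sketch. It assumes the hash is a non-negative 32-bit int (which is what the [0-7] result suggests); `FirstHexDigitDemo` and `leadingHexDigits` are hypothetical names used only for this illustration, and random ints stand in for the real bucket hash.

```java
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

final class FirstHexDigitDemo {
  // Collects the leading character of "%08x" for random non-negative ints.
  static Set<Character> leadingHexDigits(int samples, long seed) {
    Random random = new Random(seed);
    Set<Character> seen = new TreeSet<>();
    for (int i = 0; i < samples; i++) {
      // Assumption: clearing the sign bit mimics a non-negative hash value.
      int nonNegative = random.nextInt() & Integer.MAX_VALUE;
      seen.add(String.format("%08x", nonNegative).charAt(0));
    }
    return seen;
  }

  public static void main(String[] args) {
    // The largest non-negative int formats as 7fffffff, so no leading
    // digit above 7 can ever appear.
    System.out.println(String.format("%08x", Integer.MAX_VALUE));
    System.out.println(leadingHexDigits(100_000, 42L));
  }
}
```

Because the sign bit is always zero, the top hex digit only ever uses 3 of its 4 bits, which is why half of the 16 possible first characters never show up.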
This change attempts to use a wider character set while keeping the
distribution of the first character as uniform as possible.
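One way to get both properties can be sketched as follows. This is not the `HashUtils.computeHash` implementation from this PR; `Base62PrefixSketch`, `encode`, the alphabet order, and the fixed width of 6 are all assumptions made for illustration. The sketch writes base-62 digits least-significant first, so the leading character is `hash % 62`, which is nearly uniform when the hash itself is uniform over non-negative ints.

```java
final class Base62PrefixSketch {
  // Hypothetical alphabet: digits, then upper case, then lower case (62 chars).
  private static final String ALPHABET =
      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
  // 62^6 exceeds 2^31, so 6 characters can hold any non-negative int.
  private static final int WIDTH = 6;

  // Encodes a non-negative int as 6 base-62 characters, lowest digit first.
  static String encode(int nonNegativeHash) {
    if (nonNegativeHash < 0) {
      throw new IllegalArgumentException("hash must be non-negative");
    }
    char[] out = new char[WIDTH];
    int value = nonNegativeHash;
    for (int i = 0; i < WIDTH; i++) {
      out[i] = ALPHABET.charAt(value % 62);
      value /= 62;
    }
    return new String(out);
  }

  public static void main(String[] args) {
    System.out.println(encode(61)); // z00000
    System.out.println(encode(62)); // 010000
  }
}
```

Putting the most significant digit first would reintroduce the original problem: with 6 base-62 characters, a value below 2^31 can only start with one of the first three characters of the alphabet.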
Sample code for measuring the distribution:
```
@Test
public void distributionOfFirstChar() {
  Function<Object, Integer> HASH_FUNC =
      Transforms.bucket(Integer.MAX_VALUE).bind(Types.StringType.get());
  // Counts how often each character appears as the first character of the hash.
  Map<String, Integer> hm = Maps.newHashMap();
  for (int i = 0; i < 1000000; ++i) {
    String randomUUID = UUID.randomUUID().toString();
    // Old behavior, kept for comparison:
    // String hashFunc = String.format("%08x", HASH_FUNC.apply(randomUUID));
    String hashFunc = HashUtils.computeHash(randomUUID);
    String firstChar = hashFunc.substring(0, 1);
    hm.put(firstChar, (hm.getOrDefault(firstChar, 0) + 1));
  }
  for (String key : hm.keySet()) {
    System.out.println(String.format("hm[%s] = %s", key, hm.get(key)));
  }
}
```
Distribution of the first character before this change (1M UUID strings);
it is restricted to [0-7]:
hm[0] = 125099
hm[1] = 124953
hm[2] = 125440
hm[3] = 124705
hm[4] = 124777
hm[5] = 125103
hm[6] = 124908
hm[7] = 125015
Distribution of the first character after this change (1M UUID strings):
hm[0] = 15715
hm[1] = 15524
hm[2] = 15861
hm[3] = 15680
hm[4] = 15411
hm[5] = 15638
hm[6] = 19410
hm[7] = 19298
hm[8] = 19472
hm[9] = 19399
hm[A] = 15661
hm[B] = 15633
hm[C] = 15414
hm[D] = 15675
hm[E] = 15711
hm[F] = 15569
hm[G] = 15767
hm[H] = 15643
hm[I] = 15616
hm[J] = 15508
hm[K] = 15636
hm[L] = 15726
hm[M] = 15701
hm[N] = 15658
hm[O] = 15525
hm[P] = 15646
hm[Q] = 15686
hm[R] = 15666
hm[S] = 15675
hm[T] = 15521
hm[U] = 15569
hm[V] = 15613
hm[W] = 15398
hm[X] = 15797
hm[Y] = 15855
hm[Z] = 15620
hm[a] = 19318
hm[b] = 19460
hm[c] = 19579
hm[d] = 19673
hm[e] = 15790
hm[f] = 15687
hm[g] = 15622
hm[h] = 15833
hm[i] = 15693
hm[j] = 15547
hm[k] = 15725
hm[l] = 15521
hm[m] = 15911
hm[n] = 15468
hm[o] = 15579
hm[p] = 15753
hm[q] = 15594
hm[r] = 15723
hm[s] = 15628
hm[t] = 15433
hm[u] = 15645
hm[v] = 15544
hm[w] = 15761
hm[x] = 15524
hm[y] = 15565
hm[z] = 15527
More resources:
1. https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/