[GitHub] [iceberg] jackye1995 commented on pull request #7128: Core: Optimize S3 layout of Datafiles by expanding first character set of the hash

via GitHub Fri, 17 Mar 2023 09:07:06 -0700


jackye1995 commented on PR #7128:
URL: https://github.com/apache/iceberg/pull/7128#issuecomment-1474064817


   @danielcweeks this comes out of a test we did with one of our biggest 
customers, where the current object storage layout still resulted in 
throttling, but by extending it we were able to eliminate throttling. 
   
   There are 2 issues with the current layout, here are some data we collected 
internally:
   
   1. The left padding is an area of concern.  If the hex-encoded version of 
the hash is not always 8 characters long, then there will be a large number of 
entries that start with one or more “0” characters, which could create uneven 
heat distribution for S3.  To test this theory we use UUID to simulate file 
paths, generate 10M random entries, and then apply the algorithm above.  We can 
plot the histogram of the length of a non-padded, hex-encoded entropy:
   
   ```
   length[1] = 0 
   length[2] = 2 
   length[3] = 24 
   length[4] = 281 
   length[5] = 4589 
   length[6] = 73243 
   length[7] = 1173442 
   length[8] = 8748419 
   ```
   
   This indicates that ~12.5% of all file paths will start with “0”, which for 
a hex-encoded string should have been 6.25% since there are 16 characters in 
the set.  
   
   2. All hex-encodings start with the inclusive range 0-8:
   
   ```
   count(partition["0"]) = 1249993 
   count(partition["1"]) = 1250614 
   count(partition["2"]) = 1250830 
   count(partition["3"]) = 1250695 
   count(partition["4"]) = 1249028 
   count(partition["5"]) = 1249782 
   count(partition["6"]) = 1249436 
   count(partition["7"]) = 1249622 
   ```
   
   The issue here is that the default algorithm generates values up to 
`Integer.MAX_VALUE`, which in Java is `(2^31)-1`.  When this is converted to 
hex, it produces 0x7FFFFFFF, so it’s understandable why the first half of the 
range is restricted to 8 characters using this default algorithm.
   
   With that being said, maybe we do not need all the character range to solve 
it. An alternative approach is to use something like a reverse hex, which we 
have also tested to be working. But this approach will maximize the entropy in 
any cases without the need to worry about S3 internals. I will ask S3 team to 
take a look and see if the current approach is necessary.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] jackye1995 commented on pull request #7128: Core: Optimize S3 layout of Datafiles by expanding first character set of the hash

Reply via email to