Re: [PR] HDDS-12779. Parallelize table iteration across different ranges [ozone]

via GitHub Sun, 13 Apr 2025 22:23:44 -0700


swamirishi commented on code in PR #8243:
URL: https://github.com/apache/ozone/pull/8243#discussion_r2041413386



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/metadata/BigIntegerCodec.java:
##########
@@ -27,6 +29,7 @@
 public final class BigIntegerCodec implements Codec<BigInteger> {
 
   private static final Codec<BigInteger> INSTANCE = new BigIntegerCodec();
+  private static final Comparator<BigInteger> COMPARATOR = (o1, o2) -> 
Objects.compare(o1, o2, BigInteger::compareTo);

Review Comment:
   @szetszwo @kerneltime I was able to repro with UTF_16LE.
   ```
     @Test
     public void testMinHeapIterator() throws IOException, 
InterruptedException, TimeoutException {
       OzoneConfiguration ozoneConfiguration = new OzoneConfiguration();
       ozoneConfiguration.setInt(OZONE_OM_SNAPSHOT_DB_MAX_OPEN_FILES, -1);
       ozoneConfiguration.set(OMConfigKeys.OZONE_OM_DB_DIRS, 
storageDir.getAbsolutePath());
   
       String str1 = "\u00FF1"; 
       String str2 = "\uFF002";
       String str3 = "\uFF003";
       System.out.println(String.format("Str1(%s) Characters  \t:", str1) +
           IntStream.range(0, 
str1.length()).map(str1::charAt).boxed().collect(Collectors.toList()));
       System.out.println(String.format("Str2(%s) Characters \t:", str2) +
           IntStream.range(0, 
str2.length()).map(str2::charAt).boxed().collect(Collectors.toList()));
       System.out.println(String.format("Str3(%s) Characters \t:", str3) +
           IntStream.range(0, 
str3.length()).map(str3::charAt).boxed().collect(Collectors.toList()));
       System.out.println("Str1 Bytes \t:" +
           
Arrays.toString(getIntArray(str1.getBytes(StandardCharsets.UTF_16LE))));
       System.out.println("Str2 Bytes \t:" +
           
Arrays.toString(getIntArray(str2.getBytes(StandardCharsets.UTF_16LE))));
       System.out.println("Str3 Bytes \t:" +
           
Arrays.toString(getIntArray(str3.getBytes(StandardCharsets.UTF_16LE))));
   
       List<String> sortedByByteArray1 = Stream.of(str1, str2, str3)
           .sorted((o1, o2) -> 
compareByteArrays(o1.getBytes(StandardCharsets.UTF_16LE),
               
o2.getBytes(StandardCharsets.UTF_16LE))).collect(Collectors.toList());
       List<String> sortedLexically1 = Stream.of(str1, str2, 
str3).sorted().collect(Collectors.toList());
       System.out.println("Key Sorted Lexically \t" + sortedLexically1);
       System.out.println("Key Sorted By Byte Array \t" + sortedByByteArray1);
   
   
       OmMetadataManagerImpl omMetadataManager = new 
OmMetadataManagerImpl(ozoneConfiguration, null);
       Table<String, OmKeyInfo> table = 
omMetadataManager.getStore().getTable("fileTable", StringUTF16LECodec.get(),
           OmKeyInfo.getCodec(true), TableCache.CacheType.PARTIAL_CACHE);
       int i = 1;
       for (String key : sortedByByteArray1) {
         table.addCacheEntry(key, getOmKeyInfo("key" , i), 0);
         table.put(key, getOmKeyInfo("key", i));
         i++;
       }
   
       System.out.println("============ Before Flushing Cache StartKey is first 
key "+ sortedLexically1.get(0) +"    " +
           "===========");
       ListIterator.MinHeapIterator minHeapIterator =
           new ListIterator.MinHeapIterator(omMetadataManager, "",
           sortedLexically1.get(0), "vol",
           "buck", table);
       while (minHeapIterator.hasNext()) {
         ListIterator.HeapEntry entry = minHeapIterator.next();
         System.out.println(entry.getKey() + "\t" + 
((OmKeyInfo)entry.getValue()).getKeyName());
       }
   
       table.cleanupCache(Collections.singletonList(0L));
       GenericTestUtils.waitFor(() -> {
         System.out.println("Flushing Cache.....");
         return !table.cacheIterator().hasNext();
       }, 100, 10000);
       System.out.println("============ After Flushing Cache StartKey is first 
key "+ sortedLexically1.get(0) +
           " ===========");
       minHeapIterator = new ListIterator.MinHeapIterator(omMetadataManager, "",
               sortedLexically1.get(0), "vol",
               "buck", table);
       while (minHeapIterator.hasNext()) {
         ListIterator.HeapEntry entry = minHeapIterator.next();
         System.out.println(entry.getKey() + "\t" + 
((OmKeyInfo)entry.getValue()).getKeyName());
       }
     }
   ```
   <img width="1171" alt="Screenshot 2025-04-13 at 21 57 04" 
src="https://github.com/user-attachments/assets/a94c6bda-91cc-4f1e-aaf2-af29f3216bcc";
 />
   
   
   In this example I wrote a new StringCodec which serializes and deserializes 
in UTF_16LE where the rocksdb bytewise comparator will not match the ordering 
of the table as you said. But if you see this case where we have cache and 
table iteration which happens together for listing we are breaking listing for 
such codecs. So in such cases it becomes imperative that we have a comparator 
in the Codec class which acutally matches the same ordering as the underlying 
rocksdb's BytewiseComparator or we serialize and deserialize the object in such 
a way that the byte ordering of the object matches the object.compareTo() 
either way is fine. For example String UTF8 will exactly match the Bytewise 
unsigned int comparison. In such cases we can optimize by directly calling 
string.compareTo instead of converting to byte array and then do unsigned int 
comparison. As of date we haven't seen such issues in listing because we are 
using UTF8 which is fine. If we have similar implementation for compari
 ng longs/ints then it will break for negative longs/ints since MSB will be 1 
in that case rocksdb will have all negative numbers at the end. We have similar 
2 pointer iterator implementation in recon and from what I understand we have 
such implementation spread across multiple places in the code bases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HDDS-12779. Parallelize table iteration across different ranges [ozone]

Reply via email to