gianm commented on code in PR #13955:
URL: https://github.com/apache/druid/pull/13955#discussion_r1149839145
##########
extensions-core/google-extensions/src/main/java/org/apache/druid/data/input/google/GoogleCloudStorageInputSource.java:
##########
@@ -79,73 +71,63 @@ protected InputEntity createEntity(CloudObjectLocation location)
return new GoogleCloudStorageEntity(storage, location);
}
- @Override
-  protected Stream<InputSplit<List<CloudObjectLocation>>> getPrefixesSplitStream(@Nonnull SplitHintSpec splitHintSpec)
- {
- final Iterator<List<StorageObject>> splitIterator = splitHintSpec.split(
- storageObjectIterable().iterator(),
- storageObject -> {
- final BigInteger sizeInBigInteger = storageObject.getSize();
- long sizeInLong;
- if (sizeInBigInteger == null) {
- sizeInLong = Long.MAX_VALUE;
- } else {
- try {
- sizeInLong = sizeInBigInteger.longValueExact();
- }
- catch (ArithmeticException e) {
- LOG.warn(
- e,
-              "The object [%s, %s] has a size [%s] out of the range of the long type. "
- + "The max long value will be used for its size instead.",
- storageObject.getBucket(),
- storageObject.getName(),
- sizeInBigInteger
- );
- sizeInLong = Long.MAX_VALUE;
- }
- }
- return new InputFileAttribute(sizeInLong);
- }
- );
-
- return Streams.sequentialStreamFrom(splitIterator)
-                  .map(objects -> objects.stream().map(this::byteSourceFromStorageObject).collect(Collectors.toList()))
- .map(InputSplit::new);
- }
-
@Override
  public SplittableInputSource<List<CloudObjectLocation>> withSplit(InputSplit<List<CloudObjectLocation>> split)
{
    return new GoogleCloudStorageInputSource(storage, inputDataConfig, null, null, split.get(), getObjectGlob());
}
-  private CloudObjectLocation byteSourceFromStorageObject(final StorageObject storageObject)
+ @Override
+ protected CloudObjectSplitWidget getSplitWidget()
{
- return GoogleUtils.objectToCloudObjectLocation(storageObject);
+ class SplitWidget implements CloudObjectSplitWidget
+ {
+ @Override
+      public Iterator<LocationWithSize> getDescriptorIteratorForPrefixes(List<URI> prefixes)
+ {
+ return Iterators.transform(
+ GoogleUtils.lazyFetchingStorageObjectsIterator(
+ storage,
+ prefixes.iterator(),
+ inputDataConfig.getMaxListingLength()
+ ),
+            object -> new LocationWithSize(object.getBucket(), object.getName(), getSize(object))
+ );
+ }
+
+ @Override
+      public long getObjectSize(CloudObjectLocation location) throws IOException
+ {
+        final StorageObject storageObject = storage.getMetadata(location.getBucket(), location.getPath());
+ return getSize(storageObject);
+ }
+ }
+
+ return new SplitWidget();
}
- private Iterable<StorageObject> storageObjectIterable()
+ private static long getSize(final StorageObject object)
{
- return () -> {
-      Iterator<StorageObject> iterator = GoogleUtils.lazyFetchingStorageObjectsIterator(
- storage,
- getPrefixes().iterator(),
- inputDataConfig.getMaxListingLength()
- );
+ final BigInteger sizeInBigInteger = object.getSize();
- // Skip files that didn't match glob filter.
- if (StringUtils.isNotBlank(getObjectGlob())) {
-        PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:" + getObjectGlob());
-
- iterator = Iterators.filter(
- iterator,
- object -> m.matches(Paths.get(object.getName()))
+ if (sizeInBigInteger == null) {
+ return Long.MAX_VALUE;
Review Comment:
TL;DR: I suspect this code path never actually executes.
> Given how the size is now used, is this the right answer? A naive splitter
will create a huge number of splits for a file of size 2^63.
We don't make partial-file splits, so this isn't a concern.
> Under what circumstances would the size be unknown?
☝️ Fwiw, this is the main reason I didn't change the logic (the code is
pre-existing; I moved it due to a refactor, but it's otherwise the same as what
was there before). I didn't know when the size would be unknown, and hadn't
spent time looking into it.
However, after spending a couple of minutes on it now, I found two things:
- In the code review for the PR that added the logic
(https://github.com/apache/druid/pull/9360/files#r382871722), there was a
mention of the possibility of the size overflowing a signed long. That isn't
realistic: an object can't be that big. So I don't think the
`ArithmeticException` code path is really going to happen.
- Google's documentation for `StorageObject#getSize()` says it returns
"value or `null` for none", but it doesn't say when the size might be
`null`. Neither does the documentation for the JSON API it comes from
(https://cloud.google.com/storage/docs/json_api/v1/objects), which makes no
mention of `size` ever being `null` or missing. It's possible the size is
never actually `null`.
It seems likely to me that neither of these things actually happens in
production… however, I don't really want to find out by removing this stuff in
this PR 🙂. Would be reasonable to remove it in a separate PR, though. That way,
if the analysis here is wrong, we can easily revert that separate commit.
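As a side note, the behavior in question is easy to demonstrate in isolation. Here's a minimal, self-contained sketch (class and method names are mine, not Druid's) of the same size-normalization logic: a `null` size and a size that overflows a signed long both collapse to `Long.MAX_VALUE`, which is why the `ArithmeticException` branch only matters if an object could ever exceed 2^63 - 1 bytes.

```java
import java.math.BigInteger;

// Hypothetical standalone version of the size-normalization logic under
// discussion. Not Druid code; just illustrates the two fallback paths.
public class SizeToLong
{
  public static long getSize(final BigInteger size)
  {
    if (size == null) {
      // Unknown size: fall back to the max value.
      return Long.MAX_VALUE;
    }
    try {
      // Throws ArithmeticException if the value doesn't fit in a signed long.
      return size.longValueExact();
    }
    catch (ArithmeticException e) {
      // Size larger than 2^63 - 1 bytes: also fall back to the max value.
      return Long.MAX_VALUE;
    }
  }

  public static void main(String[] args)
  {
    System.out.println(getSize(null));                     // 9223372036854775807
    System.out.println(getSize(BigInteger.valueOf(1024))); // 1024
    System.out.println(getSize(BigInteger.valueOf(Long.MAX_VALUE).add(BigInteger.ONE)));
    // → 9223372036854775807 (overflow path)
  }
}
```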
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]