gianm commented on code in PR #13955:
URL: https://github.com/apache/druid/pull/13955#discussion_r1149839145
##########
extensions-core/google-extensions/src/main/java/org/apache/druid/data/input/google/GoogleCloudStorageInputSource.java:
##########
@@ -79,73 +71,63 @@ protected InputEntity createEntity(CloudObjectLocation location)
return new GoogleCloudStorageEntity(storage, location);
}
- @Override
-  protected Stream<InputSplit<List<CloudObjectLocation>>> getPrefixesSplitStream(@Nonnull SplitHintSpec splitHintSpec)
- {
- final Iterator<List<StorageObject>> splitIterator = splitHintSpec.split(
- storageObjectIterable().iterator(),
- storageObject -> {
- final BigInteger sizeInBigInteger = storageObject.getSize();
- long sizeInLong;
- if (sizeInBigInteger == null) {
- sizeInLong = Long.MAX_VALUE;
- } else {
- try {
- sizeInLong = sizeInBigInteger.longValueExact();
- }
- catch (ArithmeticException e) {
- LOG.warn(
- e,
-              "The object [%s, %s] has a size [%s] out of the range of the long type. "
- + "The max long value will be used for its size instead.",
- storageObject.getBucket(),
- storageObject.getName(),
- sizeInBigInteger
- );
- sizeInLong = Long.MAX_VALUE;
- }
- }
- return new InputFileAttribute(sizeInLong);
- }
- );
-
- return Streams.sequentialStreamFrom(splitIterator)
-                  .map(objects -> objects.stream().map(this::byteSourceFromStorageObject).collect(Collectors.toList()))
- .map(InputSplit::new);
- }
-
@Override
  public SplittableInputSource<List<CloudObjectLocation>> withSplit(InputSplit<List<CloudObjectLocation>> split)
{
    return new GoogleCloudStorageInputSource(storage, inputDataConfig, null, null, split.get(), getObjectGlob());
}
-  private CloudObjectLocation byteSourceFromStorageObject(final StorageObject storageObject)
+ @Override
+ protected CloudObjectSplitWidget getSplitWidget()
{
- return GoogleUtils.objectToCloudObjectLocation(storageObject);
+ class SplitWidget implements CloudObjectSplitWidget
+ {
+ @Override
+      public Iterator<LocationWithSize> getDescriptorIteratorForPrefixes(List<URI> prefixes)
+ {
+ return Iterators.transform(
+ GoogleUtils.lazyFetchingStorageObjectsIterator(
+ storage,
+ prefixes.iterator(),
+ inputDataConfig.getMaxListingLength()
+ ),
+            object -> new LocationWithSize(object.getBucket(), object.getName(), getSize(object))
+ );
+ }
+
+ @Override
+      public long getObjectSize(CloudObjectLocation location) throws IOException
+ {
+        final StorageObject storageObject = storage.getMetadata(location.getBucket(), location.getPath());
+ return getSize(storageObject);
+ }
+ }
+
+ return new SplitWidget();
}
- private Iterable<StorageObject> storageObjectIterable()
+ private static long getSize(final StorageObject object)
{
- return () -> {
-      Iterator<StorageObject> iterator = GoogleUtils.lazyFetchingStorageObjectsIterator(
- storage,
- getPrefixes().iterator(),
- inputDataConfig.getMaxListingLength()
- );
+ final BigInteger sizeInBigInteger = object.getSize();
- // Skip files that didn't match glob filter.
- if (StringUtils.isNotBlank(getObjectGlob())) {
-        PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:" + getObjectGlob());
-
- iterator = Iterators.filter(
- iterator,
- object -> m.matches(Paths.get(object.getName()))
+ if (sizeInBigInteger == null) {
+ return Long.MAX_VALUE;
Review Comment:
TL;DR: I suspect this code path never actually executes.
> Given how the size is now used, is this the right answer? A naive splitter
will create a huge number of splits for a file of size 2^63.
We don't make partial-file splits, so this isn't a concern.
> Under what circumstances would the size be unknown?
☝️ Fwiw, this is the main reason I didn't change the logic (the code is
pre-existing; I moved it due to a refactor, but it's otherwise the same as what
was there before). I didn't know when the size would be unknown, and hadn't
spent time looking into it.
However, after spending a couple of minutes on it now, I found two things:
- In the code review for the PR that added the logic
(https://github.com/apache/druid/pull/9360/files#r382871722), there was a
mention of the possibility of the size overflowing a signed long. That isn't
realistic: an object can't be that big. So I don't think the
`ArithmeticException` code path is really going to happen.
- Google's documentation for `StorageObject#getSize()` says it returns
"value or `null` for none", but it doesn't say when the size might be
`null`. Neither does the documentation for the JSON API it comes from
(https://cloud.google.com/storage/docs/json_api/v1/objects), which makes no
mention of `size` ever being `null` or missing. It's possible the size is
never actually `null`.
It seems likely to me that neither of these things actually happens in
production… however, I don't really want to find out by removing this stuff in
this PR 🙂. Would be reasonable to remove it in a separate PR, though. That way,
if the analysis here is wrong, we can easily revert that separate commit.
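As a side note, the behavior in question is easy to demonstrate in isolation. Here's a minimal, self-contained sketch (class and method names are mine, not Druid's) of the same size-normalization logic: a `null` size and a size that overflows a signed long both collapse to `Long.MAX_VALUE`, which is why the `ArithmeticException` branch only matters if an object could ever exceed 2^63 - 1 bytes.

```java
import java.math.BigInteger;

// Hypothetical standalone version of the size-normalization logic under
// discussion. Not Druid code; just illustrates the two fallback paths.
public class SizeToLong
{
  public static long getSize(final BigInteger size)
  {
    if (size == null) {
      // Unknown size: fall back to the max value.
      return Long.MAX_VALUE;
    }
    try {
      // Throws ArithmeticException if the value doesn't fit in a signed long.
      return size.longValueExact();
    }
    catch (ArithmeticException e) {
      // Size larger than 2^63 - 1 bytes: also fall back to the max value.
      return Long.MAX_VALUE;
    }
  }

  public static void main(String[] args)
  {
    System.out.println(getSize(null));                     // 9223372036854775807
    System.out.println(getSize(BigInteger.valueOf(1024))); // 1024
    System.out.println(getSize(BigInteger.valueOf(Long.MAX_VALUE).add(BigInteger.ONE)));
    // → 9223372036854775807 (overflow path)
  }
}
```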
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]