gianm opened a new pull request, #13955:
URL: https://github.com/apache/druid/pull/13955

   1) Update CloudObjectInputSource and its subclasses (S3, GCS,
      Azure, Aliyun OSS) to use SplitHintSpecs in all cases. Previously,
      split hints were applied only to prefixes, not to uris or objects.
   
   2) Update ExternalInputSpecSlicer (MSQ) to consider file size. Previously,
      file size was ignored; all files were treated as equal weight when
      determining splits.
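
To illustrate what size-aware splitting means here, the following is a minimal sketch, not the code in this PR. It assumes a simple greedy packing rule: objects are taken in order and a split is closed once adding the next object would push it past a byte limit. The class name `SizeAwareSplitSketch`, the method `split`, and the parameter `maxSplitSize` are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Druid's actual classes: pack object sizes into
// splits whose total size stays at or below maxSplitSize where possible.
final class SizeAwareSplitSketch {
  static List<List<Long>> split(List<Long> objectSizes, long maxSplitSize) {
    List<List<Long>> splits = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentSize = 0;
    for (long size : objectSizes) {
      // Close the current split if adding this object would exceed the limit.
      if (!current.isEmpty() && currentSize + size > maxSplitSize) {
        splits.add(current);
        current = new ArrayList<>();
        currentSize = 0;
      }
      // An object larger than the limit still gets a split of its own.
      current.add(size);
      currentSize += size;
    }
    if (!current.isEmpty()) {
      splits.add(current);
    }
    return splits;
  }

  public static void main(String[] args) {
    // Four objects with very different sizes and a 1000-byte limit.
    System.out.println(split(List.of(900L, 800L, 100L, 100L), 1000L));
  }
}
```

If every file were given equal weight, these four objects might be divided two and two, leaving one split with 1700 bytes and the other with 200. Packing by size is only possible when object sizes are known, which is why sizes have to be looked up for uris and objects as well as prefixes.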
   
   A side effect of these changes is that we'll make additional network
   calls to find the sizes of objects when users specify URIs or objects
   as opposed to prefixes. IMO, this is worth it because it's the only
   way to respect the user's split hint and task assignment settings.
   
   Secondary changes:
   
   1) S3, Aliyun OSS: Use getObjectMetadata instead of listObjects to get
      metadata for a single object. This is a simpler call that is also
      expected to be less expensive.
   
   2) Azure: Fix a bug where getBlobLength did not populate blob
      reference attributes, and therefore would not actually retrieve the
      blob length.
   
   3) MSQ: Align dynamic slicing logic between ExternalInputSpecSlicer and
      TableInputSpecSlicer.
   
   4) MSQ: Adjust WorkerInputs to ensure there is always at least one
      worker, even if it has a nil slice.
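
As a rough illustration of item 4, the sketch below shows one way an assignment step could guarantee at least one worker. It is not the WorkerInputs code from this PR: the class `AtLeastOneWorkerSketch`, the method `assign`, and the `NIL_SLICE` placeholder are hypothetical, and slices are represented as plain strings for brevity.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names are illustrative, not Druid's WorkerInputs API.
final class AtLeastOneWorkerSketch {
  // Stand-in for a slice that carries no input.
  static final String NIL_SLICE = "nil";

  static Map<Integer, String> assign(List<String> slices) {
    Map<Integer, String> assignments = new LinkedHashMap<>();
    // Give each slice to its own worker, numbered from 0.
    for (int worker = 0; worker < slices.size(); worker++) {
      assignments.put(worker, slices.get(worker));
    }
    // With no slices at all, still assign worker 0 a nil slice.
    if (assignments.isEmpty()) {
      assignments.put(0, NIL_SLICE);
    }
    return assignments;
  }

  public static void main(String[] args) {
    System.out.println(assign(List.of()));
    System.out.println(assign(List.of("a", "b")));
  }
}
```

With an empty input the map contains a single entry for worker 0, so downstream code that expects a non-empty worker set does not need a special case.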


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

