Internal Jenkins has submitted this change and it was merged.

Change subject: IMPALA-3453: S3: Uneven split sizes are generated for Parquet causing execution skew
......................................................................


IMPALA-3453: S3: Uneven split sizes are generated for Parquet causing execution skew

Previously, we treated Parquet as a non-splittable file format.
However, the Parquet scanner has since been changed to assign row
groups based on the split that contains them. This allows us to chop
a Parquet file into multiple splits and still have the file scanned
reliably.

This patch marks Parquet as a splittable file format, which allows
synthesizeBlockMetadata() to split a Parquet file on S3 into multiple
"blocks" instead of assigning one scan range per file. Scan ranges are
then distributed evenly across the cluster, which greatly reduces skew.

P.S: To control the size of scan ranges for splittable files on S3,
you can change the default "block" size for the S3A filesystem, which
is governed by "fs.s3a.block.size". Its default value is 32MB.

Change-Id: Ib1518ad0c89ef35a3b0567c3902e85a41e34bc3d
Reviewed-on: http://gerrit.cloudera.org:8080/2968
Reviewed-by: Sailesh Mukil <[email protected]>
Tested-by: Internal Jenkins
---
M fe/src/main/java/com/cloudera/impala/catalog/HdfsFileFormat.java
1 file changed, 1 insertion(+), 2 deletions(-)

Approvals:
  Internal Jenkins: Verified
  Sailesh Mukil: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/2968
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ib1518ad0c89ef35a3b0567c3902e85a41e34bc3d
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Sailesh Mukil <[email protected]>
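
Illustration of the splitting described in the commit message: a minimal Java sketch of how a file on S3 might be carved into synthetic "blocks" of a configured size once its format is considered splittable. This is not the actual synthesizeBlockMetadata() code. The class SyntheticBlocks, the Block type, and the split() method are hypothetical names chosen for this example, and the 32MB figure is simply the "fs.s3a.block.size" default quoted above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Impala's implementation.
public class SyntheticBlocks {
    /** One synthetic "block": a byte range within a file. */
    public static final class Block {
        public final long offset;
        public final long length;

        Block(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Splits a file of fileLength bytes into ranges of at most blockSize bytes.
     * A non-splittable format always yields a single range covering the file,
     * which corresponds to one scan range per file.
     */
    public static List<Block> split(long fileLength, long blockSize, boolean splittable) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<Block> blocks = new ArrayList<>();
        if (!splittable || fileLength <= blockSize) {
            blocks.add(new Block(0, fileLength));
            return blocks;
        }
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // The last range may be shorter than blockSize.
            blocks.add(new Block(offset, Math.min(blockSize, fileLength - offset)));
        }
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 100MB file with a 32MB block size (the default quoted above).
        for (Block b : split(100 * mb, 32 * mb, true)) {
            System.out.println("offset=" + b.offset + " length=" + b.length);
        }
    }
}
```

Under these assumptions, a 100MB splittable file with a 32MB block size yields four ranges (three of 32MB and one of 4MB), while the same file treated as non-splittable yields a single 100MB range. That difference is the source of the skew the patch addresses.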
