Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/14247 )
Change subject: IMPALA-8942: Set file format specific split sizes on non-block stores
......................................................................

IMPALA-8942: Set file format specific split sizes on non-block stores

On non-block based stores (e.g. S3, ADLS), the planner sizes splits
based on the value of FileSystem.getDefaultBlockSize(Path). This does
not work well for Parquet, because a scanner only processes a split if
the data range defined by the split contains the midpoint of a Parquet
row group. This is done to ensure that scanners treat Parquet row
groups as the unit of processing. The default block size for non-block
based stores is typically much lower than the Parquet row group size.
This causes a lot of dummy Parquet splits to be created and processed,
most of which end up doing nothing. The major issue this causes is
skew: scanners end up processing uneven amounts of data (see
IMPALA-3453 for details on the skew issue).

This patch adds a new query option PARQUET_OBJECT_STORE_SPLIT_SIZE
(defaults to 256 MB) that controls the size of Parquet splits on
non-block stores. Impala docs actually recommend setting
fs.s3a.block.size to 128 MB (row group size used by Hive / Spark) or
256 MB (row group size used by Impala). Setting the block size to the
row group size results in ideal split assignment, but experiments show
that using a 256 MB block size for 128 MB row groups is better than
using a 128 MB block size for 256 MB row groups, so the default value
of PARQUET_OBJECT_STORE_SPLIT_SIZE is 256 MB. Updated the docs
accordingly.
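
The midpoint rule described above can be illustrated with a small model.
This is only a sketch, not Impala code: the function names are
hypothetical, row groups are assumed to be uniformly sized and laid out
back to back (Parquet header and footer bytes are ignored), and the
32 MB split size is just an illustrative small block size.

```python
MB = 1024 * 1024


def make_splits(file_size, split_size):
    """Cut [0, file_size) into consecutive byte ranges of at most split_size."""
    return [(start, min(start + split_size, file_size))
            for start in range(0, file_size, split_size)]


def row_group_midpoints(file_size, row_group_size):
    """Midpoints of back-to-back row groups (simplifying assumption)."""
    return [start + (min(start + row_group_size, file_size) - start) // 2
            for start in range(0, file_size, row_group_size)]


def row_groups_per_split(file_size, row_group_size, split_size):
    """For each split, count the row groups whose midpoint falls inside it.

    A count of 0 models a "dummy" split: it is scheduled but reads no data.
    """
    mids = row_group_midpoints(file_size, row_group_size)
    return [sum(1 for m in mids if lo <= m < hi)
            for lo, hi in make_splits(file_size, split_size)]


# 1 GB file, 256 MB row groups, 32 MB splits: 32 splits, only 4 do any work.
small = row_groups_per_split(1024 * MB, 256 * MB, 32 * MB)
print(len(small), sum(1 for c in small if c > 0))  # 32 4

# 128 MB splits over 256 MB row groups: every other split is empty.
print(row_groups_per_split(1024 * MB, 256 * MB, 128 * MB))
# [0, 1, 0, 1, 0, 1, 0, 1]

# 256 MB splits over 128 MB row groups: no empty splits, two groups each.
print(row_groups_per_split(1024 * MB, 128 * MB, 256 * MB))
# [2, 2, 2, 2]
```

In this simplified model, an oversized split never produces empty splits,
while an undersized one does, which is consistent with the choice of
256 MB as the default.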

Testing:
* Ran core tests
* Added tests to test_scanners.py

Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Reviewed-on: http://gerrit.cloudera.org:8080/14247
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M docs/topics/impala_s3.xml
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M tests/query_test/test_scanners.py
7 files changed, 92 insertions(+), 4 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 7
Gerrit-Owner: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Alex Rodoni <arod...@cloudera.com>
Gerrit-Reviewer: David Rorke <dro...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>