Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/14247 )

Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
......................................................................

IMPALA-8942: Set file format specific split sizes on non-block stores

On non-block based stores (e.g. S3, ADLS), the planner sizes splits
based on the value of FileSystem.getDefaultBlockSize(Path).
This does not work well for Parquet, because the scanners will only
process a split if the data range defined by the split overlaps with
the midpoint of the Parquet row group. This is done to ensure that
scanners treat Parquet row groups as the unit of processing. The default
block size for non-block based stores is typically much smaller than the
Parquet row group size. This causes a lot of dummy Parquet splits to be
created and processed, most of which end up doing nothing. The main
problem this causes is skew: scanners end up processing very uneven
amounts of data (see IMPALA-3453 for details on the skew issue).
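
The midpoint rule described above can be illustrated with a small
simulation. This is only a sketch, not Impala's planner or scanner code:
the function names are hypothetical, row groups are assumed to be
fixed-size and laid out back to back (footer and metadata ignored), and
the 32 MB block size and 1 GB file size are assumed values chosen for
illustration.

```python
MB = 1024 * 1024


def make_splits(file_size, split_size):
    """Cut [0, file_size) into consecutive (offset, length) splits."""
    return [(off, min(split_size, file_size - off))
            for off in range(0, file_size, split_size)]


def row_group_midpoints(file_size, row_group_size):
    """Midpoints of back-to-back row groups (simplifying assumption)."""
    return [start + min(row_group_size, file_size - start) // 2
            for start in range(0, file_size, row_group_size)]


def row_groups_per_split(file_size, split_size, row_group_size):
    """For each split, count the row groups whose midpoint falls inside it.

    A split with a count of 0 is a "dummy" split: it is scheduled but
    has no row group to process.
    """
    mids = row_group_midpoints(file_size, row_group_size)
    return [sum(1 for m in mids if off <= m < off + length)
            for off, length in make_splits(file_size, split_size)]


# Hypothetical 1 GB file with 256 MB row groups and an assumed 32 MB block size.
counts = row_groups_per_split(1024 * MB, 32 * MB, 256 * MB)
print(len(counts), "splits,", counts.count(0), "of them empty")
```

Under these assumptions only 4 of the 32 splits own a row group, so
whichever scanners receive those 4 splits do all of the work.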

This patch adds a new query option PARQUET_OBJECT_STORE_SPLIT_SIZE
(defaults to 256 MB) that controls the size of Parquet splits on
non-block stores.

Impala docs actually recommend setting fs.s3a.block.size to 128 MB
(row group size used by Hive / Spark) or 256 MB (row group size used by
Impala). Setting the block size to the row group size results in ideal
split assignment, but experiments show that using a 256 MB block size
for 128 MB row groups is better than using a 128 MB block size for 256
MB row groups, so the default value of PARQUET_OBJECT_STORE_SPLIT_SIZE is
256 MB. Updated the docs accordingly.
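
One possible intuition for the asymmetry between the two mismatched
configurations is sketched below. It is not the experiment referred to
above and does not model Impala's scheduler: the helper name is
hypothetical, row groups are assumed to be fixed-size and contiguous,
and the 1 GB file size is an assumed value.

```python
MB = 1024 * 1024


def row_groups_per_split(file_size, split_size, row_group_size):
    """Count, per split, the row groups whose midpoint lies in that split."""
    midpoints = [start + min(row_group_size, file_size - start) // 2
                 for start in range(0, file_size, row_group_size)]
    counts = []
    for offset in range(0, file_size, split_size):
        end = min(offset + split_size, file_size)
        counts.append(sum(1 for m in midpoints if offset <= m < end))
    return counts


FILE_SIZE = 1024 * MB  # hypothetical 1 GB Parquet file

# 256 MB splits over 128 MB row groups: every split owns two row groups.
large_split_small_groups = row_groups_per_split(FILE_SIZE, 256 * MB, 128 * MB)

# 128 MB splits over 256 MB row groups: every other split owns nothing.
small_split_large_groups = row_groups_per_split(FILE_SIZE, 128 * MB, 256 * MB)

print(large_split_small_groups)
print(small_split_large_groups)
```

In this simplified model the larger split size keeps every split busy,
while the smaller one leaves half of the splits empty, which is
consistent with the choice of 256 MB as the default.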

Testing:
* Ran core tests
* Added tests to test_scanners.py

Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Reviewed-on: http://gerrit.cloudera.org:8080/14247
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M docs/topics/impala_s3.xml
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M tests/query_test/test_scanners.py
7 files changed, 92 insertions(+), 4 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 7
Gerrit-Owner: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Alex Rodoni <arod...@cloudera.com>
Gerrit-Reviewer: David Rorke <dro...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
