Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/10483 )

Change subject: IMPALA-7044: Prevent overflow when computing Parquet block size
......................................................................


Patch Set 4:

(5 comments)

Thanks for the review. Please see PS5 and my inline comments. Let me know if 
you'd like to chat in person about the various error checks.

http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc@795
PS4, Line 795: num_cols
> it's a bit misleading to say "number of columns in the target table" and gi
Updated the message


http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc@796
PS4, Line 796:   }
> The preexisting block size logic is pretty convoluted. It appears that this
The check in hdfs-table-sink is supposed to catch future errors and act as a
backstop against the overflow (see my longer comment there).

The check in InitNewFile() below makes sure that the configured file size limit
is large enough for the number of non-partitioning columns. Users can also hit
it by setting the parquet_file_size query option to a value that is too small
for the target table.
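
For illustration, the InitNewFile() check has roughly this shape (a minimal
sketch; the constant, helper, and error-message names here are illustrative,
not the exact ones from the patch):

#include <cstdint>
#include <string>

// Sketch of the InitNewFile() sanity check: the configured file size limit
// must leave room for at least one data page per non-partitioning column.
// DEFAULT_PAGE_SIZE and all names here are assumptions for illustration.
static std::string CheckFileSizeLimit(int64_t file_size_limit, int num_cols) {
  constexpr int64_t DEFAULT_PAGE_SIZE = 64 * 1024;  // assumed per-column minimum
  if (file_size_limit < num_cols * DEFAULT_PAGE_SIZE) {
    return "file size limit " + std::to_string(file_size_limit) +
           " is too small for " + std::to_string(num_cols) +
           " columns; increase the PARQUET_FILE_SIZE query option";
  }
  return "";  // empty string == OK, standing in for Status::OK()
}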


http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc@925
PS4, Line 925: num_cols
> maybe call this num_file_columns to be clear that it's the file columns not
Done


http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc@979
PS4, Line 979: a table with " << num_cols << " columns
> not your change, but this is also confused.
Updated the message


http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-table-sink.cc@371
PS4, Line 371: output_partition->partition_descriptor->block_size();
> in what cases can this be > 2GB? I'm wondering if we should just cap this v
My intent here is to catch any overflow that does not get caught elsewhere in
the code. When calling hdfsOpenFile() below we must be sure that the value fits
into a signed 32-bit integer, since libhdfs takes the block size as a tSize
(int32_t). A future change elsewhere in the code could introduce such an
overflow without immediately triggering this path, so I opted to log the error
and abort the query instead of letting the overflow happen. At that point a
user might not be able to fix the query themselves, but at least they don't get
unexpected behavior (incorrect results, crashes).
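
To make that concrete, the guard is roughly this shape (a sketch with
illustrative names, not the exact code from the patch):

#include <cstdint>
#include <limits>
#include <string>

// Sketch of the backstop: validate the 64-bit block size before it is
// narrowed to the int32_t (tSize) that hdfsOpenFile() expects; anything
// larger would silently wrap around. A plain string stands in for Impala's
// Status here; an empty result means OK.
static std::string ValidateBlockSize(int64_t block_size) {
  if (block_size > std::numeric_limits<int32_t>::max()) {
    return "block size " + std::to_string(block_size) +
           " would overflow the int32_t passed to hdfsOpenFile()";
  }
  return "";  // safe to cast to tSize and open the file
}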

Should I add a comment to explain the intent better?



--
To view, visit http://gerrit.cloudera.org:8080/10483
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6e63420e5a093c0bbc789201771708865b16e138
Gerrit-Change-Number: 10483
Gerrit-PatchSet: 4
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Thomas Marshall <[email protected]>
Gerrit-Comment-Date: Thu, 31 May 2018 20:02:41 +0000
Gerrit-HasComments: Yes
