Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10483 )

Change subject: IMPALA-7044: Prevent overflow when computing Parquet block size
......................................................................


Patch Set 4:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc@795
PS4, Line 795: num_cols
It's a bit misleading to say "number of columns in the target table" and then report this value, since it counts the columns in the Parquet file, not the table (the two differ when the table is partitioned).


http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc@796
PS4, Line 796:   }
The preexisting block size logic is pretty convoluted. It appears that this check is completely redundant with the one you are adding to HdfsTableWriter and with the check in HdfsParquetTableWriter::InitNewFile(). I'm not opposed to adding it, since the error message is more informative, but I don't think that was your intent, so I wanted to double-check that logic. I'm also wondering if we could consolidate the checks rather than having them sprinkled around (e.g. move this check into HdfsParquetTableWriter::InitNewFile()). Hmm, but by then it might be too late, since we'd have already hit the "generic" 2GB check you're adding to HdfsTableWriter.
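Just to make the idea concrete, here's a rough, untested sketch of what a single consolidated check might look like; the constants and identifiers below are placeholders I'm assuming for illustration, not the actual names in the patch:

  #include <cstdint>
  #include <limits>
  #include <sstream>
  #include <string>

  // Sketch of one consolidated check that could live in InitNewFile():
  // verify that the minimum space needed per row group (one data page per
  // file column) fits within the signed 32-bit block size, and report the
  // per-column breakdown in the error message.
  constexpr int64_t kDefaultDataPageSize = 64 * 1024;       // assumed page size
  constexpr int64_t kMaxBlockSize =
      std::numeric_limits<int32_t>::max();                   // ~2GB limit

  bool CheckMinBlockSize(int num_file_columns, std::string* err) {
    // Do the multiplication in 64 bits so it cannot overflow before the check.
    int64_t min_block_size = kDefaultDataPageSize * num_file_columns;
    if (min_block_size > kMaxBlockSize) {
      std::ostringstream ss;
      ss << "Minimum required block size (" << min_block_size << " bytes) for "
         << num_file_columns << " file columns exceeds the 2GB limit.";
      *err = ss.str();
      return false;
    }
    return true;
  }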


http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc@925
PS4, Line 925: num_cols
Maybe call this num_file_columns to make clear that it counts the file columns, not the table columns.


http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc@979
PS4, Line 979: a table with " << num_cols << " columns
Not your change, but this message is also confusing.


http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-table-sink.cc@371
PS4, Line 371: output_partition->partition_descriptor->block_size();
In what cases can this be > 2GB? I'm wondering if we should just cap this value at 2GB and only issue the 2GB error in the cases where the user can actually do something about it.
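Something along these lines, as an untested sketch; the clamping point and names are my assumptions for illustration, not the code in hdfs-table-sink.cc:

  #include <algorithm>
  #include <cstdint>
  #include <limits>

  // Sketch of capping the partition-specified block size instead of erroring:
  // values above the 2GB limit get clamped, on the theory that the user often
  // can't influence the partition descriptor's block size anyway.
  constexpr int64_t kMaxBlockSize = std::numeric_limits<int32_t>::max();  // ~2GB

  int64_t EffectiveBlockSize(int64_t partition_block_size) {
    return std::min(partition_block_size, kMaxBlockSize);
  }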



--
To view, visit http://gerrit.cloudera.org:8080/10483
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6e63420e5a093c0bbc789201771708865b16e138
Gerrit-Change-Number: 10483
Gerrit-PatchSet: 4
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Thomas Marshall <[email protected]>
Gerrit-Comment-Date: Wed, 30 May 2018 21:19:15 +0000
Gerrit-HasComments: Yes
