Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10483 )
Change subject: IMPALA-7044: Prevent overflow when computing Parquet block size
......................................................................

Patch Set 4:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc@795
PS4, Line 795: num_cols
It's a bit misleading to say "number of columns in the target table" and give this value, since the value is the number of columns in the Parquet file, not the table (when partitioned).

http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc@796
PS4, Line 796: }
The preexisting block size logic is pretty convoluted. It appears that this check is completely redundant with the one you are adding to hdfs-table-writer and with the check in HdfsParquetTableWriter::InitNewFile(). I'm not opposed to adding this, since the error message is more informative, but I don't think that was your intent, so I wanted to check that logic. I'm also wondering whether we could consolidate the checks rather than having them sprinkled around (e.g. move this check into HdfsParquetTableWriter::InitNewFile()). Hmm, but then it might be too late, since we'd have already hit the "generic" 2GB check you're adding to HdfsTableWriter.

http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc@925
PS4, Line 925: num_cols
Maybe call this num_file_columns to make it clear that it's the file columns, not the table columns.

http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc@979
PS4, Line 979: a table with " << num_cols << " columns
Not your change, but this is also confused.
http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-table-sink.cc@371
PS4, Line 371: output_partition->partition_descriptor->block_size();
In what cases can this be > 2GB? I'm wondering if we should just cap this value at 2GB and only issue the 2GB error in the cases where the user can actually do something about it.

--
To view, visit http://gerrit.cloudera.org:8080/10483
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6e63420e5a093c0bbc789201771708865b16e138
Gerrit-Change-Number: 10483
Gerrit-PatchSet: 4
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Thomas Marshall <[email protected]>
Gerrit-Comment-Date: Wed, 30 May 2018 21:19:15 +0000
Gerrit-HasComments: Yes
