Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/10483 )
Change subject: IMPALA-7044: Prevent overflow when computing Parquet block size
......................................................................

Patch Set 4:

(5 comments)

Thanks for the review. Please see PS5 and my inline comments. Let me know if you'd like to chat in person about the various error checks.

http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc@795
PS4, Line 795: num_cols
> it's a bit misleading to say "number of columns in the target table" and gi
Updated the message.

http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc@796
PS4, Line 796: }
> The preexisting block size logic is pretty convoluted. It appears that this
The check in hdfs-table-writer is supposed to catch future errors and act as a backstop to prevent the overflow (see my longer comment there). The check in InitNewFile() below makes sure that the configured file size limit is large enough for the number of non-partitioning columns. Users can also hit it by setting the parquet_file_size option to limit the target file size.

http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc@925
PS4, Line 925: num_cols
> maybe call this num_file_columns to be clear that it's the file columns not
Done

http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-parquet-table-writer.cc@979
PS4, Line 979: a table with " << num_cols << " columns
> not your change, but this is also confused.
Updated the message.

http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

http://gerrit.cloudera.org:8080/#/c/10483/4/be/src/exec/hdfs-table-sink.cc@371
PS4, Line 371: output_partition->partition_descriptor->block_size();
> in what cases can this be > 2GB? I'm wondering if we should just cap this v
My intent here is to catch any overflow that does not get caught elsewhere in the code. When calling hdfsOpenFile() below we must be sure that the value is <= 2GB. A future change elsewhere in the code could introduce such an overflow without necessarily triggering it during testing, so I opted to log it and abort the query instead of letting the overflow occur. At that point a user might not be able to fix the query, but at least they don't get unexpected results (incorrect results, crashes). Should I add a comment to explain the intent better?

--
To view, visit http://gerrit.cloudera.org:8080/10483
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6e63420e5a093c0bbc789201771708865b16e138
Gerrit-Change-Number: 10483
Gerrit-PatchSet: 4
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Thomas Marshall <[email protected]>
Gerrit-Comment-Date: Thu, 31 May 2018 20:02:41 +0000
Gerrit-HasComments: Yes
