Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12004 )
Change subject: IMPALA-7889: Write new logical types in Parquet ...................................................................... Patch Set 5: (17 comments) http://gerrit.cloudera.org:8080/#/c/12004/5/be/src/exec/parquet/hdfs-parquet-table-writer.cc File be/src/exec/parquet/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/12004/5/be/src/exec/parquet/hdfs-parquet-table-writer.cc@973 PS5, Line 973: parquet::SchemaElement& node = file_metadata_.schema[i + 1]; Maybe choose a better name here? http://gerrit.cloudera.org:8080/#/c/12004/5/be/src/exec/parquet/parquet-metadata-utils.h File be/src/exec/parquet/parquet-metadata-utils.h: http://gerrit.cloudera.org:8080/#/c/12004/5/be/src/exec/parquet/parquet-metadata-utils.h@62 PS5, Line 62: static void FillSchemaElement(parquet::SchemaElement& node, const ColumnType& type, We usually put input parameters first, then output, i.e. move node to the end. By convention we also pass output parameters by pointer. http://gerrit.cloudera.org:8080/#/c/12004/5/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/12004/5/be/src/exec/parquet/parquet-metadata-utils.cc@73 PS5, Line 73: }; This is actually the end of the anonymous namespace and confused me. Please remove the semicolon and add a comment // anonymous namespace. http://gerrit.cloudera.org:8080/#/c/12004/5/be/src/exec/parquet/parquet-metadata-utils.cc@78 PS5, Line 78: bool IsSupportedType(PrimitiveType impala_type, This method should be in the anonymous namespace http://gerrit.cloudera.org:8080/#/c/12004/5/be/src/exec/parquet/parquet-metadata-utils.cc@153 PS5, Line 153: static bool IsEncodingSupported(parquet::Encoding::type e) { This can be in the anonymous namespace, too http://gerrit.cloudera.org:8080/#/c/12004/5/be/src/exec/parquet/parquet-metadata-utils.cc@283 PS5, Line 283: void SetIntLogicalType(parquet::SchemaElement& node, int bitwidth) { Move to the anonymous namespace. Pass output parameters by pointer and move to the end of the parameter list. Also add a brief function comment. http://gerrit.cloudera.org:8080/#/c/12004/5/be/src/exec/parquet/parquet-metadata-utils.cc@292 PS5, Line 292: node Can you think of a better name for the schema element than "node"? http://gerrit.cloudera.org:8080/#/c/12004/5/be/src/exec/parquet/parquet-metadata-utils.cc@297 PS5, Line 297: if (type.type == TYPE_DECIMAL) { Should this be a case statement instead? http://gerrit.cloudera.org:8080/#/c/12004/5/be/src/exec/parquet/parquet-metadata-utils.cc@310 PS5, Line 310: else if (type.type == TYPE_VARCHAR || type.type == TYPE_CHAR || Move else to previous line. http://gerrit.cloudera.org:8080/#/c/12004/5/be/src/exec/parquet/parquet-metadata-utils.cc@311 PS5, Line 311: (type.type == TYPE_STRING && query_options.parquet_annotate_strings_utf8)) { Please add a brief comment that VARCHAR and CHAR are always UTF8 annotated, but STRING is only when set by the query option (I had to look it up). http://gerrit.cloudera.org:8080/#/c/12004/5/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: http://gerrit.cloudera.org:8080/#/c/12004/5/tests/query_test/test_insert_parquet.py@362 PS5, Line 362: def _ctas_and_get_metadata(self, vector, unique_database, tmp_dir, source_table): Add function comment (here and for all other new functions http://gerrit.cloudera.org:8080/#/c/12004/5/tests/query_test/test_insert_parquet.py@376 PS5, Line 376: file_metadata_list = get_parquet_metadata_from_hdfs_folder(hdfs_path, tmp_dir) Maybe you can clean up some of the other uses of os.walk() in this file? http://gerrit.cloudera.org:8080/#/c/12004/5/tests/query_test/test_insert_parquet.py@388 PS5, Line 388: schema = self._find_schema(schemas, column_name) You could add assert schema is not None You could also add that in _find_schema, since it seems required that the schema exists. http://gerrit.cloudera.org:8080/#/c/12004/5/tests/query_test/test_insert_parquet.py@394 PS5, Line 394: bit_width_to_converted_type_map = {8: 15, 16: 16, 32: 17, 64: 18} Can you replace the values with the names from parquet.thrift? http://gerrit.cloudera.org:8080/#/c/12004/5/tests/query_test/test_insert_parquet.py@397 PS5, Line 397: assert schema.logicalType.INTEGER is not None The other types have to be None, right? If so, maybe create a helper _check_only_one_type_is_set(the_type) and then make sure that all others are None. http://gerrit.cloudera.org:8080/#/c/12004/5/tests/query_test/test_insert_parquet.py@432 PS5, Line 432: if nit: once http://gerrit.cloudera.org:8080/#/c/12004/5/tests/util/get_parquet_metadata.py File tests/util/get_parquet_metadata.py: http://gerrit.cloudera.org:8080/#/c/12004/5/tests/util/get_parquet_metadata.py@181 PS5, Line 181: for f in files: Add (as some uses of this in test_insert_parquet.py do): if not f.endswith('parq'): continue -- To view, visit http://gerrit.cloudera.org:8080/12004 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6f377950845683ab9c6dea79f4c54db0359d0b91 Gerrit-Change-Number: 12004 Gerrit-PatchSet: 5 Gerrit-Owner: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Comment-Date: Fri, 30 Nov 2018 05:10:39 +0000 Gerrit-HasComments: Yes
