Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )
Change subject: IMPALA-6503: Support reading complex types from ORC ...................................................................... Patch Set 25: (33 comments) Ah, I missed that this was almost committed. I had started reviewing but hadn't finished so hadn't posted the comments yet. I'll post my comments here for now. Quanlong, do you mind if I build a patch on this to take care of my suggested cleanups? http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h File be/src/exec/hdfs-orc-scanner.h: http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@29 PS18, Line 29: #include "exec/orc-metadata-utils.h" nit: do you need to include this here or could you get by with forward declarations? http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@155 PS18, Line 155: int64_t stripe_rows_read_ = 0; nit: nullptr is the default initializer already for unique_ptr http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@174 PS18, Line 174: nit: "owned by" http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@182 PS18, Line 182: tch batch updated I noticed that the ORC docs are not very consistent on terminology here. These are equivalent to "column" IDs as referenced above, right? http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@235 PS18, Line 235: /// Advances 'stripe_idx_' to the next non-empty stripe and initializes can you update the doc to explain the 'column_reader' parameter here? The docs say this transfers tuples from orc_root_batch_, but in fact maybe it is transferring from the provided column reader? http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@243 PS18, Line 243: 'selected_indices' doesn't seem to exist here. Should it be "selected nodes"? http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@244 PS18, Line 244: /// Materialize collection(list/map) tuples belong to the 'row_idx'-th row of I'm having trouble understanding this documentation. Maybe worth giving an example here? http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@249 PS18, Line 249: row_reader_options_ http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h File be/src/exec/orc-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@33 PS18, Line 33: types nit: typo: "type" http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@41 PS18, Line 41: /// while ( /* has more rows to read */ ) { Performance-wise, would it be possible to actually materialize a column at a time? Or do you think doing it a tuple at a time will be fast once we later introduce codegen? http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@57 PS18, Line 57: guaranteed typo: guarantee http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@72 PS18, Line 72: OrcColumnReader(const orc::Type* node, const SlotDescriptor* slot_desc, : HdfsOrcScanner* scanner); instead of having both a public constructor and a public factory function above, could this be another factory function, with an appropriate name? Or could this be made private if it's only meant to be used by the factory function itself? http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@79 PS18, Line 79: virtual bool MaterializeTuple() const { return true; } Can we rename this to sound more like a boolean? As is, it sounds like a function which materializes a tuple. I'm also a little confused on the description -- I would think that as described, this function returns whether or not the reader is responsible for materializing a specific slot? http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@88 PS18, Line 88: orc::ColumnVectorBatch can this be const? http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@122 PS18, Line 122: class OrcBoolColumnReader : public OrcColumnReader { do we need to define all the subtypes of readers in the .h file or could they be moved to the .cc file? http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@46 PS18, Line 46: switch (slot_desc->type().type) { I think it's worth a comment explaining why we switched above on node->getKind() but from here downward we're switching on the Impala slot type. Have we already verified the compatibility elsewhere? Or will we verify compatibility inside the constructor? http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@102 PS18, Line 102: DCHECK(false) << slot_desc->type().DebugString(); I think we should be more defensive here and actually return a bad Status in the case of an unknown type. It seems we're missing UNION and DATE at least here. We don't want to crash in the case that someone passes an ORC file with such a bad type. http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@112 PS18, Line 112: if (node->getKind() == orc::TypeKind::STRUCT) { I'm a bit confused. Isn't the top-level reader always going to be a STRUCT? http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.h File be/src/exec/orc-metadata-utils.h: http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.h@29 PS18, Line 29: class OrcMetadataUtils { docs http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.h@39 PS18, Line 39: class OrcSchemaResolver { can you document the lifetime requirements of all these parameters? It seems all of the pointers passed in must be kept alive while the OrcSchemaResolver is in use. http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.h@44 PS18, Line 44: /// Resolve SchemaPath into orc::Type (ORC column representation) in this case, would 'node' still be returned? http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.h@48 PS18, Line 48: Status ResolveColumn(const SchemaPath& col_path, const orc::Type** node, reference-typed members are against the style guide. I think this should be a const pointer instead http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc File be/src/exec/orc-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@25 PS18, Line 25: void OrcMetadataUtils::BuildSchemaPaths(const orc::Type& root, int num_partition_keys, it's nice that you factored out these utility functions. Would it be possible to write a unit test for them? It seems like 'orc::Type' is pretty easy to construct for a test using orc::Type::buildTypeFromString(...): https://github.com/apache/orc/blob/d81131b9d608779f48d7de6a3c11ece03a3bc0a0/c%2B%2B/test/TestType.cc#L280 http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@26 PS18, Line 26: vector<SchemaPath>* paths) { worth a DCHECK(paths->empty()); http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@27 PS18, Line 27: SchemaPath path; : paths->push_back(path); : DCHECK_EQ(root.getKind(), orc::TypeKind::STRUCT); : int num_columns = root.getSubtypeCount(); : for (int i = 0; i < num_columns; ++i) { : path.push_back(i + num_partition_keys); : BuildSchemaPath(*root.getSubtype(i), &path, paths); : path.pop_back(); : } it seems like this code is the same handling as the 'STRUCT' case below in BuildSchemaPath. Could you just implement this as: SchemaPath path; BuildSchemaPath(root, &path, paths); http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@40 PS18, Line 40: DCHECK_EQ(paths->size(), node.getColumnId()); Is this guaranteed that the column IDs in an ORC file are in sequential order with no gaps? http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@65 PS18, Line 65: } do we need an 'else { return Status(...); } here? What if the ORC file has some corruption, we don't want to crash. Also seems this is missing UNION, so even if we can't support UNION initially we should make sure to return a good error back to the user http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@89 PS18, Line 89: table_col_type = &table_col_type->children[table_idx]; can you add a DCHECK_LT(table_idx, table_col_type->children.size()) here so that in debug builds if we got this wrong we'd get an easier crash? http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@100 PS18, Line 100: return Status(Substitute("File '$0' has an incompatible ORC schema for column " would be good in unit tests to make sure these cases are covered, too http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@184 PS18, Line 184: "Type mismatch: table column $0 is map to column $1 in ORC file '$2'", would be good if you could include the string column paths here (both the impala-side and orc-side). I was trying to use the existing ORC support last night and got very confused for a while by "table column bigint is mapped to column int" since I couldn't figure out which column I'd messed up the table definition for. http://gerrit.cloudera.org:8080/#/c/12168/18/testdata/ComplexTypesTbl/README File testdata/ComplexTypesTbl/README: http://gerrit.cloudera.org:8080/#/c/12168/18/testdata/ComplexTypesTbl/README@10 PS18, Line 10: The ORC files can be regenerated by running the following commands in current directory: maybe it's worth providing a simple python script to regenerate the nested orc data, including performing the above transformation of the JSON files? http://gerrit.cloudera.org:8080/#/c/12168/18/testdata/bin/load_nested.py File testdata/bin/load_nested.py: http://gerrit.cloudera.org:8080/#/c/12168/18/testdata/bin/load_nested.py@358 PS18, Line 358: parser.add_argument("-f", "--table-format", default="parquet/none") # can be "orc/def" instead of the 'can be orc/def' comment here, maybe include that as a help=... kwarg, so that users of the script can see it? http://gerrit.cloudera.org:8080/#/c/12168/18/tests/query_test/test_nested_types.py File tests/query_test/test_nested_types.py: http://gerrit.cloudera.org:8080/#/c/12168/18/tests/query_test/test_nested_types.py@613 PS18, Line 613: elif file_format == 'orc': worth an 'else' raise exception -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 25 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Fri, 08 Mar 2019 05:56:51 +0000 Gerrit-HasComments: Yes
