[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Todd Lipcon (Code Review) Thu, 07 Mar 2019 21:57:29 -0800

Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )


Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 25:

(33 comments)

Ah, I missed that this was almost committed. I had started reviewing but hadn't 
finished so hadn't posted the comments yet. I'll post my comments here for now. 
Quanlong, do you mind if I build a patch on this to take care of my suggested 
cleanups?

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@29
PS18, Line 29: #include "exec/orc-metadata-utils.h"
nit: do you need to include this here or could you get by with forward 
declarations?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@155
PS18, Line 155:   int64_t stripe_rows_read_ = 0;
nit: nullptr is the default initializer already for unique_ptr


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@174
PS18, Line 174:
nit: "owned by"


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@182
PS18, Line 182: tch batch updated
I noticed that the ORC docs are not very consistent on terminology here. These 
are equivalent to "column" IDs as referenced above, right?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@235
PS18, Line 235:   /// Advances 'stripe_idx_' to the next non-empty stripe and 
initializes
can you update the doc to explain the 'column_reader' parameter here? The docs 
say this transfers tuples from orc_root_batch_, but in fact maybe it is 
transferring from the provided column reader?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@243
PS18, Line 243:
'selected_indices' doesn't seem to exist here. Should it be "selected nodes"?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@244
PS18, Line 244:   /// Materialize collection(list/map) tuples belong to the 
'row_idx'-th row of
I'm having trouble understanding this documentation. Maybe worth giving an 
example here?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@249
PS18, Line 249:
row_reader_options_


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@33
PS18, Line 33: types
nit: typo: "type"


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@41
PS18, Line 41: ///     while ( /* has more rows to read */ ) {
Performance-wise, would it be possible to actually materialize a column at a 
time? Or do you think doing it a tuple at a time will be fast once we later 
introduce codegen?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@57
PS18, Line 57: guaranteed
typo: guarantee


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@72
PS18, Line 72:  OrcColumnReader(const orc::Type* node, const SlotDescriptor* 
slot_desc,
             :       HdfsOrcScanner* scanner);
instead of having both a public constructor and a public factory function 
above, could this be another factory function, with an appropriate name? Or 
could this be made private if it's only meant to be used by the factory 
function itself?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@79
PS18, Line 79:   virtual bool MaterializeTuple() const { return true; }
Can we rename this to sound more like a boolean? As is, it sounds like a 
function which materializes a tuple.

I'm also a little confused on the description -- I would think that as 
described, this function returns whether or not the reader is responsible for 
materializing a specific slot?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@88
PS18, Line 88: orc::ColumnVectorBatch
can this be const?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@122
PS18, Line 122: class OrcBoolColumnReader : public OrcColumnReader {
do we need to define all the subtypes of readers in the .h file or could they 
be moved to the .cc file?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@46
PS18, Line 46:     switch (slot_desc->type().type) {
I think it's worth a comment explaining why we switched above on 
node->getKind() but from here downward we're switching on the Impala slot type. 
Have we already verified the compatibility elsewhere? Or will we verify 
compatibility inside the constructor?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@102
PS18, Line 102:         DCHECK(false) << slot_desc->type().DebugString();
I think we should be more defensive here and actually return a bad Status in 
the case of an unknown type. It seems we're missing UNION and DATE at least 
here. We don't want to crash in the case that someone passes an ORC file with 
such a bad type.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@112
PS18, Line 112:   if (node->getKind() == orc::TypeKind::STRUCT) {
I'm a bit confused. Isn't the top-level reader always going to be a STRUCT?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.h
File be/src/exec/orc-metadata-utils.h:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.h@29
PS18, Line 29: class OrcMetadataUtils {
docs


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.h@39
PS18, Line 39: class OrcSchemaResolver {
can you document the lifetime requirements of all these parameters? It seems 
all of the pointers passed in must be kept alive while the OrcSchemaResolver is 
in use.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.h@44
PS18, Line 44:   /// Resolve SchemaPath into orc::Type (ORC column 
representation)
in this case, would 'node' still be returned?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.h@48
PS18, Line 48:   Status ResolveColumn(const SchemaPath& col_path, const 
orc::Type** node,
reference-typed members are against the style guide. I think this should be a 
const pointer instead


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc
File be/src/exec/orc-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@25
PS18, Line 25: void OrcMetadataUtils::BuildSchemaPaths(const orc::Type& root, 
int num_partition_keys,
it's nice that you factored out these utility functions. Would it be possible 
to write a unit test for them?  It seems like 'orc::Type' is pretty easy to 
construct for a test using orc::Type::buildTypeFromString(...): 
https://github.com/apache/orc/blob/d81131b9d608779f48d7de6a3c11ece03a3bc0a0/c%2B%2B/test/TestType.cc#L280


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@26
PS18, Line 26:     vector<SchemaPath>* paths) {
worth a DCHECK(paths->empty());


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@27
PS18, Line 27:   SchemaPath path;
             :   paths->push_back(path);
             :   DCHECK_EQ(root.getKind(), orc::TypeKind::STRUCT);
             :   int num_columns = root.getSubtypeCount();
             :   for (int i = 0; i < num_columns; ++i) {
             :     path.push_back(i + num_partition_keys);
             :     BuildSchemaPath(*root.getSubtype(i), &path, paths);
             :     path.pop_back();
             :   }
it seems like this code is the same handling as the 'STRUCT' case below in 
BuildSchemaPath. Could you just implement this as:

SchemaPath path;
BuildSchemaPath(root, &path, paths);


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@40
PS18, Line 40:   DCHECK_EQ(paths->size(), node.getColumnId());
Is this guaranteed that the column IDs in an ORC file are in sequential order 
with no gaps?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@65
PS18, Line 65:   }
do we need an 'else { return Status(...); } here? What if the ORC file has some 
corruption, we don't want to crash. Also seems this is missing UNION, so even 
if we can't support UNION initially we should make sure to return a good error 
back to the user


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@89
PS18, Line 89:       table_col_type = &table_col_type->children[table_idx];
can you add a DCHECK_LT(table_idx, table_col_type->children.size()) here so 
that in debug builds if we got this wrong we'd get an easier crash?


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@100
PS18, Line 100:         return Status(Substitute("File '$0' has an incompatible 
ORC schema for column "
would be good in unit tests to make sure these cases are covered, too


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@184
PS18, Line 184:       "Type mismatch: table column $0 is map to column $1 in 
ORC file '$2'",
would be good if you could include the string column paths here (both the 
impala-side and orc-side). I was trying to use the existing ORC support last 
night and got very confused for a while by "table column bigint is mapped to 
column int" since I couldn't figure out which column I'd messed up the table 
definition for.


http://gerrit.cloudera.org:8080/#/c/12168/18/testdata/ComplexTypesTbl/README
File testdata/ComplexTypesTbl/README:

http://gerrit.cloudera.org:8080/#/c/12168/18/testdata/ComplexTypesTbl/README@10
PS18, Line 10: The ORC files can be regenerated by running the following 
commands in current directory:
maybe it's worth providing a simple python script to regenerate the nested orc 
data, including performing the above transformation of the JSON files?


http://gerrit.cloudera.org:8080/#/c/12168/18/testdata/bin/load_nested.py
File testdata/bin/load_nested.py:

http://gerrit.cloudera.org:8080/#/c/12168/18/testdata/bin/load_nested.py@358
PS18, Line 358:   parser.add_argument("-f", "--table-format", 
default="parquet/none")  # can be "orc/def"
instead of the 'can be orc/def' comment here, maybe include that as a help=... 
kwarg, so that users of the script can see it?


http://gerrit.cloudera.org:8080/#/c/12168/18/tests/query_test/test_nested_types.py
File tests/query_test/test_nested_types.py:

http://gerrit.cloudera.org:8080/#/c/12168/18/tests/query_test/test_nested_types.py@613
PS18, Line 613:     elif file_format == 'orc':
worth an 'else' raise exception



--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 25
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Fri, 08 Mar 2019 05:56:51 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Reply via email to