[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Quanlong Huang (Code Review) Fri, 08 Mar 2019 07:24:08 -0800

Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )


Change subject: IMPALA-6503: Support reading complex types from ORC
......................................................................


Patch Set 25:

(15 comments)

> Patch Set 25:
>
> (33 comments)
>
> Ah, I missed that this was almost committed. I had started reviewing but 
> hadn't finished so hadn't posted the comments yet. I'll post my comments here 
> for now.

Todd, thanks for your helpful comments! I should notify you before I run GVO... 
What if I can get this several hours earlier!

> Quanlong, do you mind if I build a patch on this to take care of my suggested 
> cleanups?

I feel like we still need to polish the codes (especially some comments) to 
improve readablity. Most of the reasons and examples are given in the Google 
Docs and slides 
(https://docs.google.com/presentation/d/1uj8m7y69o47MhpqCc0SJ03GDTtPDrg4m04eAFVmq34A).
 We may need to write them down in comments.

It'll be great if Cloudera folks can join the development of the ORC support. 
Todd, feel free to create a follow-up JIRA for this. I'll reply more comments 
in details later.

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@182
PS18, Line 182: tch batch updated
> I noticed that the ORC docs are not very consistent on terminology here. Th
Yes. "Type" and "node" in the ORC relative codes all means "column".


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@235
PS18, Line 235:   /// Advances 'stripe_idx_' to the next non-empty stripe and 
initializes
> can you update the doc to explain the 'column_reader' parameter here? The d
Sorry for my misleading comments.
The "orc_root_reader_" keeps reference of the "orc_root_batch_". Each child of 
"orc_root_reader_" keeps the reference of the corresponding sub batch of 
"orc_root_batch_" and so on recursively.
What I want to explain is that the "column_reader" will read values of the orc 
batch it refers and then materialize tuples of "dst_batch".


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@243
PS18, Line 243:
> 'selected_indices' doesn't seem to exist here. Should it be "selected nodes
Yes. Fixed in a later version of the patch.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@244
PS18, Line 244:   /// Materialize collection(list/map) tuples belong to the 
'row_idx'-th row of
> I'm having trouble understanding this documentation. Maybe worth giving an
Yes. There're some examples here: 
https://docs.google.com/presentation/d/1uj8m7y69o47MhpqCc0SJ03GDTtPDrg4m04eAFVmq34A


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@72
PS18, Line 72:  OrcColumnReader(const orc::Type* node, const SlotDescriptor* 
slot_desc,
             :       HdfsOrcScanner* scanner);
> instead of having both a public constructor and a public factory function a
Yes! We can make it protected.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@79
PS18, Line 79:   virtual bool MaterializeTuple() const { return true; }
> Can we rename this to sound more like a boolean? As is, it sounds like a fu
Sure. Actually, I struggled to think of a name. Maybe better to be 
"shouldMaterializeTuples". This function is mainly used in complex type readers.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@88
PS18, Line 88: orc::ColumnVectorBatch
> can this be const?
I tried to do this before but failed because orc::ColumnVectorBatch::notNull is 
a "orc::DataBuffer<char>". We need to use "notNull[row_idx]" but the operator[] 
of "orc::DataBuffer" is lake of a const version: 
https://github.com/apache/orc/blob/fa0a33e4a7331dd21485bf4ccc0a93edf4b3ae16/c++/include/orc/MemoryPool.hh#L72


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@122
PS18, Line 122: class OrcBoolColumnReader : public OrcColumnReader {
> do we need to define all the subtypes of readers in the .h file or could th
I think they sould be here since HdfsOrcScanner reference them as friend 
classes and hdfs-orc-scanner.h only includes this .h file.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@46
PS18, Line 46:     switch (slot_desc->type().type) {
> I think it's worth a comment explaining why we switched above on node->getK
Yes. There's a reason why we switch on Impala slot type here. Will sumarize it 
later.
We've verify the compatibility before we create the column readers. The 
verification is in OrcSchemaResolver::ResolveColumn.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@102
PS18, Line 102:         DCHECK(false) << slot_desc->type().DebugString();
> I think we should be more defensive here and actually return a bad Status i
Since we swith on Impala slot type, we won't hit UNION here. But we really need 
to consider DATE type here after we merge https://gerrit.cloudera.org/c/12481/


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@112
PS18, Line 112:   if (node->getKind() == orc::TypeKind::STRUCT) {
> I'm a bit confused. Isn't the top-level reader always going to be a STRUCT?
No. "top-level reader" is not the root reader (orc_root_reader_ in 
hdfs-orc-scanner). I struggled to name it too. Maybe you can find me a better 
name :)
The column reader that materializes the table-level (top-level) tuple is a 
"top-level reader". All the ancestors of this reader are also defined as 
"top-level readers".
Note that the table-level tuple is not neccessarily mapped to a row of the 
table. It depends on the TupleDescriptors (so depends on the query). My slides 
have some examples about this.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc
File be/src/exec/orc-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@25
PS18, Line 25: void OrcMetadataUtils::BuildSchemaPaths(const orc::Type& root, 
int num_partition_keys,
> it's nice that you factored out these utility functions. Would it be possib
Yes. Good point!


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@40
PS18, Line 40:   DCHECK_EQ(paths->size(), node.getColumnId());
> Is this guaranteed that the column IDs in an ORC file are in sequential ord
Yes. The ORC column IDs are the pre-order traversal ids of the schema tree.


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@65
PS18, Line 65:   }
> do we need an 'else { return Status(...); } here? What if the ORC file has
Ah, yes! We should take care of this!


http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@184
PS18, Line 184:       "Type mismatch: table column $0 is map to column $1 in 
ORC file '$2'",
> would be good if you could include the string column paths here (both the i
Sure!



--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 25
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Fri, 08 Mar 2019 15:23:43 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC

Reply via email to