Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )
Change subject: IMPALA-6503: Support reading complex types from ORC ...................................................................... Patch Set 25: (15 comments) > Patch Set 25: > > (33 comments) > > Ah, I missed that this was almost committed. I had started reviewing but > hadn't finished so hadn't posted the comments yet. I'll post my comments here > for now. Todd, thanks for your helpful comments! I should notify you before I run GVO... What if I can get this several hours earlier! > Quanlong, do you mind if I build a patch on this to take care of my suggested > cleanups? I feel like we still need to polish the codes (especially some comments) to improve readablity. Most of the reasons and examples are given in the Google Docs and slides (https://docs.google.com/presentation/d/1uj8m7y69o47MhpqCc0SJ03GDTtPDrg4m04eAFVmq34A). We may need to write them down in comments. It'll be great if Cloudera folks can join the development of the ORC support. Todd, feel free to create a follow-up JIRA for this. I'll reply more comments in details later. http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h File be/src/exec/hdfs-orc-scanner.h: http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@182 PS18, Line 182: tch batch updated > I noticed that the ORC docs are not very consistent on terminology here. Th Yes. "Type" and "node" in the ORC relative codes all means "column". http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@235 PS18, Line 235: /// Advances 'stripe_idx_' to the next non-empty stripe and initializes > can you update the doc to explain the 'column_reader' parameter here? The d Sorry for my misleading comments. The "orc_root_reader_" keeps reference of the "orc_root_batch_". Each child of "orc_root_reader_" keeps the reference of the corresponding sub batch of "orc_root_batch_" and so on recursively. What I want to explain is that the "column_reader" will read values of the orc batch it refers and then materialize tuples of "dst_batch". http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@243 PS18, Line 243: > 'selected_indices' doesn't seem to exist here. Should it be "selected nodes Yes. Fixed in a later version of the patch. http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/hdfs-orc-scanner.h@244 PS18, Line 244: /// Materialize collection(list/map) tuples belong to the 'row_idx'-th row of > I'm having trouble understanding this documentation. Maybe worth giving an Yes. There're some examples here: https://docs.google.com/presentation/d/1uj8m7y69o47MhpqCc0SJ03GDTtPDrg4m04eAFVmq34A http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h File be/src/exec/orc-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@72 PS18, Line 72: OrcColumnReader(const orc::Type* node, const SlotDescriptor* slot_desc, : HdfsOrcScanner* scanner); > instead of having both a public constructor and a public factory function a Yes! We can make it protected. http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@79 PS18, Line 79: virtual bool MaterializeTuple() const { return true; } > Can we rename this to sound more like a boolean? As is, it sounds like a fu Sure. Actually, I struggled to think of a name. Maybe better to be "shouldMaterializeTuples". This function is mainly used in complex type readers. http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@88 PS18, Line 88: orc::ColumnVectorBatch > can this be const? I tried to do this before but failed because orc::ColumnVectorBatch::notNull is a "orc::DataBuffer<char>". We need to use "notNull[row_idx]" but the operator[] of "orc::DataBuffer" is lake of a const version: https://github.com/apache/orc/blob/fa0a33e4a7331dd21485bf4ccc0a93edf4b3ae16/c++/include/orc/MemoryPool.hh#L72 http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.h@122 PS18, Line 122: class OrcBoolColumnReader : public OrcColumnReader { > do we need to define all the subtypes of readers in the .h file or could th I think they sould be here since HdfsOrcScanner reference them as friend classes and hdfs-orc-scanner.h only includes this .h file. http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@46 PS18, Line 46: switch (slot_desc->type().type) { > I think it's worth a comment explaining why we switched above on node->getK Yes. There's a reason why we switch on Impala slot type here. Will sumarize it later. We've verify the compatibility before we create the column readers. The verification is in OrcSchemaResolver::ResolveColumn. http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@102 PS18, Line 102: DCHECK(false) << slot_desc->type().DebugString(); > I think we should be more defensive here and actually return a bad Status i Since we swith on Impala slot type, we won't hit UNION here. But we really need to consider DATE type here after we merge https://gerrit.cloudera.org/c/12481/ http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-column-readers.cc@112 PS18, Line 112: if (node->getKind() == orc::TypeKind::STRUCT) { > I'm a bit confused. Isn't the top-level reader always going to be a STRUCT? No. "top-level reader" is not the root reader (orc_root_reader_ in hdfs-orc-scanner). I struggled to name it too. Maybe you can find me a better name :) The column reader that materializes the table-level (top-level) tuple is a "top-level reader". All the ancestors of this reader are also defined as "top-level readers". Note that the table-level tuple is not neccessarily mapped to a row of the table. It depends on the TupleDescriptors (so depends on the query). My slides have some examples about this. http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc File be/src/exec/orc-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@25 PS18, Line 25: void OrcMetadataUtils::BuildSchemaPaths(const orc::Type& root, int num_partition_keys, > it's nice that you factored out these utility functions. Would it be possib Yes. Good point! http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@40 PS18, Line 40: DCHECK_EQ(paths->size(), node.getColumnId()); > Is this guaranteed that the column IDs in an ORC file are in sequential ord Yes. The ORC column IDs are the pre-order traversal ids of the schema tree. http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@65 PS18, Line 65: } > do we need an 'else { return Status(...); } here? What if the ORC file has Ah, yes! We should take care of this! http://gerrit.cloudera.org:8080/#/c/12168/18/be/src/exec/orc-metadata-utils.cc@184 PS18, Line 184: "Type mismatch: table column $0 is map to column $1 in ORC file '$2'", > would be good if you could include the string column paths here (both the i Sure! -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 25 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Fri, 08 Mar 2019 15:23:43 +0000 Gerrit-HasComments: Yes
