[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Quanlong Huang (Code Review) Sun, 27 Jan 2019 18:23:03 -0800

Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )


Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................


Patch Set 9:

(22 comments)

Thanks for your feedback! I've added more comments and refactor some codes as 
required.

http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG@25
PS8, Line 25: Don’t like
> nit: 'They are not like', or 'They differ from'
Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h@175
PS9, Line 175:   std::list<uint64_t> selected_type_ids_;
> Mention that RowReaderOptions.includeTypes() expects a list (otherwise it's
Yeah, I have no ideas about this too.


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h@160
PS8, Line 160:   orc::RowReaderOptions row_reader_options;
> nit: put '_' at the end of the variable name.
Oops! Missed this in the previous patch. Done.


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@181
PS8, Line 181: base
> nit: based
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@329
PS8, Line 329:         *template_tuple =
             :             Tuple::Create(tuple_desc.byte_size(), 
template_tuple_pool_.get());
> nit: put curly braces around it
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@364
PS8, Line 364:     if (id >= node.getColumnId() && id <= 
node.getMaximumColumnId()) return true;
> nit: put curly braces around it
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@681
PS8, Line 681:       RETURN_IF_ERROR(AssembleCollection(*child_reader, 
child_batch_offset + i,
             :           coll_value_builder));
> nit: add curly braces around it
Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@301
PS9, Line 301:     VLOG_QUERY << "Add ORC column " << 
node->getMaximumColumnId() << " for empty tuple "
> This logging will be too verbose for most purposes - I'd suggest making it
This is quite helpful in debug. I think VLOG_FILE is more suiteable since this 
function will only be called once for each file.


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363
PS9, Line 363:   for (uint64_t id : selected_type_ids)
> nit: braces for multi-line if
I'm quite confused about the definition of "multi-line". The if statement just 
occupies one line. Isn't it a single-line statement? Zoltan also comments that 
the for loop needs a curly brace. Aren't braces optional for single-line 
statement loops?


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@395
PS9, Line 395: selected_type_ids_.size()
> It doesn't really matter here, but with gcc4.9.2 list::size() is actually O
Sure


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@680
PS9, Line 680:     for (int i = 0; i < total_tuples; ++i)
> Use braces for multi-line if
Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@19
PS9, Line 19: #ifndef IMPALA_EXEC_ORC_COLUMN_READERS_H
> We started using "#pragma once" instead of the traditional include guards i
Sure. Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@61
PS9, Line 61:   virtual void UpdateInputBatch(orc::ColumnVectorBatch* 
orc_batch) = 0;
> Maybe mention the relationship with ReadValue() - do you need to call it be
Sure. Added class comments. Feel free to polish them since they're not written 
by a native speaker :)


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@101
PS8, Line 101:     batch_ = dynamic_cast<orc::LongVectorBatch*>(orc_batch);
> We avoid doing dynamic casts in release mode.
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@126
PS8, Line 126:     int64_t val = batch_->data.data()[row_idx];
> Maybe you could add a DCHECK that batch_ is not null.
Sure. Add DCHECK_NOTNULL at the first use of batch_. Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@190
PS8, Line 190:   *slot = TimestampValue::FromUnixTimeNanos(secs, nanos,
ORC timestamps are not timezone-aware. Timestamps read/write by the ORC lib are 
assumed in GMT timezone. We used to have a long debate about this in 
https://github.com/apache/orc/pull/233 when I encountered a bug of the writter 
(ORC-320, ORC-322). The short conclusion is updated in the ORC web site 
(https://orc.apache.org/docs/core-cpp.html):

> TimestampVectorBatch handles timestamp values. The data is represented as two 
> buffers of int64_t for seconds and nanoseconds respectively. Note that we 
> always assume data is in GMT timezone; therefore it is user’s responsibility 
> to convert wall clock time from local timezone to GMT.


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@326
PS8, Line 326:     
children_[c]->UpdateInputBatch(batch_->fields[children_fields_[c]]);
> nit: add curly brackets around it since it is a multi-line for stmt.
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@331
PS8, Line 331:     RETURN_IF_ERROR(child->ReadValue(row_idx_, tuple, pool));
> nit: add curly brackets around it
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@395
PS8, Line 395:   } else CreateChildForSlot(node, slot_desc);
> nit: it's against the Google style guide to have braces on only one branch
Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.cc@330
PS9, Line 330:   for (OrcColumnReader* child : children_)
> Need braces around multi-line for loop - here and elsewhere
Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.cc@331
PS9, Line 331:     RETURN_IF_ERROR(child->ReadValue(row_idx_, tuple, pool));
> This probably performs similarly to the previous code with the switch, but
I think the ORC lib already materializes column values in batch, which 
corresponds to the generation of scratch_batch_ in the parquet scanner. Is it 
optimised enough?


http://gerrit.cloudera.org:8080/#/c/12168/9/tests/query_test/test_nested_types.py
File tests/query_test/test_nested_types.py:

http://gerrit.cloudera.org:8080/#/c/12168/9/tests/query_test/test_nested_types.py@128
PS9, Line 128:     # TODO(IMPALA-6505): support predicates push down on ORC 
stripe statistics
> Maybe move this to a string argument to skip()
Done



--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Mon, 28 Jan 2019 02:22:47 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

Reply via email to