Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )
Change subject: IMPALA-6503: Support reading complex types from ORC format files ...................................................................... Patch Set 9: (22 comments) Thanks for your feedback! I've added more comments and refactor some codes as required. http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG@25 PS8, Line 25: Don’t like > nit: 'They are not like', or 'They differ from' Done http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h File be/src/exec/hdfs-orc-scanner.h: http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h@175 PS9, Line 175: std::list<uint64_t> selected_type_ids_; > Mention that RowReaderOptions.includeTypes() expects a list (otherwise it's Yeah, I have no ideas about this too. http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h File be/src/exec/hdfs-orc-scanner.h: http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h@160 PS8, Line 160: orc::RowReaderOptions row_reader_options; > nit: put '_' at the end of the variable name. Oops! Missed this in the previous patch. Done. http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@181 PS8, Line 181: base > nit: based Done http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@329 PS8, Line 329: *template_tuple = : Tuple::Create(tuple_desc.byte_size(), template_tuple_pool_.get()); > nit: put curly braces around it Done http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@364 PS8, Line 364: if (id >= node.getColumnId() && id <= node.getMaximumColumnId()) return true; > nit: put curly braces around it Done http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@681 PS8, Line 681: RETURN_IF_ERROR(AssembleCollection(*child_reader, child_batch_offset + i, : coll_value_builder)); > nit: add curly braces around it Done http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@301 PS9, Line 301: VLOG_QUERY << "Add ORC column " << node->getMaximumColumnId() << " for empty tuple " > This logging will be too verbose for most purposes - I'd suggest making it This is quite helpful in debug. I think VLOG_FILE is more suiteable since this function will only be called once for each file. http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363 PS9, Line 363: for (uint64_t id : selected_type_ids) > nit: braces for multi-line if I'm quite confused about the definition of "multi-line". The if statement just occupies one line. Isn't it a single-line statement? Zoltan also comments that the for loop needs a curly brace. Aren't braces optional for single-line statement loops? http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@395 PS9, Line 395: selected_type_ids_.size() > It doesn't really matter here, but with gcc4.9.2 list::size() is actually O Sure http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@680 PS9, Line 680: for (int i = 0; i < total_tuples; ++i) > Use braces for multi-line if Done http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h File be/src/exec/orc-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@19 PS9, Line 19: #ifndef IMPALA_EXEC_ORC_COLUMN_READERS_H > We started using "#pragma once" instead of the traditional include guards i Sure. Done http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@61 PS9, Line 61: virtual void UpdateInputBatch(orc::ColumnVectorBatch* orc_batch) = 0; > Maybe mention the relationship with ReadValue() - do you need to call it be Sure. Added class comments. Feel free to polish them since they're not written by a native speaker :) http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h File be/src/exec/orc-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@101 PS8, Line 101: batch_ = dynamic_cast<orc::LongVectorBatch*>(orc_batch); > We avoid doing dynamic casts in release mode. Done http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@126 PS8, Line 126: int64_t val = batch_->data.data()[row_idx]; > Maybe you could add a DCHECK that batch_ is not null. Sure. Add DCHECK_NOTNULL at the first use of batch_. Done http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@190 PS8, Line 190: *slot = TimestampValue::FromUnixTimeNanos(secs, nanos, ORC timestamps are not timezone-aware. Timestamps read/write by the ORC lib are assumed in GMT timezone. We used to have a long debate about this in https://github.com/apache/orc/pull/233 when I encountered a bug of the writter (ORC-320, ORC-322). The short conclusion is updated in the ORC web site (https://orc.apache.org/docs/core-cpp.html): > TimestampVectorBatch handles timestamp values. The data is represented as two > buffers of int64_t for seconds and nanoseconds respectively. Note that we > always assume data is in GMT timezone; therefore it is user’s responsibility > to convert wall clock time from local timezone to GMT. http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@326 PS8, Line 326: children_[c]->UpdateInputBatch(batch_->fields[children_fields_[c]]); > nit: add curly brackets around it since it is a multi-line for stmt. Done http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@331 PS8, Line 331: RETURN_IF_ERROR(child->ReadValue(row_idx_, tuple, pool)); > nit: add curly brackets around it Done http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@395 PS8, Line 395: } else CreateChildForSlot(node, slot_desc); > nit: it's against the Google style guide to have braces on only one branch Done http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.cc@330 PS9, Line 330: for (OrcColumnReader* child : children_) > Need braces around multi-line for loop - here and elsewhere Done http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.cc@331 PS9, Line 331: RETURN_IF_ERROR(child->ReadValue(row_idx_, tuple, pool)); > This probably performs similarly to the previous code with the switch, but I think the ORC lib already materializes column values in batch, which corresponds to the generation of scratch_batch_ in the parquet scanner. Is it optimised enough? http://gerrit.cloudera.org:8080/#/c/12168/9/tests/query_test/test_nested_types.py File tests/query_test/test_nested_types.py: http://gerrit.cloudera.org:8080/#/c/12168/9/tests/query_test/test_nested_types.py@128 PS9, Line 128: # TODO(IMPALA-6505): support predicates push down on ORC stripe statistics > Maybe move this to a string argument to skip() Done -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 9 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Mon, 28 Jan 2019 02:22:47 +0000 Gerrit-HasComments: Yes
