Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
......................................................................


Patch Set 5:

(36 comments)

Updates: I'm currently refactoring the patch. Remaining items:
* add test coverage for type mismatches
* refactor the nested types tests
* shrink the ORC test files
* refactor the RuntimeEval code

I hope to upload a new patch set tomorrow.

http://gerrit.cloudera.org:8080/#/c/9134/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9134/5//COMMIT_MSG@22
PS5, Line 22:  - Have passed all the tests.
> We're also missing test coverage to CHAR and VARCHAR. That looks like it wa
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.h@91
PS5, Line 91:   virtual Status Open(ScannerContext *context) WARN_UNUSED_RESULT;
> Can we add override to the functions where it fits?
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.h@141
PS5, Line 141: map
> Could this be unordered_map? We generally prefer that that unless there's a
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.h@177
PS5, Line 177: inline
> My intuition is that many of these "inline" specifiers will not be useful,
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@517
PS3, Line 517:
> Thanks for filing it. I did spent a little time reading the ORC code and it
Yeah, I just heard from Alibaba that they implemented an async InputStream to 
achieve this in their own branch. It's a pity that the ORC community has been 
slow to respond, so they haven't planned to contribute it back yet.
I think it's not hard to implement such an InputStream. I can give it a try 
after this patch.
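For context, the interface such a stream needs to implement is small. A rough 
skeleton (method list recalled from orc/OrcFile.hh and may not be exact; the 
prefetching hook is illustrative only, not Impala code):

  #include <cstdint>
  #include <string>
  #include "orc/OrcFile.hh"

  // Illustrative only: an InputStream that could overlap I/O with decoding by
  // issuing reads ahead of the ORC reader's requests.
  class PrefetchingInputStream : public orc::InputStream {
   public:
    uint64_t getLength() const override { return file_length_; }
    uint64_t getNaturalReadSize() const override { return 8 * 1024 * 1024; }
    void read(void* buf, uint64_t length, uint64_t offset) override {
      // Wait for (or issue) the asynchronous read covering
      // [offset, offset + length) and copy the bytes into 'buf'.
    }
    const std::string& getName() const override { return name_; }

   private:
    uint64_t file_length_ = 0;
    std::string name_;
  };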


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@59
PS5, Line 59: Clear
> Let's call this FreeAll() to match the equivalent method on MemPool. MemPoo
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@62
PS5, Line 62:  std::
> std:: prefix not needed for free().
This class has its own free and malloc member functions. To avoid a name 
conflict we need the "std::" prefix.
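To illustrate the lookup issue (hypothetical class name, not the actual patch 
code):

  #include <cstdint>
  #include <cstdlib>

  class OrcMemPool {  // hypothetical name for illustration
   public:
    char* malloc(uint64_t size) {
      // An unqualified malloc(size) here would resolve to this member and
      // recurse, so the std:: qualification is required.
      return reinterpret_cast<char*>(std::malloc(size));
    }
    void free(char* buf) { std::free(buf); }  // same reasoning for free()
  };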


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@71
PS5, Line 71:   mem_tracker_->Consume(size);
> We should do a TryConsume() here so that we don't go over the memory limit.
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@72
PS5, Line 72: (std::m
> std:: prefix not needed for malloc
We need this for the same reason as above.


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@77
PS5, Line 77:     throw std::runtime_error("memory limit exceeded");
> We could do the same status-wrapping thing here.
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@90
PS5, Line 90:   if (ImpaladMetrics::MEM_POOL_TOTAL_BYTES != nullptr) {
> Null check shouldn't be needed, right? We don't have it above.
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@182
PS5, Line 182:
> Remove extra blank line. There is a bit too much vertical whitespace (i.e.
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@208
PS5, Line 208:  &&
             :       reader_->getCompression() != 
orc::CompressionKind::CompressionKind_NONE
> CompressionKind_NONE is already handled in TranslateCompressionKind, right?
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@229
PS5, Line 229: Encounter
> Encountered
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@230
PS5, Line 230: Encounter
> Encountered.
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@233
PS5, Line 233:
> We can remove most of the blank lines in this function.
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@234
PS5, Line 234:   if (reader_->getNumberOfRows() == 0) {
> Nit: can fit if() and return on one line.
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@240
PS5, Line 240:     return Status(Substitute("Invalid file: $0. No stripes in 
this file but numberOfRows"
> "Invalid ORC file"
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@250
PS5, Line 250:     case orc::CompressionKind::CompressionKind_NONE:return 
THdfsCompression::NONE;
> Nit: missing space
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@259
PS5, Line 259:       VLOG_QUERY << "ORC files in ZSTD compression are counted 
as DEFAULT in profile";
> We should return the error in a Status if the compression type isn't handle
The ORC library can read ORC files that use ZSTD compression. Here we just 
miscount ORC/ZSTD as ORC/DEFAULT in the profile, since there's no ZSTD value in 
THdfsCompression::type. I'm not sure whether other code would break if I added 
ZSTD to the enum just for this, so I left a TODO here.
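Roughly, the mapping looks like this (a paraphrased fragment, assuming the ORC 
headers and the generated Thrift types are in scope; the exact code in the 
patch may differ):

  THdfsCompression::type TranslateCompressionKind(orc::CompressionKind kind) {
    switch (kind) {
      case orc::CompressionKind::CompressionKind_NONE: return THdfsCompression::NONE;
      case orc::CompressionKind::CompressionKind_ZLIB: return THdfsCompression::ZLIB;
      case orc::CompressionKind::CompressionKind_SNAPPY: return THdfsCompression::SNAPPY;
      case orc::CompressionKind::CompressionKind_LZO: return THdfsCompression::LZO;
      case orc::CompressionKind::CompressionKind_LZ4: return THdfsCompression::LZ4;
      case orc::CompressionKind::CompressionKind_ZSTD:
        // TODO: add ZSTD to THdfsCompression; until then it is counted as DEFAULT.
        VLOG_QUERY << "ORC files in ZSTD compression are counted as DEFAULT in profile";
        return THdfsCompression::DEFAULT;
      default: return THdfsCompression::DEFAULT;
    }
  }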


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@281
PS5, Line 281: nullptr
> NULL, not nullptr, since this is just a null indicator bit.
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@291
PS5, Line 291:     
col_id_slot_map_[reader_->getType().getSubtype(col_idx_in_file)->getColumnId()] 
= slot_desc;
> nit: long line
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@292
PS5, Line 292:     const ColumnType &col_type = 
scan_node_->hdfs_table()->col_descs()[col_idx].type();
> We should validate that the ORC and Impala columns types are compatible at
Good point! Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@306
PS5, Line 306: std::m
> nit: std:: should not be necessary since this is imported by names.h
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@388
PS5, Line 388:
> unnecessary blank line
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@489
PS5, Line 489:
             :           VLOG_QUERY << "Encounter parse error: " << e.what()
             :                      << " which is due to cancellation";
> We don't need to log anything if we're cancelling normally. Why do we need
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@552
PS5, Line 552: VLOG_FILE
> Logging once per batch in VLOG_FILE seems too verbose. Remove or move to VL
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@590
PS5, Line 590: ColumnVectorBatch
> I think the argument to this function should be a StructVectorBatch instead
To change this, we'd have to change the type of scratch_batch_ to 
unique_ptr<orc::StructVectorBatch>. However, orc::RowReader::createRowBatch 
returns unique_ptr<orc::ColumnVectorBatch>, and converting a 
unique_ptr<orc::ColumnVectorBatch> to a unique_ptr<orc::StructVectorBatch> 
isn't elegant. Let's allow this to keep the caller simple.
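Roughly, the current approach amounts to this (variable names and the batch 
size are placeholders, not necessarily the patch's exact code): keep the batch 
as the type createRowBatch() returns and cast where the struct view is needed.

  std::unique_ptr<orc::ColumnVectorBatch> scratch_batch_ =
      row_reader_->createRowBatch(1024);
  // The root type of an ORC file is a struct, so this downcast is safe.
  auto* root = static_cast<orc::StructVectorBatch*>(scratch_batch_.get());
  for (orc::ColumnVectorBatch* col : root->fields) {
    // ... materialize the slots mapped to 'col' ...
  }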


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@598
PS5, Line 598:     if (slot_desc == nullptr) {
> Can this be a DCHECK? Isn't this a bug in our code if the slot doesn't exis
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@645
PS5, Line 645:           case 
TYPE_FLOAT:*(reinterpret_cast<float*>(slot_val_ptr)) = val;
> needs space after :
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@647
PS5, Line 647:           case 
TYPE_DOUBLE:*(reinterpret_cast<double*>(slot_val_ptr)) = val;
> needs space after :
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@657
PS5, Line 657:       case orc::TypeKind::CHAR: {
> I don't think this handles CHAR(N) or VARCHAR(N) types in Impala correctly.
Done. Fixed this and added coverage in test_chars.py.


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@670
PS5, Line 670:           return 
scan_node_->mem_tracker()->MemLimitExceeded(state_, details, val_ptr->len);
> long line
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@693
PS5, Line 693:           // TODO warn if slot_desc->type().GetByteSize() != 16
> I think we should validate the relationship between Impala and ORC types wh
Sure, truncating a decimal makes no sense.


http://gerrit.cloudera.org:8080/#/c/9134/5/testdata/bin/run-hive-server.sh
File testdata/bin/run-hive-server.sh:

http://gerrit.cloudera.org:8080/#/c/9134/5/testdata/bin/run-hive-server.sh@75
PS5, Line 75: tabls
> typo: tables
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/workloads/functional-query/functional-query_exhaustive.csv
File testdata/workloads/functional-query/functional-query_exhaustive.csv:

http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/workloads/functional-query/functional-query_exhaustive.csv@25
PS3, Line 25: file_format: orc, dataset: functional, compression_codec: none, 
compression_type: none
> We should the name of compression_codec to be deflate for our ORC tables so
Done



--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Tue, 20 Mar 2018 13:23:18 +0000
Gerrit-HasComments: Yes
