Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/9134 )
Change subject: IMPALA-5717: Support for reading ORC data files
......................................................................

Patch Set 5:

(36 comments)

Updates: I'm currently working on refactoring the patch. Things that aren't finished yet:
* add test coverage for type mismatches
* refactor the nested types tests
* shrink the size of the ORC test files
* refactor the RuntimeEval code
Hopefully I'll be able to upload a new patch tomorrow.

http://gerrit.cloudera.org:8080/#/c/9134/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9134/5//COMMIT_MSG@22
PS5, Line 22: - Have passed all the tests.
> We're also missing test coverage to CHAR and VARCHAR. That looks like it wa
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.h@91
PS5, Line 91: virtual Status Open(ScannerContext *context) WARN_UNUSED_RESULT;
> Can we add override to the functions where it fits?
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.h@141
PS5, Line 141: map
> Could this be unordered_map? We generally prefer that that unless there's a
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.h@177
PS5, Line 177: inline
> My intuition is that many of these "inline" specifiers will not be useful,
Done

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@517
PS3, Line 517:
> Thanks for filing it. I did spent a little time reading the ORC code and it
Yeah, I just heard from Alibaba that they implemented an async InputStream to achieve this in their own branch. It's a pity that the ORC community has been slow to respond, so they haven't planned to contribute it back yet. I think it's not hard to implement such an InputStream; I can give it a try after this patch.
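(The async InputStream discussed above doesn't exist in the ORC library; as a rough sketch of the idea only, with all names hypothetical and not the actual ORC C++ API, a reader can overlap fetching the next byte range with processing of the current one using std::future:)

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <future>
#include <vector>

// Hypothetical minimal stand-in for an ORC-style positional input stream.
struct ByteRangeReader {
  std::vector<char> file;  // in-memory "file" contents
  // Synchronous positional read, analogous to orc::InputStream::read().
  void Read(void* buf, uint64_t length, uint64_t offset) {
    std::memcpy(buf, file.data() + offset, length);
  }
};

// Wrapper that prefetches the next chunk while the caller consumes the
// current one, so I/O overlaps with decoding. Illustrative only; a real
// implementation would also handle errors and cancellation.
class PrefetchingReader {
 public:
  PrefetchingReader(ByteRangeReader* base, uint64_t chunk_size)
      : base_(base), chunk_size_(chunk_size) {}

  // Returns the next chunk and kicks off an async read for the one after it.
  // The caller must stop calling once all bytes have been consumed.
  std::vector<char> NextChunk() {
    if (!started_) {
      pending_ = ReadAsync(offset_);
      started_ = true;
    }
    std::vector<char> current = pending_.get();
    offset_ += current.size();
    if (offset_ < base_->file.size()) pending_ = ReadAsync(offset_);
    return current;
  }

 private:
  std::future<std::vector<char>> ReadAsync(uint64_t offset) {
    uint64_t len = std::min<uint64_t>(chunk_size_, base_->file.size() - offset);
    return std::async(std::launch::async, [this, offset, len] {
      std::vector<char> buf(len);
      base_->Read(buf.data(), len, offset);
      return buf;
    });
  }

  ByteRangeReader* base_;
  uint64_t chunk_size_;
  uint64_t offset_ = 0;
  bool started_ = false;
  std::future<std::vector<char>> pending_;
};
```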
http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@59
PS5, Line 59: Clear
> Let's call this FreeAll() to match the equivalent method on MemPool. MemPoo
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@62
PS5, Line 62: std::
> std:: prefix not needed for free().
This class also has free() and malloc() member functions. To avoid a name conflict we need the "std::" prefix.

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@71
PS5, Line 71: mem_tracker_->Consume(size);
> We should do a TryConsume() here so that we don't go over the memory limit.
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@72
PS5, Line 72: (std::m
> std:: prefix not needed for malloc
We need it for the same reason as above.

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@77
PS5, Line 77: throw std::runtime_error("memory limit exceeded");
> We could do the same status-wrapping thing here.
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@90
PS5, Line 90: if (ImpaladMetrics::MEM_POOL_TOTAL_BYTES != nullptr) {
> Null check shouldn't be needed, right? We don't have it above.
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@182
PS5, Line 182:
> Remove extra blank line. There is a bit too much vertical whitespace (i.e.
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@208
PS5, Line 208: && : reader_->getCompression() != orc::CompressionKind::CompressionKind_NONE
> CompressionKind_NONE is already handled in TranslateCompressionKind, right?
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@229
PS5, Line 229: Encounter
> Encountered
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@230
PS5, Line 230: Encounter
> Encountered.
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@233
PS5, Line 233:
> We can remove most of the blank lines in this function.
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@234
PS5, Line 234: if (reader_->getNumberOfRows() == 0) {
> Nit: can fit if() and return on one line.
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@240
PS5, Line 240: return Status(Substitute("Invalid file: $0. No stripes in this file but numberOfRows"
> "Invalid ORC file"
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@250
PS5, Line 250: case orc::CompressionKind::CompressionKind_NONE:return THdfsCompression::NONE;
> Nit: missing space
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@259
PS5, Line 259: VLOG_QUERY << "ORC files in ZSTD compression are counted as DEFAULT in profile";
> We should return the error in a Status if the compression type isn't handle
The ORC library can read ORC files that use ZSTD compression. Here we just miscount ORC/ZSTD as ORC/DEFAULT in the profile, since we don't have a ZSTD value in THdfsCompression::type. I'm not sure whether other code would break if I added ZSTD to the enum just for this, so I left a TODO here.

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@281
PS5, Line 281: nullptr
> NULL, not nullptr, since this is just a null indicator bit.
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@291
PS5, Line 291: col_id_slot_map_[reader_->getType().getSubtype(col_idx_in_file)->getColumnId()] = slot_desc;
> nit: long line
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@292
PS5, Line 292: const ColumnType &col_type = scan_node_->hdfs_table()->col_descs()[col_idx].type();
> We should validate that the ORC and Impala columns types are compatible at
Good point! Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@306
PS5, Line 306: std::m
> nit: std:: should not be necessary since this is imported by names.h
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@388
PS5, Line 388:
> unnecessary blank line
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@489
PS5, Line 489: : VLOG_QUERY << "Encounter parse error: " << e.what() : << " which is due to cancellation";
> We don't need to log anything if we're cancelling normally. Why do we need
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@552
PS5, Line 552: VLOG_FILE
> Logging once per batch in VLOG_FILE seems too verbose. Remove or move to VL
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@590
PS5, Line 590: ColumnVectorBatch
> I think the argument to this function should be a StructVectorBatch instead
To change this, we would have to change the type of scratch_batch_ to unique_ptr<orc::StructVectorBatch>. However, orc::RowReader::createRowBatch returns a unique_ptr<orc::ColumnVectorBatch>, and converting between the two unique_ptr types is not elegant. Let's allow this to keep the caller simple.

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@598
PS5, Line 598: if (slot_desc == nullptr) {
> Can this be a DCHECK? Isn't this a bug in our code if the slot doesn't exis
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@645
PS5, Line 645: case TYPE_FLOAT:*(reinterpret_cast<float*>(slot_val_ptr)) = val;
> needs space after :
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@647
PS5, Line 647: case TYPE_DOUBLE:*(reinterpret_cast<double*>(slot_val_ptr)) = val;
> needs space after :
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@657
PS5, Line 657: case orc::TypeKind::CHAR: {
> I don't think this handles CHAR(N) or VARCHAR(N) types in Impala correctly.
Done. Fixed this and added coverage in test_chars.py.

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@670
PS5, Line 670: return scan_node_->mem_tracker()->MemLimitExceeded(state_, details, val_ptr->len);
> long line
Done

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@693
PS5, Line 693: // TODO warn if slot_desc->type().GetByteSize() != 16
> I think we should validate the relationship between Impala and ORC types wh
Sure, truncating a decimal makes no sense.
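(The up-front type validation agreed on above, done once when the file is opened rather than per value, can be sketched as follows; the enums and function names below are hypothetical stand-ins, not Impala's or ORC's actual definitions:)

```cpp
// Illustrative stand-ins for the two type systems. The real enums
// (orc::TypeKind and Impala's column types) have more members; the
// names here only sketch the idea.
enum class OrcKind { BOOLEAN, BYTE, SHORT, INT, LONG, FLOAT, DOUBLE, STRING, DECIMAL };
enum class ImpalaKind { BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, STRING, DECIMAL };

// Validate once, at open time, that an ORC column can be materialized
// into its Impala slot. Per-row code can then assume compatibility, and
// a mismatch (e.g. a DECIMAL that would otherwise be truncated) becomes
// a clear error up front instead of silent data corruption.
bool AreTypesCompatible(OrcKind orc_type, ImpalaKind slot_type) {
  switch (orc_type) {
    case OrcKind::BOOLEAN: return slot_type == ImpalaKind::BOOLEAN;
    case OrcKind::BYTE:    return slot_type == ImpalaKind::TINYINT;
    case OrcKind::SHORT:   return slot_type == ImpalaKind::SMALLINT;
    case OrcKind::INT:     return slot_type == ImpalaKind::INT;
    case OrcKind::LONG:    return slot_type == ImpalaKind::BIGINT;
    case OrcKind::FLOAT:   return slot_type == ImpalaKind::FLOAT;
    case OrcKind::DOUBLE:  return slot_type == ImpalaKind::DOUBLE;
    case OrcKind::STRING:  return slot_type == ImpalaKind::STRING;
    case OrcKind::DECIMAL: return slot_type == ImpalaKind::DECIMAL;
  }
  return false;
}
```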
http://gerrit.cloudera.org:8080/#/c/9134/5/testdata/bin/run-hive-server.sh
File testdata/bin/run-hive-server.sh:

http://gerrit.cloudera.org:8080/#/c/9134/5/testdata/bin/run-hive-server.sh@75
PS5, Line 75: tabls
> typo: tables
Done

http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/workloads/functional-query/functional-query_exhaustive.csv
File testdata/workloads/functional-query/functional-query_exhaustive.csv:

http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/workloads/functional-query/functional-query_exhaustive.csv@25
PS3, Line 25: file_format: orc, dataset: functional, compression_codec: none, compression_type: none
> We should the name of compression_codec to be deflate for our ORC tables so
Done


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Tue, 20 Mar 2018 13:23:18 +0000
Gerrit-HasComments: Yes