Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/9134 )
Change subject: IMPALA-5717: Support for ORC data files ...................................................................... Patch Set 3: (24 comments) Thanks, Tim! Your comments are really useful! If we finally decide to move the ORC library into native-tool-chain project, is there any document about how to contribute to this? I think I may need the ORC library merged first than I can use it like other tools. There're still comments I haven't deal with. Forgive me to reply them later. http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.h File be/src/exec/hdfs-orc-scanner.h: http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.h@28 PS3, Line 28: class CollectionValueBuilder; > Not needed? Yeah, just added it when I try to support complex types. Will remove it. http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.h@259 PS3, Line 259: ProcessFileTail > It might be helpful to define what the "footer", "file tail" and "postscrip I was confused as well at first :) They're concepts in ORC. Here is their definitions: https://orc.apache.org/docs/file-tail.html You can also find them in be/src/orc/orc_proto.proto http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@1 PS3, Line 1: // Copyright 2012 Cloudera Inc. > Don't need cloudera copyrights! Done http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@24 PS3, Line 24: #include "common/object-pool.h" > Many of these headers look unused. Yes, will remove them http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@53 PS3, Line 53: using boost::algorithm::split; > Some of these boost "using" declarations don't seem to be needed. Done http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@59 PS3, Line 59: DEFINE_double(orc_min_filter_reject_ratio, 0.1, "(Advanced) If the percentage of " > I don't know why we made this flag option parquet-specific. Having an optio agree with you. There're many logics in the parquet scanner that can share with the ORC scanner. Not only this var, but also functions like IssueInitialRanges and FindFooterSplit. http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@73 PS3, Line 73: for (int i = 0; i < files.size(); ++i) { > Can we convert this to a range for? We generally prefer that in new code. Done http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@151 PS3, Line 151: mem_tracker_->CloseAndUnregisterFromParent(); > We only want to use CloseAndUnregisterFromParent() for the query-level MemT Just add the comment since I made it crashed when use Close at first... I can remove the comment if you're all clear of it :) http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@157 PS3, Line 157: std:: > Don't need std:: prefix. We generally prefer avoiding it when it isn't need We need this prefix because this class has a free function (see below) as well. http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@162 PS3, Line 162: if (ImpaladMetrics::MEM_POOL_TOTAL_BYTES != nullptr) { > You should be able to assume it's non-NULL. A lot of older code checks if m So the non-NULL checks in mem-pool.cc are redundant too? I learn this from the impala::MemPool implememtation. I found this metric useful when I ran test_failpoints.py individually. It won't come back to zero in 2 minutes. So I found bug mentioned in IMPALA-6423 http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@196 PS3, Line 196: void HdfsOrcScanner::ScanRangeInputStream::read(void* buf, uint64_t length, > It's unfortunate that the ORC code was designed to issue only synchronous r yes, quite a pity. So I hope we can include the ORC codes and modify the logics of getting input. http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@223 PS3, Line 223: memset(buf, 0, length); > Is the memset needed? If so, should document why in a comment. Just let the orc-reader to throw an exception for parse error if it read this later. Not needed actually since we throw the following exception immediately. http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@224 PS3, Line 224: throw std::runtime_error("Cannot read from file: " + status.GetDetail()); > The fact we need to use exceptions is unfortunate. I need to think a bit mo I don't like exceptions as well. One solution is to insert the stream context into the orc-reader, and check if cancel in the loops inside it. This need to modify codes in the ORC-reader so I haven't started yet. If you guys decide to include the ORC codes, I can implement this. http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@230 PS3, Line 230: stripe_idx_(-1), > For the constants here, we should initialise them in the class definition ( Done http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@257 PS3, Line 257: num_cols_counter_ = > What do you think about adding Orc to some of the counter names? At least t Good idea! That would be more clear. http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@259 PS3, Line 259: num_stats_filtered_stripes_counter_ = > This isn't used - let's remove it until we add the stats filtering feature. Sure http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@333 PS3, Line 333: if (col_type.type == TYPE_ARRAY) { > Are there any tests to make sure the ORC reader won't crash when scanning c Complex types will be skipped here. Their column ids won't be set into the ReaderOptions. However, I'll add a test for this. Sure, will create more JIRAs. http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@677 PS3, Line 677: vector<uint8_t> decompressed_footer_buffer; > I'm concerned that this buffer memory isn't tracked and could be arbitraril You're right! Will review these logics deeper. http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@679 PS3, Line 679: isCompressed > is_compressed done http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@884 PS3, Line 884: dynamic_cast > Can this be a static_cast? AFAICT dynamic_cast is unnecessary in most place Sure! Excited to know that you start to test the scanner's performance! http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@913 PS3, Line 913: case TYPE_TINYINT:*(reinterpret_cast<int8_t*>(slot_val_ptr)) = val; > It might be more readable to put the case body on a separate line. Done http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@954 PS3, Line 954: dst_batch->tuple_data_pool()->Allocate(val_ptr->len)); > Should use TryAllocate() here and return an error on failure to allocate. A Done http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@1038 PS3, Line 1038: NULL > Prefer nullptr in new code (I know NULL is all over the existing code :)). Sure, forgot this one http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/bin/run-hive-server.sh File testdata/bin/run-hive-server.sh: http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/bin/run-hive-server.sh@75 PS3, Line 75: HADOOP_HEAPSIZE="2048" hive --service hiveserver2 > ${LOGDIR}/hive-server2.out 2>&1 & > Can you comment how 2048 was chosen for future reference? I increase it to avoid OOM when loading tables like widerow. 1024 is not enough though. -- To view, visit http://gerrit.cloudera.org:8080/9134 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4 Gerrit-Change-Number: 9134 Gerrit-PatchSet: 3 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Fri, 02 Feb 2018 15:21:34 +0000 Gerrit-HasComments: Yes
