Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for ORC data files

Patch Set 3:


Thanks, Tim! Your comments are really useful!
If we finally decide to move the ORC library into native-tool-chain project, is 
there any document about how to contribute to this? I think I may need the ORC 
library merged first than I can use it like other tools.

There're still comments I haven't deal with. Forgive me to reply them later.

File be/src/exec/hdfs-orc-scanner.h:

PS3, Line 28: class CollectionValueBuilder;
> Not needed?
Yeah, just added it when I try to support complex types. Will remove it.

PS3, Line 259: ProcessFileTail
> It might be helpful to define what the "footer", "file tail" and "postscrip
I was confused as well at first :)
They're concepts in ORC. Here is their definitions: 
You can also find them in be/src/orc/orc_proto.proto

File be/src/exec/hdfs-orc-scanner.cc:

PS3, Line 1: // Copyright 2012 Cloudera Inc.
> Don't need cloudera copyrights!

PS3, Line 24: #include "common/object-pool.h"
> Many of these headers look unused.
Yes, will remove them

PS3, Line 53: using boost::algorithm::split;
> Some of these boost "using" declarations don't seem to be needed.

PS3, Line 59: DEFINE_double(orc_min_filter_reject_ratio, 0.1, "(Advanced) If 
the percentage of "
> I don't know why we made this flag option parquet-specific. Having an optio
agree with you. There're many logics in the parquet scanner that can share with 
the ORC scanner. Not only this var, but also functions like IssueInitialRanges 
and FindFooterSplit.

PS3, Line 73:   for (int i = 0; i < files.size(); ++i) {
> Can we convert this to a range for? We generally prefer that in new code.

PS3, Line 151:   mem_tracker_->CloseAndUnregisterFromParent();
> We only want to use CloseAndUnregisterFromParent() for the query-level MemT
Just add the comment since I made it crashed when use Close at first...
I can remove the comment if you're all clear of it :)

PS3, Line 157: std::
> Don't need std:: prefix. We generally prefer avoiding it when it isn't need
We need this prefix because this class has a free function (see below) as well.

PS3, Line 162:   if (ImpaladMetrics::MEM_POOL_TOTAL_BYTES != nullptr) {
> You should be able to assume it's non-NULL. A lot of older code checks if m
So the non-NULL checks in mem-pool.cc are redundant too? I learn this from the 
impala::MemPool implememtation.

I found this metric useful when I ran test_failpoints.py individually. It won't 
come back to zero in 2 minutes. So I found bug mentioned in IMPALA-6423

PS3, Line 196: void HdfsOrcScanner::ScanRangeInputStream::read(void* buf, 
uint64_t length,
> It's unfortunate that the ORC code was designed to issue only synchronous r
yes, quite a pity. So I hope we can include the ORC codes and modify the logics 
of getting input.

PS3, Line 223:     memset(buf, 0, length);
> Is the memset needed? If so, should document why in a comment.
Just let the orc-reader to throw an exception for parse error if it read this 
later. Not needed actually since we throw the following exception immediately.

PS3, Line 224:     throw std::runtime_error("Cannot read from file: " + 
> The fact we need to use exceptions is unfortunate. I need to think a bit mo
I don't like exceptions as well. One solution is to insert the stream context 
into the orc-reader, and check if cancel in the loops inside it.

This need to modify codes in the ORC-reader so I haven't started yet. If you 
guys decide to include the ORC codes, I can implement this.

PS3, Line 230:       stripe_idx_(-1),
> For the constants here, we should initialise them in the class definition (

PS3, Line 257:   num_cols_counter_ =
> What do you think about adding Orc to some of the counter names? At least t
Good idea! That would be more clear.

PS3, Line 259:   num_stats_filtered_stripes_counter_ =
> This isn't used - let's remove it until we add the stats filtering feature.

PS3, Line 333:     if (col_type.type == TYPE_ARRAY) {
> Are there any tests to make sure the ORC reader won't crash when scanning c
Complex types will be skipped here. Their column ids won't be set into the 
ReaderOptions. However, I'll add a test for this.

Sure, will create more JIRAs.

PS3, Line 677:   vector<uint8_t> decompressed_footer_buffer;
> I'm concerned that this buffer memory isn't tracked and could be arbitraril
You're right! Will review these logics deeper.

PS3, Line 679: isCompressed
> is_compressed

PS3, Line 884: dynamic_cast
> Can this be a static_cast? AFAICT dynamic_cast is unnecessary in most place
Sure! Excited to know that you start to test the scanner's performance!

PS3, Line 913:           case 
TYPE_TINYINT:*(reinterpret_cast<int8_t*>(slot_val_ptr)) = val;
> It might be more readable to put the case body on a separate line.

PS3, Line 954:             
> Should use TryAllocate() here and return an error on failure to allocate. A

PS3, Line 1038: NULL
> Prefer nullptr in new code (I know NULL is all over the existing code :)).
Sure, forgot this one

File testdata/bin/run-hive-server.sh:

PS3, Line 75:   HADOOP_HEAPSIZE="2048" hive --service hiveserver2 > 
${LOGDIR}/hive-server2.out 2>&1 &
> Can you comment how 2048 was chosen for future reference?
I increase it to avoid OOM when loading tables like widerow. 1024 is not enough 

To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Fri, 02 Feb 2018 15:21:34 +0000
Gerrit-HasComments: Yes

Reply via email to