Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for ORC data files
......................................................................


Patch Set 3:

(24 comments)

Thanks, Tim! Your comments are really useful!
If we finally decide to move the ORC library into native-tool-chain project, is 
there any document about how to contribute to this? I think I may need the ORC 
library merged first than I can use it like other tools.

There're still comments I haven't deal with. Forgive me to reply them later.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.h@28
PS3, Line 28: class CollectionValueBuilder;
> Not needed?
Yeah, just added it when I try to support complex types. Will remove it.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.h@259
PS3, Line 259: ProcessFileTail
> It might be helpful to define what the "footer", "file tail" and "postscrip
I was confused as well at first :)
They're concepts in ORC. Here is their definitions: 
https://orc.apache.org/docs/file-tail.html
You can also find them in be/src/orc/orc_proto.proto


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@1
PS3, Line 1: // Copyright 2012 Cloudera Inc.
> Don't need cloudera copyrights!
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@24
PS3, Line 24: #include "common/object-pool.h"
> Many of these headers look unused.
Yes, will remove them


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@53
PS3, Line 53: using boost::algorithm::split;
> Some of these boost "using" declarations don't seem to be needed.
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@59
PS3, Line 59: DEFINE_double(orc_min_filter_reject_ratio, 0.1, "(Advanced) If 
the percentage of "
> I don't know why we made this flag option parquet-specific. Having an optio
agree with you. There're many logics in the parquet scanner that can share with 
the ORC scanner. Not only this var, but also functions like IssueInitialRanges 
and FindFooterSplit.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@73
PS3, Line 73:   for (int i = 0; i < files.size(); ++i) {
> Can we convert this to a range for? We generally prefer that in new code.
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@151
PS3, Line 151:   mem_tracker_->CloseAndUnregisterFromParent();
> We only want to use CloseAndUnregisterFromParent() for the query-level MemT
Just add the comment since I made it crashed when use Close at first...
I can remove the comment if you're all clear of it :)


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@157
PS3, Line 157: std::
> Don't need std:: prefix. We generally prefer avoiding it when it isn't need
We need this prefix because this class has a free function (see below) as well.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@162
PS3, Line 162:   if (ImpaladMetrics::MEM_POOL_TOTAL_BYTES != nullptr) {
> You should be able to assume it's non-NULL. A lot of older code checks if m
So the non-NULL checks in mem-pool.cc are redundant too? I learn this from the 
impala::MemPool implememtation.

I found this metric useful when I ran test_failpoints.py individually. It won't 
come back to zero in 2 minutes. So I found bug mentioned in IMPALA-6423


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@196
PS3, Line 196: void HdfsOrcScanner::ScanRangeInputStream::read(void* buf, 
uint64_t length,
> It's unfortunate that the ORC code was designed to issue only synchronous r
yes, quite a pity. So I hope we can include the ORC codes and modify the logics 
of getting input.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@223
PS3, Line 223:     memset(buf, 0, length);
> Is the memset needed? If so, should document why in a comment.
Just let the orc-reader to throw an exception for parse error if it read this 
later. Not needed actually since we throw the following exception immediately.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@224
PS3, Line 224:     throw std::runtime_error("Cannot read from file: " + 
status.GetDetail());
> The fact we need to use exceptions is unfortunate. I need to think a bit mo
I don't like exceptions as well. One solution is to insert the stream context 
into the orc-reader, and check if cancel in the loops inside it.

This need to modify codes in the ORC-reader so I haven't started yet. If you 
guys decide to include the ORC codes, I can implement this.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@230
PS3, Line 230:       stripe_idx_(-1),
> For the constants here, we should initialise them in the class definition (
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@257
PS3, Line 257:   num_cols_counter_ =
> What do you think about adding Orc to some of the counter names? At least t
Good idea! That would be more clear.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@259
PS3, Line 259:   num_stats_filtered_stripes_counter_ =
> This isn't used - let's remove it until we add the stats filtering feature.
Sure


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@333
PS3, Line 333:     if (col_type.type == TYPE_ARRAY) {
> Are there any tests to make sure the ORC reader won't crash when scanning c
Complex types will be skipped here. Their column ids won't be set into the 
ReaderOptions. However, I'll add a test for this.

Sure, will create more JIRAs.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@677
PS3, Line 677:   vector<uint8_t> decompressed_footer_buffer;
> I'm concerned that this buffer memory isn't tracked and could be arbitraril
You're right! Will review these logics deeper.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@679
PS3, Line 679: isCompressed
> is_compressed
done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@884
PS3, Line 884: dynamic_cast
> Can this be a static_cast? AFAICT dynamic_cast is unnecessary in most place
Sure! Excited to know that you start to test the scanner's performance!


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@913
PS3, Line 913:           case 
TYPE_TINYINT:*(reinterpret_cast<int8_t*>(slot_val_ptr)) = val;
> It might be more readable to put the case body on a separate line.
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@954
PS3, Line 954:             
dst_batch->tuple_data_pool()->Allocate(val_ptr->len));
> Should use TryAllocate() here and return an error on failure to allocate. A
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@1038
PS3, Line 1038: NULL
> Prefer nullptr in new code (I know NULL is all over the existing code :)).
Sure, forgot this one


http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/bin/run-hive-server.sh
File testdata/bin/run-hive-server.sh:

http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/bin/run-hive-server.sh@75
PS3, Line 75:   HADOOP_HEAPSIZE="2048" hive --service hiveserver2 > 
${LOGDIR}/hive-server2.out 2>&1 &
> Can you comment how 2048 was chosen for future reference?
I increase it to avoid OOM when loading tables like widerow. 1024 is not enough 
though.



--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Fri, 02 Feb 2018 15:21:34 +0000
Gerrit-HasComments: Yes

Reply via email to