[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

Tim Armstrong (Code Review) Wed, 28 Mar 2018 15:03:06 -0700

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )


Change subject: IMPALA-5717: Support for reading ORC data files
......................................................................


Patch Set 8:

(4 comments)

I think we're close to the point where we'd be best off merging this and 
getting the code into the hands of developers and the community so we can play 
around and see how it works in practice.

As you mentioned, you've run this in production so I'm sure a lot of issues 
were already flushed out.

http://gerrit.cloudera.org:8080/#/c/9134/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9134/5//COMMIT_MSG@19
PS5, Line 19:  - Most of the end-to-end tests can run on ORC format.
> Sorry that I should have mentioned this in the commit message. I've enabled
Haha ok, we can't do too much about that :). The ORC project should thank you 
for the free QA. Some of the existing file formats (Seq, RC) also have bugs 
like this with the fuzz test. I was hoping to avoid that from the start here, 
but we can't do too much about that.

Can you create a follow-on JIRA to enable the fuzz test and link the ORC bugs 
you found?

I think we should also add an --enable-orc-scanner feature flag (it can be 
on-by-default) that fails the query if it tries to scan ORC files, just so 
admins have a safety valve to disable it until we've flushed out all the issues.


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@72
PS5, Line 72:
> We need this as the above reason.
Ah makes sense :(. Naming member functions the same thing as a standard library 
function is pretty annoying!


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@259
PS5, Line 259:       VLOG_QUERY << "Unknown compression kind of 
orc::CompressionKind: " << kind;
> The ORC library can read ORC files using ZSTD compression. Here we just mis
I had a look and I don't think anything should be broken if you just add it to 
the end of the enum. Enabling ZSTD support to Parquet, other file formats, etc, 
requires making changes elsewhere, so ZSTD wouldn't get accidentally enable by 
adding it to the enum.

There's a similar situation with LZO where it is only supported for text/lzo, 
not Parquet.


http://gerrit.cloudera.org:8080/#/c/9134/7/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/7/be/src/exec/hdfs-orc-scanner.cc@77
PS7, Line 77: MemLimitExceeded
Maybe ThrowMemLimitExceeded()?

Also, there's an extra space here.



--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Wed, 28 Mar 2018 22:02:29 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

Reply via email to