Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9134 )
Change subject: IMPALA-5717: Support for reading ORC data files ...................................................................... Patch Set 8: (4 comments) I think we're close to the point where we'd be best off merging this and getting the code into the hands of developers and the community so we can play around and see how it works in practice. As you mentioned, you've run this in production so I'm sure a lot of issues were already flushed out. http://gerrit.cloudera.org:8080/#/c/9134/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9134/5//COMMIT_MSG@19 PS5, Line 19: - Most of the end-to-end tests can run on ORC format. > Sorry that I should have mentioned this in the commit message. I've enabled Haha ok, we can't do too much about that :). The ORC project should thank you for the free QA. Some of the existing file formats (Seq, RC) also have bugs like this with the fuzz test. I was hoping to avoid that from the start here, but we can't do too much about that. Can you create a follow-on JIRA to enable the fuzz test and link the ORC bugs you found? I think we should also add an --enable-orc-scanner feature flag (it can be on-by-default) that fails the query if it tries to scan ORC files, just so admins have a safety valve to disable it until we've flushed out all the issues. http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@72 PS5, Line 72: > We need this as the above reason. Ah makes sense :(. Naming member functions the same thing as a standard library function is pretty annoying! http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@259 PS5, Line 259: VLOG_QUERY << "Unknown compression kind of orc::CompressionKind: " << kind; > The ORC library can read ORC files using ZSTD compression. Here we just mis I had a look and I don't think anything should be broken if you just add it to the end of the enum. Enabling ZSTD support to Parquet, other file formats, etc, requires making changes elsewhere, so ZSTD wouldn't get accidentally enable by adding it to the enum. There's a similar situation with LZO where it is only supported for text/lzo, not Parquet. http://gerrit.cloudera.org:8080/#/c/9134/7/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/9134/7/be/src/exec/hdfs-orc-scanner.cc@77 PS7, Line 77: MemLimitExceeded Maybe ThrowMemLimitExceeded()? Also, there's an extra space here. -- To view, visit http://gerrit.cloudera.org:8080/9134 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4 Gerrit-Change-Number: 9134 Gerrit-PatchSet: 8 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Wed, 28 Mar 2018 22:02:29 +0000 Gerrit-HasComments: Yes
