I quite agree with Todd. Shipping it as an experimental feature can accelerate its growth: there may be many corner cases that we couldn't imagine ourselves and that will only surface through real users.
> *1. in-tree versus third-party library*

Though I imported the ORC source code in my patch, I agree that we should start by linking it as an external library (I'm now working in this direction). When the feature is mature and we have enough test coverage, we can happily implement our own native ORC support without fear of introducing bugs. This is how other systems improved their ORC support.

Presto initially supported ORC by importing the Hive jars. In version 0.80 they implemented the first version of their own ORC reader ( https://prestodb.io/docs/current/release/release-0.80.html#new-hive-orc-reader ) and released it with a query-level option to turn it off. This blog lists the benefits of implementing native ORC support: https://code.facebook.com/posts/370832626374903/even-faster-data-at-the-speed-of-presto-orc/

Spark added ORC support in version 1.4 (ticket: https://issues.apache.org/jira/browse/SPARK-2883 , blog: https://hortonworks.com/blog/bringing-orc-support-into-apache-spark/ ). Recently they released their native ORC support (ticket: https://issues.apache.org/jira/browse/SPARK-20682 , announcement: https://mail-archives.apache.org/mod_mbox/spark-user/201801.mbox/%3ccafu6yh04+gbe8mmucm3cwmnsptu6mx+sxr5nehpael8zvu2...@mail.gmail.com%3E ).

> *2. C++ Exceptions*

I think keeping them within the scope of hdfs-orc-scanner.cc is acceptable.

> *3. What is the quality bar?*

I'll add the fuzz test and more tests for invalid or corrupt ORC files.

> *4. What is the bar for perf and resource consumption?*

I think we'll eventually need to implement our own ORC reader to match the performance of Parquet. For now we can concentrate on the correctness and robustness of this feature. Possible performance improvements include:

* Runtime filters (done)
* Predicate push down to the stripe level
* Codegen at the scanner level
* Async I/O?
At 2018-02-10 09:41:21, "Todd Lipcon" <t...@cloudera.com> wrote:
>On Fri, Feb 9, 2018 at 5:05 PM, Tim Armstrong <tarmstr...@cloudera.com>
>wrote:
>
>> Quanlong has done a bunch of work implementing an ORC scanner. I've been
>> playing around with it and it works pretty nicely - I can load and run
>> TPC-H with no problem!
>>
>> It's a big addition to Impala and the integration with the external library
>> has caused some implementation challenges, so I wanted to summarise the
>> issues so that other community members can weigh in.
>>
>> *1. in-tree versus third-party library*
>> The current patchset imports the ORC source code into the Impala source
>> tree, rather than treating it as an external library.
>>
>> My feeling is that, initially, we don't want to take on the burden of
>> maintaining a fork of the library, so we're best off keeping it external.
>> This way we can upgrade the library to pick up new ORC features instead of
>> having to bring the changes into the Impala codebase. I do think we should
>> generally try to collaborate with other Apache projects and contribute back
>> improvements where possible, instead of forking their code.
>>
>> We could re-evaluate this at some point, but reducing the maintenance
>> burden for now seems most important.
>>
>> *2. C++ Exceptions*
>> The ORC library relies on exceptions for error handling. Our coding style
>> guide disallows exceptions, but we do have to deal with them in a few
>> places interfacing with external libraries like boost (which sucks). If
>> we're interfacing with the library, I don't see how we can avoid throwing
>> and catching exceptions at the boundaries with the library, which is what
>> Quanlong's patch does. I looked at the ORC source code and the exceptions
>> seem fairly pervasive.
>>
>> My feeling is that we can live with this, provided that we're very careful
>> to catch exceptions at the library boundary and it is only
>> hdfs-orc-scanner.cc that has to deal with the exceptions.
>>
>> *3. What is the quality bar?*
>> We definitely want all the functional tests to pass, same as other file
>> formats like Avro. I also asked Quanlong to add it to the fuzz test to get
>> that extra coverage. It would be helpful if others could confirm that we
>> have enough test coverage.
>>
>> *4. What is the bar for perf and resource consumption?*
>> The initial version of the scanner won't match Parquet in these categories,
>> mainly because it isn't tightly integrated with the Impala runtime and is
>> missing a few things like codegen. There may be some limits to how far we
>> can improve this without modifying the ORC library.
>>
>
>For quality, perf, etc, I think it would be reasonable to ship it for a
>release or two as an explicitly labeled experimental feature. I think the
>experience in the past is that most file format integrations appear to work
>with synthetic datasets but have new and exciting failures and
>compatibility issues in the wild once run against the variety of actual
>implementations that produce the format.
>
>One thing we have found helpful in the Kudu community is to actually
>enforce the "experimental" nature of new features for a couple of releases.
>We do this by having a gflag called --unlock-experimental-flags. Users who
>decide to flip that on are explicitly acknowledging that they're treading
>on some unproven ground and they might get crashes, correctness issues, etc.
>
>Of course the goal should be that, if after a release or two of use the
>feedback from users is that it works well and few issues have been found,
>I'd expect it to be marked non-experimental.
>
>-Todd
>--
>Todd Lipcon
>Software Engineer, Cloudera