I quite agree with Todd. Shipping it as an experimental feature can accelerate its growth: there may be many corner cases that we couldn't imagine ourselves and that will only surface through real users.
> *1. in-tree versus third-party library*

Though I imported the ORC source code in my patch, I agree that we should start by linking it as an external library (I'm now working in this direction). When the feature is mature and we have enough test coverage, we can happily implement our own native ORC support without fear of introducing bugs. This is how other systems improved their ORC support.

Presto initially supported ORC by importing the Hive jars. In version 0.80 they implemented the first version of their own ORC reader ( https://prestodb.io/docs/current/release/release-0.80.html#new-hive-orc-reader ) and released it with a query-level option to turn it off. This blog lists the benefits of implementing native ORC support: https://code.facebook.com/posts/370832626374903/even-faster-data-at-the-speed-of-presto-orc/

Spark added ORC support in version 1.4 (ticket: https://issues.apache.org/jira/browse/SPARK-2883 , blog: https://hortonworks.com/blog/bringing-orc-support-into-apache-spark/ ). Recently they released their native ORC support (ticket: https://issues.apache.org/jira/browse/SPARK-20682 , announcement: https://mail-archives.apache.org/mod_mbox/spark-user/201801.mbox/%3ccafu6yh04+gbe8mmucm3cwmnsptu6mx+sxr5nehpael8zvu2...@mail.gmail.com%3E ).

> *2. C++ Exceptions*

I think keeping them within the scope of hdfs-orc-scanner.cc is acceptable.

> *3. What is the quality bar?*

I'll add the fuzz test and more tests for invalid or corrupt ORC files.

> *4. What is the bar for perf and resource consumption?*

I think we'll eventually need to implement our own ORC reader to match the performance of Parquet. For now we can concentrate on the correctness and robustness of this feature. Possible performance improvements include:

* Runtime filters (done)
* Predicate push down to the stripe level
* Codegen at the scanner level
* Async I/O?
At 2018-02-10 09:41:21, "Todd Lipcon" <t...@cloudera.com> wrote:
>On Fri, Feb 9, 2018 at 5:05 PM, Tim Armstrong <tarmstr...@cloudera.com>
>wrote:
>
>> Quanlong has done a bunch of work implementing an ORC scanner. I've been
>> playing around with it and it works pretty nicely - I can load and run
>> TPC-H with no problem!
>>
>> It's a big addition to Impala and the integration with the external library
>> has caused some implementation challenges, so I wanted to summarise the
>> issues so that other community members can weigh in.
>>
>> *1. in-tree versus third-party library*
>> The current patchset imports the ORC source code into the Impala source
>> tree, rather than treating it as an external library.
>>
>> My feeling is that, initially, we don't want to take on the burden of
>> maintaining a fork of the library, so we're best off keeping it external.
>> This way we can upgrade the library to pick up new ORC features instead of
>> having to bring the changes into the Impala codebase. I do think we should
>> generally try to collaborate with other Apache projects and contribute back
>> improvements where possible, instead of forking their code.
>>
>> We could re-evaluate this at some point, but reducing the maintenance
>> burden for now seems most important.
>>
>> *2. C++ Exceptions*
>> The ORC library relies on exceptions for error handling. Our coding style
>> guide disallows exceptions, but we do have to deal with them in a few
>> places interfacing with external libraries like boost (which sucks). If
>> we're interfacing with the library, I don't see how we can avoid throwing
>> and catching exceptions at the boundaries with the library, which is what
>> Quanlong's patch does. I looked at the ORC source code and the exceptions
>> seem fairly pervasive.
>>
>> My feeling is that we can live with this, provided that we're very careful
>> to catch exceptions at the library boundary and it is only
>> hdfs-orc-scanner.cc that has to deal with the exceptions.
>>
>> *3. What is the quality bar?*
>> We definitely want all the functional tests to pass, same as other file
>> formats like Avro. I also asked Quanlong to add it to the fuzz test to get
>> that extra coverage. It would be helpful if others could confirm that we
>> have enough test coverage.
>>
>> *4. What is the bar for perf and resource consumption?*
>> The initial version of the scanner won't match Parquet in these categories,
>> mainly because it isn't tightly integrated with the Impala runtime and is
>> missing a few things like codegen. There may be some limits to how far we
>> can improve this without modifying the ORC library.
>>
>
>For quality, perf, etc, I think it would be reasonable to ship it for a
>release or two as an explicitly labeled experimental feature. I think the
>experience in the past is that most file format integrations appear to work
>with synthetic datasets but have new and exciting failures and
>compatibility issues in the wild once run against the variety of actual
>implementations that produce the format.
>
>One thing we have found helpful in the Kudu community is to actually
>enforce the "experimental" nature of new features for a couple of releases.
>We do this by having a gflag called --unlock-experimental-flags. Users who
>decide to flip that on are explicitly acknowledging that they're treading
>on some unproven ground and they might get crashes, correctness issues, etc.
>
>Of course the goal should be that, if after a release or two of use the
>feedback from users is that it works well and few issues have been found,
>I'd expect it to be marked non-experimental.
>
>-Todd
>--
>Todd Lipcon
>Software Engineer, Cloudera