On Fri, Feb 9, 2018 at 5:05 PM, Tim Armstrong <tarmstr...@cloudera.com> wrote:
> Quanlong has done a bunch of work implementing an ORC scanner. I've been
> playing around with it and it works pretty nicely - I can load and run
> TPC-H with no problem!
> It's a big addition to Impala and the integration with the external library
> has caused some implementation challenges, so I wanted to summarise the
> issues so that other community members can weigh in.
> *1. in-tree versus third-party library*
> The current patchset imports the ORC source code into the Impala source
> tree, rather than treating it as an external library.
> My feeling is that, initially, we don't want to take on the burden of
> maintaining a fork of the library, so we're best off keeping it external.
> This way we can upgrade the library to pick up new ORC features instead of
> having to bring the changes into the Impala codebase. I do think we should
> generally try to collaborate with other Apache projects and contribute back
> improvements where possible, instead of forking their code.
> We could re-evaluate this at some point, but reducing the maintenance
> burden for now seems most important.
> *2. C++ Exceptions*
> The ORC library relies on exceptions for error handling. Our coding style
> guide disallows exceptions, but we do have to deal with them in a few
> places interfacing with external libraries like boost (which sucks). If
> we're interfacing with the library, I don't see how we can avoid throwing
> and catching exceptions at the boundaries with the library, which is what
> Quanlong's patch does. I looked at the ORC source code and the exceptions
> seem fairly pervasive.
> My feeling is that we can live with this, provided that we're very careful
> to catch exceptions at the library boundary, so that only
> hdfs-orc-scanner.cc has to deal with them.
> *3. What is the quality bar?*
> We definitely want all the functional tests to pass, same as other file
> formats like Avro. I also asked Quanlong to add it to the fuzz test to get
> that extra coverage. It would be helpful if others could confirm that we
> have enough test coverage.
> *4. What is the bar for perf and resource consumption?*
> The initial version of the scanner won't match Parquet in these categories,
> mainly because it isn't tightly integrated with the Impala runtime and is
> missing a few things like codegen. There may be some limits to how far we
> can improve this without modifying the ORC library.
For quality, perf, etc., I think it would be reasonable to ship it for a
release or two as an explicitly labeled experimental feature. I think the
experience in the past is that most file format integrations appear to work
with synthetic datasets but have new and exciting failures and
compatibility issues in the wild once run against the variety of actual
implementations that produce the format.
One thing we have found helpful in the Kudu community is to actually
enforce the "experimental" nature of new features for a couple of releases.
We do this by having a gflag called --unlock-experimental-flags. Users who
decide to flip that on are explicitly acknowledging that they're treading
on some unproven ground and they might get crashes, correctness issues, etc.
Of course the goal is that, if after a release or two of use the feedback
from users is that it works well and few issues have been found, the feature
gets marked non-experimental.
Software Engineer, Cloudera