Quanlong has done a bunch of work implementing an ORC scanner. I've been
playing around with it and it works pretty nicely - I can load and run
TPC-H with no problem!

It's a big addition to Impala and the integration with the external library
has caused some implementation challenges, so I wanted to summarise the
issues so that other community members can weigh in.

*1. In-tree versus third-party library*
The current patchset imports the ORC source code into the Impala source
tree, rather than treating it as an external library.

My feeling is that, initially, we don't want to take on the burden of
maintaining a fork of the library, so we're best off keeping it external.
This way we can upgrade the library to pick up new ORC features instead of
having to bring the changes into the Impala codebase. I do think we should
generally try to collaborate with other Apache projects and contribute back
improvements where possible, instead of forking their code.

We could re-evaluate this at some point, but reducing the maintenance
burden for now seems most important.

*2. C++ Exceptions*
The ORC library relies on exceptions for error handling. Our coding style
guide disallows exceptions, but we do have to deal with them in a few
places where we interface with external libraries like Boost (which sucks). If
we're interfacing with the library, I don't see how we can avoid throwing
and catching exceptions at the boundaries with the library, which is what
Quanlong's patch does. I looked at the ORC source code and the exceptions
seem fairly pervasive.

My feeling is that we can live with this, provided that we're very careful
to catch exceptions at the library boundary and it is only
hdfs-orc-scanner.cc that has to deal with the exceptions.
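To make the boundary idea concrete, here is a minimal sketch of the pattern, not the code in Quanlong's patch. The `Status` class is a simplified stand-in for illustration (not Impala's real class), and `CallLibrary` and `FakeLibraryRead` are hypothetical names. The point is that every call into the library goes through one wrapper that turns any exception into a status value, so nothing propagates past the scanner.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>

// Simplified stand-in for a status object (hypothetical, for illustration only).
class Status {
 public:
  static Status OK() { return Status(); }
  explicit Status(std::string msg) : ok_(false), msg_(std::move(msg)) {}
  bool ok() const { return ok_; }
  const std::string& msg() const { return msg_; }

 private:
  Status() : ok_(true) {}
  bool ok_;
  std::string msg_;
};

// Runs a callable that may throw and converts any exception into a Status.
// Hypothetical helper: all calls into the external library would go through it.
template <typename Fn>
Status CallLibrary(Fn&& fn) {
  try {
    std::forward<Fn>(fn)();
    return Status::OK();
  } catch (const std::exception& e) {
    return Status(std::string("ORC library error: ") + e.what());
  } catch (...) {
    // Catch-all so that non-standard exception types cannot escape either.
    return Status("ORC library error: unknown exception");
  }
}

// Hypothetical stand-in for a library call that reports failure by throwing.
inline void FakeLibraryRead(bool fail) {
  if (fail) throw std::runtime_error("bad stripe footer");
}
```

A caller would write something like `Status s = CallLibrary([] { FakeLibraryRead(true); });` and then check `s.ok()` as usual, so the rest of the code never sees an exception.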

*3. What is the quality bar?*
We definitely want all the functional tests to pass, same as other file
formats like Avro. I also asked Quanlong to add it to the fuzz test to get
that extra coverage. It would be helpful if others could confirm that we
have enough test coverage.

*4. What is the bar for perf and resource consumption?*
The initial version of the scanner won't match Parquet in these categories,
mainly because it isn't tightly integrated with the Impala runtime and is
missing a few things like codegen. There may be some limits to how far we
can improve this without modifying the ORC library.

I think it's important that the code is stable and functionally correct,
but we can live with only OK performance. We should clearly document that
Parquet is the more performant format to avoid confusion for users.

What does everyone think?
