Quanlong has done a bunch of work implementing an ORC scanner. I've been playing around with it and it works pretty nicely - I can load and run TPC-H with no problem!
It's a big addition to Impala and the integration with the external library has caused some implementation challenges, so I wanted to summarise the issues so that other community members can weigh in.

*1. in-tree versus third-party library*

The current patchset imports the ORC source code into the Impala source tree, rather than treating it as an external library. My feeling is that, initially, we don't want to take on the burden of maintaining a fork of the library, so we're best off keeping it external. This way we can upgrade the library to pick up new ORC features instead of having to bring the changes into the Impala codebase. I do think we should generally try to collaborate with other Apache projects and contribute back improvements where possible, instead of forking their code. We could re-evaluate this at some point, but reducing the maintenance burden for now seems most important.

*2. C++ Exceptions*

The ORC library relies on exceptions for error handling. Our coding style guide disallows exceptions, but we do have to deal with them in a few places interfacing with external libraries like boost (which sucks). If we're interfacing with the library, I don't see how we can avoid throwing and catching exceptions at the boundaries with the library, which is what Quanlong's patch does. I looked at the ORC source code and the exceptions seem fairly pervasive. My feeling is that we can live with this, provided that we're very careful to catch exceptions at the library boundary and it is only hdfs-orc-scanner.cc that has to deal with the exceptions.

*3. What is the quality bar?*

We definitely want all the functional tests to pass, same as other file formats like Avro. I also asked Quanlong to add it to the fuzz test to get that extra coverage. It would be helpful if others could confirm that we have enough test coverage.

*4. What is the bar for perf and resource consumption?*

The initial version of the scanner won't match Parquet in these categories, mainly because it isn't tightly integrated with the Impala runtime and is missing a few things like codegen. There may be some limits to how far we can improve this without modifying the ORC library. I think it's important that the code is stable and functionally correct, but we can live with only OK performance. We should clearly document that Parquet is the more performant format to avoid confusion for users.

What does everyone think?
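
To make point 2 a bit more concrete, here is a rough sketch of what "catch at the library boundary" could look like. None of this is taken from Quanlong's patch or from the ORC API: `Status` is a cut-down, hypothetical stand-in for our status type, and `DecodeRowCount` is a made-up function playing the role of a library call that throws. The idea is only that each call into the library is wrapped so that any exception is converted into an error status before it can propagate into the rest of the runtime.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical, minimal stand-in for a status type; just enough for the sketch.
class Status {
 public:
  static Status OK() { return Status(); }
  static Status Error(std::string msg) { return Status(std::move(msg)); }
  bool ok() const { return ok_; }
  const std::string& msg() const { return msg_; }

 private:
  Status() : ok_(true) {}
  explicit Status(std::string msg) : ok_(false), msg_(std::move(msg)) {}
  bool ok_;
  std::string msg_;
};

// Made-up stand-in for a call into an exception-throwing external library.
int DecodeRowCount(int stripe_idx) {
  if (stripe_idx < 0) throw std::runtime_error("bad stripe index");
  return 1024;
}

// Boundary wrapper: the only place that sees exceptions. Callers get a Status.
Status ReadStripe(int stripe_idx, int* num_rows) {
  try {
    *num_rows = DecodeRowCount(stripe_idx);
  } catch (const std::exception& e) {
    // Known exception types: keep the library's message for diagnostics.
    return Status::Error(std::string("ORC library error: ") + e.what());
  } catch (...) {
    // Anything else: still must not escape the boundary.
    return Status::Error("ORC library error: unknown exception");
  }
  return Status::OK();
}
```

A caller would then check `ReadStripe(idx, &rows).ok()` like any other status-returning function and never see a `try` block. The trailing `catch (...)` is there because a single missed exception type would otherwise unwind through code that was never written to be exception-safe.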