GitHub user tejasapatil commented on the issue:
    Earlier this year I had spent some time trying out Presto's ORC reader with 
    In a standalone benchmark, Presto's ORC reader was 3x faster than the one in
Hive. My experimental setup was to add a CLI around the ORC reader code in both
Hive and Presto. It would read a file from local disk and deserialize all the
    With that promising result, I tried hooking Presto's reader up to Spark.
Presto has its own notion of a "page" to manage memory and a "slice" for Unsafe
access to data. When I integrated it with Spark, I had to add a shim to convert
Spark objects to the corresponding Presto objects. This hurt performance.
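    To illustrate why such a shim costs time, here is a minimal sketch of a per-value conversion between two in-memory representations. It is not the code from the experiment: `ReaderSlice` and `EngineString` are hypothetical stand-ins (not real Presto or Spark classes), and the assumption is that the shim has to copy bytes out of a shared buffer for every value it hands over.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-ins only: none of these are real Presto or Spark types.
final class ShimSketch {

    /** Reader-side value: a view (offset + length) over a shared byte buffer. */
    static final class ReaderSlice {
        final byte[] buffer;
        final int offset;
        final int length;

        ReaderSlice(byte[] buffer, int offset, int length) {
            this.buffer = buffer;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Engine-side value: a string object that owns its own UTF-8 bytes. */
    static final class EngineString {
        final byte[] bytes;

        EngineString(byte[] bytes) {
            this.bytes = bytes;
        }

        @Override
        public String toString() {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    /** The shim: one allocation and one byte copy for every value converted. */
    static EngineString convert(ReaderSlice slice) {
        byte[] copy = Arrays.copyOfRange(
                slice.buffer, slice.offset, slice.offset + slice.length);
        return new EngineString(copy);
    }

    /** Converting a whole column repeats that cost once per row. */
    static EngineString[] convertColumn(ReaderSlice[] column) {
        EngineString[] out = new EngineString[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = convert(column[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] page = "helloworld".getBytes(StandardCharsets.UTF_8);
        ReaderSlice[] column = {
            new ReaderSlice(page, 0, 5),
            new ReaderSlice(page, 5, 5),
        };
        for (EngineString value : convertColumn(column)) {
            System.out.println(value);
        }
    }
}
```

    In this sketch the extra work scales with the number of values read, which is the kind of overhead that disappears when the reader writes directly into the engine's own representation.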
    Then I decided to fork a few classes from Presto to make it work directly
with Spark's internal data representations (e.g. `UTF8String`). My final numbers
(for end-to-end runs of Spark jobs) ranged from no gain to a 2x improvement.
Note that I had not integrated some features that Presto's reader supports
(namely vectorization and predicate pushdown), as that would have required more
forking. The measurements came from queries that scanned a large table, read
all the rows, and wrote them as-is to another table.
    Having done all this, my personal take is that forking classes is bad from
a maintenance standpoint. Presto's ORC reader is tightly coupled with the
engine's internal constructs, and refactoring it to make it generic is
non-trivial work. We could consider a native ORC reader in Spark (as Presto
has), which would be very performant, but the downside is that one has to stay
in sync with all upstream changes to the file format as it evolves.
    cc @rxin (I recall you were also interested in this at some point)
