John Omernik created DRILL-4758:
-----------------------------------
Summary: Option for Lazy/Late Materialization of columns during
query with Parquet
Key: DRILL-4758
URL: https://issues.apache.org/jira/browse/DRILL-4758
Project: Apache Drill
Issue Type: Improvement
Components: Storage - Parquet
Affects Versions: 1.6.0
Reporter: John Omernik
On tables stored as Parquet with lots of columns, it appears that all columns
requested in the select statement are materialized for every row, regardless of
whether the row passes the where clause filter.
For example, on a table with 100 columns,
select field1 from table where id = 123 and client BETWEEN 10 and 100
scans a large amount of data (2 TB) in 30 seconds and returns no rows.
However,
select * from table where id = 123 and client BETWEEN 10 and 100
takes 15 minutes to run on the same amount of data, while still returning
no rows.
If there were an option (perhaps it should be the default) to evaluate the
filter first and materialize the remaining columns only for rows that match it,
it would be a huge boon to performance.
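
To make the idea concrete, here is a minimal sketch in Python of the two
strategies. It is not Drill's Parquet reader: ColumnStore, select_eager,
select_late and demo are hypothetical names invented for this illustration,
and the "table" is an in-memory dict of lists that simply counts how many
values get decoded. Real Parquet data is laid out in row groups, column chunks
and pages, so an actual reader would skip at that granularity rather than
fetch individual rows.

```python
from typing import Callable, Dict, List, Optional, Sequence


class ColumnStore:
    """Toy columnar table (hypothetical) that counts decoded values."""

    def __init__(self, columns: Dict[str, List[object]]):
        self._columns = columns
        self.values_decoded = 0

    @property
    def num_rows(self) -> int:
        return len(next(iter(self._columns.values()))) if self._columns else 0

    def read_column(self, name: str,
                    row_ids: Optional[Sequence[int]] = None) -> List[object]:
        """Decode a whole column, or only the given row ids."""
        col = self._columns[name]
        if row_ids is None:
            self.values_decoded += len(col)
            return list(col)
        self.values_decoded += len(row_ids)
        return [col[i] for i in row_ids]


Predicate = Callable[[Dict[str, object]], bool]


def select_eager(store: ColumnStore, projection: Sequence[str],
                 predicate_cols: Sequence[str], predicate: Predicate):
    """Materialize every needed column for every row, then filter."""
    needed = list(dict.fromkeys(list(projection) + list(predicate_cols)))
    data = {name: store.read_column(name) for name in needed}
    rows = []
    for i in range(store.num_rows):
        if predicate({c: data[c][i] for c in predicate_cols}):
            rows.append({c: data[c][i] for c in projection})
    return rows


def select_late(store: ColumnStore, projection: Sequence[str],
                predicate_cols: Sequence[str], predicate: Predicate):
    """Decode only the filter columns first, then fetch the other
    projected columns just for the rows that passed the filter."""
    pred_data = {name: store.read_column(name) for name in predicate_cols}
    matching = [i for i in range(store.num_rows)
                if predicate({c: pred_data[c][i] for c in predicate_cols})]
    if not matching:
        return []  # nothing matched: no other column is ever decoded
    out = {}
    for name in projection:
        if name in pred_data:
            out[name] = [pred_data[name][i] for i in matching]
        else:
            out[name] = store.read_column(name, matching)
    return [{c: out[c][k] for c in projection} for k in range(len(matching))]


def demo():
    """100-column toy table and a filter that matches no rows."""
    n_rows = 1000
    columns = {"id": list(range(n_rows)),
               "client": [i % 200 for i in range(n_rows)]}
    for k in range(98):
        columns["field%d" % k] = [k] * n_rows
    projection = list(columns)  # stands in for "select *"

    def no_match(row):
        return row["id"] == 5000 and 10 <= row["client"] <= 100

    eager, late = ColumnStore(columns), ColumnStore(columns)
    eager_rows = select_eager(eager, projection, ["id", "client"], no_match)
    late_rows = select_late(late, projection, ["id", "client"], no_match)
    return eager_rows, late_rows, eager.values_decoded, late.values_decoded
```

In this toy run both strategies return no rows, but the eager path decodes all
100 columns while the late path decodes only the two filter columns. The
second pass in select_late is the extra step that could hurt narrow tables.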
Now, if this is a concern because queries on tables with a small number of
columns would incur an extra step, one option would be to use table options
(select with options) so that queries to certain tables would use this
behavior and queries to other tables would not. This is up for discussion, but
I think the first step is to discuss how something like this could be
achieved. The Impala project is also looking at this for Parquet files
(IMPALA-2017).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)