[
https://issues.apache.org/jira/browse/ORC-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Quanlong Huang updated ORC-1143:
--------------------------------
Description:
Queries like "select count(a) from tbl" just require checking whether the
column value is not NULL. ORC files already have the PRESENT stream for each
column (though it's optional). We can serve the request by just reading the
PRESENT stream.
Currently, ReadIntent has two items:
{code:java}
enum ReadIntent {
ReadIntent_ALL = 0,
// Only read the offsets of selected type. Do not read the children types.
ReadIntent_OFFSETS = 1
};{code}
We can extend it to add an item like ReadIntent_PRESENT. The corresponding
ColumnVectorBatch will only have valid notNull results.
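A minimal sketch of how this could look, under stated assumptions: the value ReadIntent_PRESENT = 2, the PresentOnlyBatch struct and the countNotNull helper are all hypothetical and not part of the current ORC C++ API. The struct only mimics the numElements, hasNulls and notNull fields of ColumnVectorBatch, so the snippet compiles without the ORC headers.

```cpp
#include <cstdint>
#include <vector>

// The existing enum from the description, plus the proposed item.
enum ReadIntent {
  ReadIntent_ALL = 0,
  // Only read the offsets of selected type. Do not read the children types.
  ReadIntent_OFFSETS = 1,
  // Proposed (hypothetical name and value): only read the PRESENT stream.
  ReadIntent_PRESENT = 2
};

// Simplified stand-in for orc::ColumnVectorBatch (assumption, not the real
// class). With ReadIntent_PRESENT only these fields would carry valid data.
struct PresentOnlyBatch {
  uint64_t numElements = 0;
  bool hasNulls = false;      // false when the column has no NULLs in the batch
  std::vector<char> notNull;  // 1 = value present, 0 = NULL
};

// Evaluates count(col) for one batch using only the notNull information.
uint64_t countNotNull(const PresentOnlyBatch& batch) {
  if (!batch.hasNulls) {
    // No NULLs recorded for this batch: every row counts.
    return batch.numElements;
  }
  uint64_t count = 0;
  for (uint64_t i = 0; i < batch.numElements; ++i) {
    if (batch.notNull[i]) {
      ++count;
    }
  }
  return count;
}
```

A caller would sum countNotNull over all batches returned by the reader to answer the count query without touching the column data.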
This would help even more on string columns, e.g. when checking how many
customers have an email address:
{code:sql}
select count(email_address) from tpcds.customer {code}
was:
Queries like "select count(a) from tbl" just require checking whether the
column value is not NULL. ORC files already have the PRESENT stream for each
column (though it's optional). We can serve the request by just reading the
PRESENT stream.
Currently, ReadIntent has two items:
{code:java}
enum ReadIntent {
ReadIntent_ALL = 0,
// Only read the offsets of selected type. Do not read the children types.
ReadIntent_OFFSETS = 1
};{code}
We can extend it to add an item like ReadIntent_PRESENT. The corresponding
ColumnVectorBatch will only have valid notNull results.
> [C++] Support reading the PRESENT stream without reading the column data
> ------------------------------------------------------------------------
>
> Key: ORC-1143
> URL: https://issues.apache.org/jira/browse/ORC-1143
> Project: ORC
> Issue Type: New Feature
> Components: C++
> Reporter: Quanlong Huang
> Priority: Major
>
> Queries like "select count(a) from tbl" just require checking whether the
> column value is not NULL. ORC files already have the PRESENT stream for each
> column (though it's optional). We can serve the request by just reading the
> PRESENT stream.
> Currently, ReadIntent has two items:
> {code:java}
> enum ReadIntent {
> ReadIntent_ALL = 0,
> // Only read the offsets of selected type. Do not read the children types.
> ReadIntent_OFFSETS = 1
> };{code}
> We can extend it to add an item like ReadIntent_PRESENT. The corresponding
> ColumnVectorBatch will only have valid notNull results.
> This would help even more on string columns, e.g. when checking how many
> customers have an email address:
> {code:sql}
> select count(email_address) from tpcds.customer {code}