[ 
https://issues.apache.org/jira/browse/ORC-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated ORC-1143:
--------------------------------
    Description: 
Queries like "select count(a) from tbl" only need to check whether each 
column value is non-NULL. ORC files already store a PRESENT stream for each 
column (though it's optional), so we can serve such a request by reading only 
the PRESENT stream.
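
To illustrate why the PRESENT stream alone is enough, here is a minimal sketch that counts non-NULL rows directly from the stream bytes. It assumes the bytes are already decompressed and follow the ORC boolean encoding (bits packed most-significant-first, then byte run-length encoded). The function names are hypothetical and are not part of the ORC C++ API.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Count the set bits among the first `bits` bits of `byte`, MSB first.
static uint64_t countLeadingSetBits(uint8_t byte, unsigned bits) {
  uint64_t n = 0;
  for (unsigned i = 0; i < bits; ++i) {
    n += (byte >> (7 - i)) & 1u;
  }
  return n;
}

// Hypothetical helper: returns how many of the first `numRows` rows are
// non-NULL, given an uncompressed, byte-RLE-encoded PRESENT stream.
uint64_t countNonNull(const std::vector<uint8_t>& encoded, uint64_t numRows) {
  uint64_t remaining = numRows;
  uint64_t count = 0;
  size_t pos = 0;
  while (remaining > 0) {
    if (pos >= encoded.size()) {
      throw std::runtime_error("truncated PRESENT stream");
    }
    int8_t header = static_cast<int8_t>(encoded[pos++]);
    // header >= 0: one byte repeated (header + 3) times.
    // header <  0: (-header) literal bytes follow.
    bool isRun = header >= 0;
    size_t length = static_cast<size_t>(isRun ? header + 3 : -header);
    for (size_t i = 0; i < length && remaining > 0; ++i) {
      if (pos >= encoded.size()) {
        throw std::runtime_error("truncated PRESENT stream");
      }
      uint8_t byte = isRun ? encoded[pos] : encoded[pos++];
      unsigned bits = remaining < 8 ? static_cast<unsigned>(remaining) : 8u;
      count += countLeadingSetBits(byte, bits);
      remaining -= bits;
    }
    if (isRun) {
      ++pos;  // step past the repeated byte
    }
  }
  return count;
}
```

For example, the two bytes 0xff 0x80 describe one literal byte whose first bit is set, so `countNonNull({0xff, 0x80}, 8)` yields 1.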

Currently, ReadIntent has two values:
{code:cpp}
enum ReadIntent {
  ReadIntent_ALL = 0,

  // Only read the offsets of selected type. Do not read the children types.
  ReadIntent_OFFSETS = 1
};{code}
We can extend it with a new value such as ReadIntent_PRESENT. The corresponding 
ColumnVectorBatch would then contain only valid notNull results.
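
As a rough sketch of the proposed extension, the enum could gain a third value that a reader uses to decide which streams of a column to open. The value 2, the StreamKind stand-in, and shouldReadStream are assumptions made for illustration only. They are not the actual ORC C++ implementation, and the handling of child columns under ReadIntent_OFFSETS is not modeled here.

```cpp
enum ReadIntent {
  ReadIntent_ALL = 0,

  // Only read the offsets of selected type. Do not read the children types.
  ReadIntent_OFFSETS = 1,

  // Hypothetical: only read the PRESENT stream of the selected type.
  ReadIntent_PRESENT = 2
};

// Simplified stand-in for the stream kinds of a single column.
enum class StreamKind { Present, Data, Length, DictionaryData };

// Hypothetical helper: decide whether a stream must be read for an intent.
bool shouldReadStream(ReadIntent intent, StreamKind kind) {
  switch (intent) {
    case ReadIntent_PRESENT:
      return kind == StreamKind::Present;
    case ReadIntent_OFFSETS:
      // Simplification: offsets come from the LENGTH stream plus nullness.
      return kind == StreamKind::Present || kind == StreamKind::Length;
    case ReadIntent_ALL:
      return true;
  }
  return true;  // unknown intents fall back to reading everything
}
```

With such a check in place, a string column read with ReadIntent_PRESENT would skip its data, length, and dictionary streams entirely.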

This would help most on string columns, where skipping the value data 
typically saves the most I/O. E.g. checking how many customers have an 
email address:
{code:sql}
select count(email_address) from tpcds.customer {code}
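
On the consumer side, such a count could be computed from the notNull flags alone. The sketch below uses a hypothetical PresentOnlyBatch struct that loosely mirrors the numElements, hasNulls, and notNull fields of ColumnVectorBatch. It is a simplified stand-in, not the real class.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a batch read with ReadIntent_PRESENT.
struct PresentOnlyBatch {
  uint64_t numElements = 0;
  bool hasNulls = false;
  std::vector<char> notNull;  // 1 = value present, 0 = NULL
};

// Evaluates count(column) over all batches using only the null flags.
uint64_t countColumn(const std::vector<PresentOnlyBatch>& batches) {
  uint64_t total = 0;
  for (const auto& batch : batches) {
    if (!batch.hasNulls) {
      // No NULLs in this batch, so every row counts.
      total += batch.numElements;
      continue;
    }
    for (uint64_t i = 0; i < batch.numElements; ++i) {
      total += batch.notNull[i] != 0 ? 1 : 0;
    }
  }
  return total;
}
```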

> [C++] Support reading the PRESENT stream without reading the column data
> ------------------------------------------------------------------------
>
>                 Key: ORC-1143
>                 URL: https://issues.apache.org/jira/browse/ORC-1143
>             Project: ORC
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Quanlong Huang
>            Priority: Major


