[ 
https://issues.apache.org/jira/browse/ORC-350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16446397#comment-16446397
 ] 

Dain Sundstrom commented on ORC-350:
------------------------------------

I think we should add a truncated flag to the stats so that the writer can 
simply chop the data when it is large.  The PPD system can still use the prefix 
value to filter min/max ranges.
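
A minimal sketch of how a truncated flag could work, assuming the stats are 
compared as raw UTF-8 bytes. Every name here (TruncatedStatsSketch, Bound, 
truncate, mightContain) is hypothetical and is not part of the ORC code base. 
The idea is that a truncated min is still a valid lower bound, while a truncated 
max only rules out values that sort above the prefix and do not start with it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch; none of these names come from the ORC code base. */
class TruncatedStatsSketch {

    /** A min or max bound, possibly cut down to a prefix of the real value. */
    static final class Bound {
        final byte[] value;
        final boolean truncated;

        Bound(byte[] value, boolean truncated) {
            this.value = value;
            this.truncated = truncated;
        }
    }

    /** Writer side: keep at most maxLen bytes and record whether bytes were dropped. */
    static Bound truncate(String full, int maxLen) {
        byte[] bytes = full.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxLen) {
            return new Bound(bytes, false);
        }
        // The cut may split a multi-byte character; that is fine for byte-wise comparison.
        return new Bound(Arrays.copyOf(bytes, maxLen), true);
    }

    /** Lexicographic comparison treating each byte as unsigned. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    static boolean startsWith(byte[] value, byte[] prefix) {
        if (value.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (value[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reader side: false means an equality predicate on literal can skip the row group. */
    static boolean mightContain(String literal, Bound min, Bound max) {
        byte[] v = literal.getBytes(StandardCharsets.UTF_8);
        // A prefix of the true min sorts at or before it, so it is still a safe lower bound.
        if (compareUnsigned(v, min.value) < 0) {
            return false;
        }
        if (compareUnsigned(v, max.value) <= 0) {
            return true;
        }
        // v sorts after the stored max. If the max was truncated, the true max starts
        // with the stored prefix, so v can only be in range if it shares that prefix.
        return max.truncated && startsWith(v, max.value);
    }
}
```

With "zebra-crossing" truncated to "zebr", the literal "zebra-zzz" cannot be 
ruled out, but "zulu" and "aardvark" still can.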

Also, we should add min/max support for varbinary.  In SQL, varbinary doesn't 
mean random binary data; it means bytes without a specified character encoding. 
It is common for people to use varbinary to avoid expensive, unnecessary 
character encodings, but the data is still usable with min/max (and dictionary 
encodings).
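
As a rough illustration of min/max over varbinary, the sketch below tracks the 
bounds with an unsigned, byte-wise lexicographic comparison. The class 
BinaryMinMaxSketch and its methods are hypothetical and are not ORC's binary 
statistics API; the choice of unsigned ordering is also an assumption.

```java
/** Hypothetical sketch; not ORC's binary column statistics API. */
class BinaryMinMaxSketch {
    private byte[] min; // null until the first value arrives
    private byte[] max;

    /** Lexicographic comparison treating each byte as unsigned (0..255). */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        // All shared bytes are equal, so the shorter array sorts first.
        return a.length - b.length;
    }

    /** Fold one column value into the running min and max. */
    void update(byte[] value) {
        if (min == null || compareUnsigned(value, min) < 0) {
            min = value.clone();
        }
        if (max == null || compareUnsigned(value, max) > 0) {
            max = value.clone();
        }
    }

    byte[] getMin() {
        return min == null ? null : min.clone();
    }

    byte[] getMax() {
        return max == null ? null : max.clone();
    }
}
```

Unsigned comparison matters here: Java bytes are signed, so a naive comparison 
would sort 0x80 before 0x01.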

> Optionally disable/specify indexes for columns
> ----------------------------------------------
>
>                 Key: ORC-350
>                 URL: https://issues.apache.org/jira/browse/ORC-350
>             Project: ORC
>          Issue Type: Sub-task
>            Reporter: Prasanth Jayachandran
>            Priority: Major
>
> There are many cases where entire xml or big json is stored as string column. 
> If we autogenerate indexes on those columns, we often run into issues with 
> protobuf stream explosion. The only workaround for now is to change from 
> string to binary. It will be good to have an option to disable indexes on 
> specific columns. 
> Regardless, I think we should have max limits on string column statistics. If 
> that limit is exceeded PPD should handle it accordingly (by returning 
> YES_NO_NULL).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
