[
https://issues.apache.org/jira/browse/ORC-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gang Wu reassigned ORC-415:
---------------------------
> [C++] Fix writing ColumnStatistics
> ----------------------------------
>
> Key: ORC-415
> URL: https://issues.apache.org/jira/browse/ORC-415
> Project: ORC
> Issue Type: Bug
> Components: C++
> Reporter: Gang Wu
> Assignee: Gang Wu
> Priority: Major
>
> The current C++ ORC writer implementation has two issues with column statistics.
> 1. A new batch may overwrite the has_null info that a previous batch recorded in
> colIndexStatistics: if the new batch has no nulls but the previous batch had at
> least one null value, has_null is reset to false.
> {code:java}
> bool hasNull = false;
> if (!structBatch->hasNulls) {
>   colIndexStatistics->increase(numValues);
> } else {
>   const char* notNull = structBatch->notNull.data() + offset;
>   for (uint64_t i = 0; i < numValues; ++i) {
>     if (notNull[i]) {
>       colIndexStatistics->increase(1);
>     } else if (!hasNull) {
>       hasNull = true;
>     }
>   }
> }
> // Problem: this runs for every batch, so a batch without nulls
> // resets has_null to false even if an earlier batch set it to true.
> colIndexStatistics->setHasNull(hasNull);
> {code}
> 2. If a ColumnStatistics object has no non-null data, it has no sum/min/max
> info, so the protobuf serialization writes a generic ColumnStatistics instead
> of a type-specific one. The problem is that the reader will then have a hard
> time deserializing the ColumnStatistics correctly.
> {code:java}
> void toProtoBuf(proto::ColumnStatistics& pbStats) const override {
>   pbStats.set_hasnull(_stats.hasNull());
>   pbStats.set_numberofvalues(_stats.getNumberOfValues());
>   if (_stats.hasMinimum()) {
>     proto::DateStatistics* dateStatistics = pbStats.mutable_datestatistics();
>     dateStatistics->set_maximum(_stats.getMaximum());
>     dateStatistics->set_minimum(_stats.getMinimum());
>   }
> }
> {code}
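
For the first issue in the list above, one possible approach is to make the has_null flag "sticky": a batch may set it to true, but never back to false. The following is only a minimal sketch under that assumption, not the actual patch. ColumnStats, Batch, and addBatch are hypothetical stand-ins for the writer's real classes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the writer's per-column statistics object.
struct ColumnStats {
  uint64_t numberOfValues = 0;
  bool hasNull = false;
  void increase(uint64_t n) { numberOfValues += n; }
  void setHasNull(bool value) { hasNull = value; }
};

// Hypothetical stand-in for a column vector batch.
struct Batch {
  bool hasNulls = false;
  std::vector<char> notNull;  // 1 = value present, 0 = null
};

// Folds one batch into the statistics without discarding what earlier
// batches recorded: hasNull is only ever raised to true, never reset.
void addBatch(ColumnStats& stats, const Batch& batch,
              uint64_t offset, uint64_t numValues) {
  if (!batch.hasNulls) {
    // No nulls in this batch: count every value, leave hasNull untouched.
    stats.increase(numValues);
    return;
  }
  assert(offset + numValues <= batch.notNull.size());
  const char* notNull = batch.notNull.data() + offset;
  bool sawNull = false;
  for (uint64_t i = 0; i < numValues; ++i) {
    if (notNull[i]) {
      stats.increase(1);
    } else {
      sawNull = true;
    }
  }
  if (sawNull) {
    stats.setHasNull(true);
  }
}
```

With this shape, adding a batch that contains a null followed by a batch without nulls leaves hasNull set to true, which is the behavior the issue asks for.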
>
> The scope of this Jira is to fix these two problems.
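
For the second issue, one possible approach is to always create the type-specific sub-message, even when there is no minimum or maximum to store, so that the reader can still tell which statistics type was written. The sketch below illustrates that idea only. PbDateStatistics, PbColumnStatistics, and DateStats are hypothetical, simplified stand-ins for the generated protobuf classes and the internal statistics, not the real ORC types.

```cpp
#include <cstdint>

// Hypothetical stand-in for the generated proto::DateStatistics message.
struct PbDateStatistics {
  bool hasMinimum = false;
  bool hasMaximum = false;
  int32_t minimum = 0;
  int32_t maximum = 0;
};

// Hypothetical stand-in for the generated proto::ColumnStatistics message.
struct PbColumnStatistics {
  bool hasNull = false;
  uint64_t numberOfValues = 0;
  bool hasDateStatistics = false;
  PbDateStatistics dateStatistics;

  // Mimics a protobuf mutable_*() accessor: marks the field as present.
  PbDateStatistics* mutableDateStatistics() {
    hasDateStatistics = true;
    return &dateStatistics;
  }
};

// Hypothetical stand-in for the writer's internal date statistics.
struct DateStats {
  bool hasNull = false;
  uint64_t numberOfValues = 0;
  bool hasMinimum = false;  // false when no non-null value was seen
  int32_t minimum = 0;
  int32_t maximum = 0;
};

void toProtoBuf(const DateStats& stats, PbColumnStatistics& pbStats) {
  pbStats.hasNull = stats.hasNull;
  pbStats.numberOfValues = stats.numberOfValues;

  // Create the type-specific sub-message unconditionally so the
  // serialized statistics always carry their type.
  PbDateStatistics* dateStatistics = pbStats.mutableDateStatistics();
  if (stats.hasMinimum) {
    dateStatistics->hasMinimum = true;
    dateStatistics->hasMaximum = true;
    dateStatistics->minimum = stats.minimum;
    dateStatistics->maximum = stats.maximum;
  } else {
    // No non-null data: keep the sub-message, but leave min/max unset.
    dateStatistics->hasMinimum = false;
    dateStatistics->hasMaximum = false;
  }
}
```

The design choice here is that an empty type-specific sub-message is treated as meaningful: it signals the column type even when the column held only nulls.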
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)