Gang Wu created ORC-415:
---------------------------

             Summary: [C++] Fix writing ColumnStatistics
                 Key: ORC-415
                 URL: https://issues.apache.org/jira/browse/ORC-415
             Project: ORC
          Issue Type: Bug
          Components: C++
            Reporter: Gang Wu
            Assignee: Gang Wu


The current C++ ORC writer implementation has two issues with column statistics.

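1. A new batch may overwrite the previous batch's has_null info in 
colIndexStatistics if the new batch has no nulls but the previous batch has at 
least one null value. The per-batch hasNull flag is passed to setHasNull() 
unconditionally, so a batch without nulls resets it to false: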
{code:java}
bool hasNull = false;
if (!structBatch->hasNulls) {
  colIndexStatistics->increase(numValues);
} else {
  const char* notNull = structBatch->notNull.data() + offset;
  for (uint64_t i = 0; i < numValues; ++i) {
    if (notNull[i]) {
      colIndexStatistics->increase(1);
    } else if (!hasNull) {
      hasNull = true;
    }
  }
}
colIndexStatistics->setHasNull(hasNull);
{code}
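
One possible way to avoid the overwrite, shown as a minimal self-contained sketch rather than the actual patch: only raise the flag when a null is seen, and never clear it. ColumnStats, Batch and addBatch below are hypothetical stand-ins for the writer's real classes, not ORC APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-column statistics object.
struct ColumnStats {
  uint64_t valueCount = 0;
  bool hasNull = false;
  void increase(uint64_t n) { valueCount += n; }
  void setHasNull(bool v) { hasNull = v; }
};

// Hypothetical stand-in for a column vector batch.
struct Batch {
  bool hasNulls = false;
  std::vector<char> notNull;  // 1 = value present, 0 = null
};

// Accumulates one batch into stats. The hasNull flag is only ever
// raised here, so a later batch without nulls cannot reset it.
void addBatch(ColumnStats& stats, const Batch& batch,
              uint64_t offset, uint64_t numValues) {
  if (!batch.hasNulls) {
    stats.increase(numValues);
    return;
  }
  const char* notNull = batch.notNull.data() + offset;
  bool sawNull = false;
  for (uint64_t i = 0; i < numValues; ++i) {
    if (notNull[i]) {
      stats.increase(1);
    } else {
      sawNull = true;
    }
  }
  if (sawNull) {
    stats.setHasNull(true);  // never called with false
  }
}
```

With this shape, feeding a batch that contains a null followed by a batch that contains none leaves hasNull set to true.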
2. If a ColumnStatistics has no non-null data, it has no sum/min/max 
information, so the protobuf serialization writes a generic ColumnStatistics 
instead of a type-specific one. As a result, the reader has a hard time 
deserializing the ColumnStatistics correctly.
{code:java}
void toProtoBuf(proto::ColumnStatistics& pbStats) const override {
  pbStats.set_hasnull(_stats.hasNull());
  pbStats.set_numberofvalues(_stats.getNumberOfValues());
  if (_stats.hasMinimum()) {
    proto::DateStatistics* dateStatistics = pbStats.mutable_datestatistics();
    dateStatistics->set_maximum(_stats.getMaximum());
    dateStatistics->set_minimum(_stats.getMinimum());
  }
}
{code}
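
A possible direction for the second issue, again only a sketch under assumptions: always emit the type-specific sub-message and fill in min/max only when they exist. The structs below are hypothetical plain-C++ stand-ins for proto::ColumnStatistics and proto::DateStatistics, so that the example compiles without protobuf; in the real writer this would roughly correspond to calling pbStats.mutable_datestatistics() outside the hasMinimum() check.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for proto::DateStatistics.
struct DateStatsMsg {
  bool hasMinimum = false;
  int32_t minimum = 0;
  bool hasMaximum = false;
  int32_t maximum = 0;
};

// Hypothetical stand-in for proto::ColumnStatistics.
struct ColumnStatsMsg {
  bool hasNull = false;
  uint64_t numberOfValues = 0;
  bool hasDateStatistics = false;  // presence of the typed sub-message
  DateStatsMsg dateStatistics;
};

// Hypothetical in-memory date statistics kept by the writer.
struct DateStats {
  bool hasNull = false;
  uint64_t numberOfValues = 0;
  bool hasMinimum = false;
  int32_t minimum = 0;
  int32_t maximum = 0;
};

void toMsg(const DateStats& stats, ColumnStatsMsg& out) {
  out.hasNull = stats.hasNull;
  out.numberOfValues = stats.numberOfValues;
  // Always mark the typed sub-message as present, even with no values,
  // so a reader can still tell which kind of statistics this is.
  out.hasDateStatistics = true;
  if (stats.hasMinimum) {
    out.dateStatistics.hasMinimum = true;
    out.dateStatistics.minimum = stats.minimum;
    out.dateStatistics.hasMaximum = true;
    out.dateStatistics.maximum = stats.maximum;
  }
}
```

An all-null column then serializes with the date sub-message present but without minimum or maximum, instead of falling back to the generic form.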
 

The scope of this Jira is to fix these two problems.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
