Gang Wu created ORC-415:
---------------------------
Summary: [C++] Fix writing ColumnStatistics
Key: ORC-415
URL: https://issues.apache.org/jira/browse/ORC-415
Project: ORC
Issue Type: Bug
Components: C++
Reporter: Gang Wu
Assignee: Gang Wu
The current C++ ORC writer implementation has two issues with column statistics.
1. A new batch may overwrite the previous batch's has_null info in
colIndexStatistics if the new batch has no nulls but the previous batch has at
least one null value.
{code:java}
bool hasNull = false;
if (!structBatch->hasNulls) {
  colIndexStatistics->increase(numValues);
} else {
  const char* notNull = structBatch->notNull.data() + offset;
  for (uint64_t i = 0; i < numValues; ++i) {
    if (notNull[i]) {
      colIndexStatistics->increase(1);
    } else if (!hasNull) {
      hasNull = true;
    }
  }
}
colIndexStatistics->setHasNull(hasNull);
{code}
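One way to avoid the overwrite is to only ever raise the flag, never clear it. The sketch below illustrates that idea with simplified types; IndexStats, Batch and addBatch are hypothetical names made up for this example and are not the ORC writer's actual classes, and the real fix may look different.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the writer's per-column index statistics.
struct IndexStats {
  uint64_t numValues = 0;
  bool hasNull = false;

  void increase(uint64_t n) { numValues += n; }
  void setHasNull(bool v) { hasNull = v; }
};

// Hypothetical stand-in for a column batch.
// notNull[i] != 0 means row i holds a value.
struct Batch {
  bool hasNulls = false;
  std::vector<char> notNull;
};

// Adds one batch to the statistics. hasNull is only ever set to true,
// so a later batch without nulls cannot erase an earlier batch's nulls.
void addBatch(IndexStats& stats, const Batch& batch,
              uint64_t offset, uint64_t numValues) {
  if (!batch.hasNulls) {
    // No nulls in this batch: count every row, leave hasNull untouched.
    stats.increase(numValues);
    return;
  }
  assert(offset + numValues <= batch.notNull.size());
  const char* notNull = batch.notNull.data() + offset;
  bool sawNull = false;
  for (uint64_t i = 0; i < numValues; ++i) {
    if (notNull[i]) {
      stats.increase(1);
    } else {
      sawNull = true;
    }
  }
  if (sawNull) {
    stats.setHasNull(true);
  }
}
```

With this shape, adding a batch that contains a null followed by a batch without nulls leaves hasNull set to true.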
2. If a ColumnStatistics has no non-null data, it has no sum/min/max info,
so the protobuf serialization writes a generic ColumnStatistics instead of a
type-specific one. As a result, the reader has a hard time deserializing the
ColumnStatistics correctly.
{code:java}
void toProtoBuf(proto::ColumnStatistics& pbStats) const override {
  pbStats.set_hasnull(_stats.hasNull());
  pbStats.set_numberofvalues(_stats.getNumberOfValues());
  if (_stats.hasMinimum()) {
    proto::DateStatistics* dateStatistics = pbStats.mutable_datestatistics();
    dateStatistics->set_maximum(_stats.getMaximum());
    dateStatistics->set_minimum(_stats.getMinimum());
  }
}
{code}
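A possible direction for the second problem is to always create the type-specific sub-message and fill in min/max only when they exist. The sketch below models this with plain structs rather than the generated protobuf classes; ColumnStatisticsMsg, DateStatisticsMsg, DateStats and toMessage are hypothetical names for illustration only, and the actual fix may differ.

```cpp
#include <cstdint>

// Hypothetical plain-struct stand-in for the type-specific sub-message.
struct DateStatisticsMsg {
  bool hasMinimum = false;
  bool hasMaximum = false;
  int32_t minimum = 0;
  int32_t maximum = 0;
};

// Hypothetical plain-struct stand-in for the serialized column statistics.
struct ColumnStatisticsMsg {
  bool hasNull = false;
  uint64_t numberOfValues = 0;
  bool hasDateStatistics = false;  // presence of the type-specific part
  DateStatisticsMsg dateStatistics;

  // Marks the sub-message as present, as a protobuf mutable accessor would.
  DateStatisticsMsg* mutableDateStatistics() {
    hasDateStatistics = true;
    return &dateStatistics;
  }
};

// Hypothetical in-memory date statistics kept by the writer.
struct DateStats {
  bool hasNull = false;
  uint64_t numberOfValues = 0;
  bool hasMinMax = false;  // false when no non-null value was seen
  int32_t minimum = 0;
  int32_t maximum = 0;
};

void toMessage(const DateStats& stats, ColumnStatisticsMsg& msg) {
  msg.hasNull = stats.hasNull;
  msg.numberOfValues = stats.numberOfValues;
  // Always create the type-specific part, even with no min/max,
  // so a reader can still recognize these as date statistics.
  DateStatisticsMsg* date = msg.mutableDateStatistics();
  if (stats.hasMinMax) {
    date->minimum = stats.minimum;
    date->hasMinimum = true;
    date->maximum = stats.maximum;
    date->hasMaximum = true;
  }
}
```

In this sketch, an all-null column still produces a message whose date-specific part is present, with minimum and maximum left unset.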
The scope of this Jira is to fix these two problems.