[
https://issues.apache.org/jira/browse/ORC-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650804#comment-16650804
]
ASF GitHub Bot commented on ORC-415:
------------------------------------
wgtmac commented on a change in pull request #319: ORC-415: [C++] Fix writing
ColumnStatistics
URL: https://github.com/apache/orc/pull/319#discussion_r225318648
##########
File path: c++/src/ColumnWriter.cc
##########
@@ -280,20 +280,20 @@ namespace orc {
}
// update stats
- bool hasNull = false;
if (!structBatch->hasNulls) {
colIndexStatistics->increase(numValues);
} else {
+ bool hasNull = false;
const char* notNull = structBatch->notNull.data() + offset;
for (uint64_t i = 0; i < numValues; ++i) {
if (notNull[i]) {
colIndexStatistics->increase(1);
} else if (!hasNull) {
hasNull = true;
+ colIndexStatistics->setHasNull(true);
Review comment:
I agree. Now I have moved them out of the for-loop. Please check again.
Thanks!
> [C++] Fix writing ColumnStatistics
> ----------------------------------
>
> Key: ORC-415
> URL: https://issues.apache.org/jira/browse/ORC-415
> Project: ORC
> Issue Type: Bug
> Components: C++
> Reporter: Gang Wu
> Assignee: Gang Wu
> Priority: Major
>
> The current C++ ORC writer implementation has two issues with column statistics.
> 1. A new batch may overwrite the has_null info that a previous batch recorded
> in colIndexStatistics: if the new batch has no nulls but a previous batch had
> at least one null value, has_null is reset to false.
> {code:java}
> bool hasNull = false;
> if (!structBatch->hasNulls) {
>   colIndexStatistics->increase(numValues);
> } else {
>   const char* notNull = structBatch->notNull.data() + offset;
>   for (uint64_t i = 0; i < numValues; ++i) {
>     if (notNull[i]) {
>       colIndexStatistics->increase(1);
>     } else if (!hasNull) {
>       hasNull = true;
>     }
>   }
> }
> colIndexStatistics->setHasNull(hasNull);{code}
> 2. If a ColumnStatistics has no non-null data, it has no sum/min/max info,
> so the protobuf serialization writes a generic ColumnStatistics instead of a
> type-specific one. The problem is that the reader then has a hard time
> deserializing the ColumnStatistics correctly.
> {code:java}
> void toProtoBuf(proto::ColumnStatistics& pbStats) const override {
>   pbStats.set_hasnull(_stats.hasNull());
>   pbStats.set_numberofvalues(_stats.getNumberOfValues());
>   if (_stats.hasMinimum()) {
>     proto::DateStatistics* dateStatistics = pbStats.mutable_datestatistics();
>     dateStatistics->set_maximum(_stats.getMaximum());
>     dateStatistics->set_minimum(_stats.getMinimum());
>   }
> }
> {code}
>
> The scope of this Jira is to fix these two problems.
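
For illustration only, here is a minimal sketch of one way the two problems described above could be addressed. It is not the code from the pull request. All of the `Sketch*` types, their fields, and `runSketch` are hypothetical stand-ins for the ORC writer statistics and the generated protobuf classes, so that the example compiles on its own. The two assumptions are that has_null should only ever be switched on, and that the type-specific sub-message should be created even when there is no minimum/maximum to store.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins; none of these are the real ORC or protobuf types.
struct SketchDateStats {            // plays the role of proto::DateStatistics
  bool hasMinimum = false;
  bool hasMaximum = false;
  int32_t minimum = 0;
  int32_t maximum = 0;
};

struct SketchPbStats {              // plays the role of proto::ColumnStatistics
  bool hasNull = false;
  uint64_t numberOfValues = 0;
  bool hasDateStatistics = false;   // true once the type-specific part exists
  SketchDateStats dateStatistics;
  SketchDateStats* mutableDateStatistics() {
    hasDateStatistics = true;
    return &dateStatistics;
  }
};

struct SketchColumnStats {          // plays the role of the writer-side stats
  uint64_t numValues = 0;
  bool hasNull = false;
  bool hasMinimum = false;
  int32_t minimum = 0;
  int32_t maximum = 0;
};

// Issue 1: count the non-null values and only ever switch has_null on.
// A null notNull pointer models a batch whose hasNulls flag is false.
void updateStats(SketchColumnStats& stats, const char* notNull,
                 uint64_t numValues) {
  if (notNull == nullptr) {
    stats.numValues += numValues;
    return;
  }
  uint64_t count = 0;
  for (uint64_t i = 0; i < numValues; ++i) {
    if (notNull[i]) {
      ++count;
    }
  }
  stats.numValues += count;
  if (count < numValues) {
    stats.hasNull = true;  // never reset to false, so earlier nulls survive
  }
}

// Issue 2: always create the type-specific part, even when it stays empty,
// so a reader can still tell which statistics type was written.
void toProtoBuf(const SketchColumnStats& stats, SketchPbStats& pb) {
  pb.hasNull = stats.hasNull;
  pb.numberOfValues = stats.numValues;
  SketchDateStats* date = pb.mutableDateStatistics();
  if (stats.hasMinimum) {
    date->hasMinimum = true;
    date->hasMaximum = true;
    date->minimum = stats.minimum;
    date->maximum = stats.maximum;
  }
}

// Small self-check: a batch with a null followed by a batch without nulls.
void runSketch() {
  SketchColumnStats stats;
  const char firstBatch[] = {1, 0, 1};
  updateStats(stats, firstBatch, 3);  // two values, one null
  updateStats(stats, nullptr, 2);     // no nulls in this batch
  assert(stats.hasNull);              // the earlier null is still recorded
  assert(stats.numValues == 4);

  SketchPbStats pb;
  toProtoBuf(stats, pb);
  assert(pb.hasDateStatistics);       // type-specific part is present
  assert(!pb.dateStatistics.hasMinimum);
}
```

Calling `runSketch()` exercises both cases: the second batch has no nulls, yet has_null stays true, and the serialized stand-in carries an empty date-statistics part although no minimum or maximum was ever recorded.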
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)