[
https://issues.apache.org/jira/browse/ORC-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun resolved ORC-1151.
--------------------------------
Fix Version/s: 1.7.5
Resolution: Fixed
This is resolved via https://github.com/apache/orc/pull/1088
> [C++] Incorrect statistics for Timestamp column with non UTC writer time zones
> ------------------------------------------------------------------------------
>
> Key: ORC-1151
> URL: https://issues.apache.org/jira/browse/ORC-1151
> Project: ORC
> Issue Type: Bug
> Components: C++
> Affects Versions: 1.8.0, 1.7.4
> Reporter: noirello
> Assignee: noirello
> Priority: Major
> Fix For: 1.7.5
>
>
> When the writer time zone is not UTC, the statistics for timestamp columns
> are incorrect.
> Minimal example to reproduce:
> {code:java}
> #include <memory>
> #include "orc/OrcFile.hh"
>
> int main() {
>   std::unique_ptr<orc::Type> type(
>       orc::Type::buildTypeFromString("struct<x:int,y:timestamp>"));
>   std::unique_ptr<orc::OutputStream> outStream =
>       orc::writeLocalFile("./test.orc");
>
>   // Use a non-UTC writer time zone to trigger the bug.
>   orc::WriterOptions options;
>   options.setTimezoneName("Asia/Shanghai");
>   std::unique_ptr<orc::Writer> writer =
>       orc::createWriter(*type, outStream.get(), options);
>
>   std::unique_ptr<orc::ColumnVectorBatch> batch = writer->createRowBatch(1);
>   auto *root = dynamic_cast<orc::StructVectorBatch *>(batch.get());
>   auto *x = dynamic_cast<orc::LongVectorBatch *>(root->fields[0]);
>   auto *y = dynamic_cast<orc::TimestampVectorBatch *>(root->fields[1]);
>
>   // Write a single row: x = 1, y = 2022-04-16T18:32:43.321+00:00.
>   x->data[0] = 1;
>   y->data[0] = 1650133963;        // seconds since the epoch
>   y->nanoseconds[0] = 321000000;  // fractional second, in nanoseconds
>   x->numElements = 1;
>   y->numElements = 1;
>   root->numElements = 1;
>
>   writer->add(*batch);
>   writer->close();
>   return 0;
> }
> {code}
> Statistics:
> {code:java}
> # bin/orc-statistics test.orc
> File test.orc has 3 columns
> *** Column 0 ***
> Column has 1 values and has null value: no
> *** Column 1 ***
> Data type: Integer
> Values: 1
> Has null: no
> Minimum: 1
> Maximum: 1
> Sum: 1
> *** Column 2 ***
> Data type: Timestamp
> Values: 1
> Has null: no
> Minimum: 2022-04-16 18:33:12.121
> LowerBound: 2022-04-16 18:33:12.121
> Maximum: 2022-04-16 18:33:12.121
> UpperBound: 2022-04-16 18:33:12.122
> File test.orc has 1 stripes
> *** Stripe 0 ***
> --- Column 0 ---
> Column has 1 values and has null value: no
> --- Column 1 ---
> Data type: Integer
> Values: 1
> Has null: no
> Minimum: 1
> Maximum: 1
> Sum: 1
> --- Column 2 ---
> Data type: Timestamp
> Values: 1
> Has null: no
> Minimum: 2022-04-16 18:33:12.121
> LowerBound: 2022-04-16 18:33:12.121
> Maximum: 2022-04-16 18:33:12.121
> UpperBound: 2022-04-16 18:33:12.122{code}
> Content:
> {code:java}
> # bin/orc-contents test.orc
> {"x": 1, "y": "2022-04-17 02:32:43.321"}{code}
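
As a rough cross-check of the figures quoted above, the sketch below redoes the arithmetic: it converts the value written by the reproduction program to milliseconds and compares it with the time of day shown by orc-statistics and orc-contents. The helper names (toMillis, millisOfDay, wallClockMillis) are made up for illustration and are not ORC APIs, and the UTC+8 offset for Asia/Shanghai is assumed to be fixed for this date. The reported minimum turns out to be exactly 28.8 s later than the written value, and 28800 is also the UTC+8 offset in seconds. That may hint at a seconds/milliseconds mix-up, but the report itself does not state the cause.

```cpp
#include <cstdint>

// Values taken from the reproduction program in the report.
constexpr int64_t kSeconds = 1650133963;  // 2022-04-16 18:32:43 UTC
constexpr int64_t kNanos = 321000000;     // fractional second (.321)

// Assumption: Asia/Shanghai is a fixed UTC+8 offset on this date.
constexpr int64_t kShanghaiOffsetSeconds = 8 * 3600;

constexpr int64_t kMillisPerDay = 24LL * 3600 * 1000;

// Combine epoch seconds and nanoseconds into epoch milliseconds.
constexpr int64_t toMillis(int64_t seconds, int64_t nanos) {
  return seconds * 1000 + nanos / 1000000;
}

// Milliseconds elapsed since midnight for a non-negative epoch value.
constexpr int64_t millisOfDay(int64_t epochMillis) {
  return epochMillis % kMillisPerDay;
}

// Milliseconds since midnight for a wall-clock time hh:mm:ss.mmm.
constexpr int64_t wallClockMillis(int h, int m, int s, int ms) {
  return ((h * 60LL + m) * 60 + s) * 1000 + ms;
}

// Time of day the UTC statistics would be expected to show.
constexpr int64_t expectedStatMillis() {
  return millisOfDay(toMillis(kSeconds, kNanos));
}

// Time of day orc-statistics printed for Minimum/Maximum (18:33:12.121).
constexpr int64_t reportedStatMillis() {
  return wallClockMillis(18, 33, 12, 121);
}

// Time of day in the writer's zone, as orc-contents printed it.
constexpr int64_t localContentMillis() {
  return millisOfDay(toMillis(kSeconds + kShanghaiOffsetSeconds, kNanos));
}

// The written value corresponds to 18:32:43.321 UTC.
static_assert(expectedStatMillis() == wallClockMillis(18, 32, 43, 321),
              "expected UTC statistic");

// orc-contents shows the same instant in local time: 02:32:43.321.
static_assert(localContentMillis() == wallClockMillis(2, 32, 43, 321),
              "local content value");

// The statistic is off by 28800 ms, numerically the offset in seconds.
static_assert(reportedStatMillis() - expectedStatMillis() ==
                  kShanghaiOffsetSeconds,
              "discrepancy equals the offset value");
```

The checks are static_assert declarations, so compiling the file (C++14 or later) is enough to confirm the arithmetic. Nothing here reads the ORC file itself.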