[ 
https://issues.apache.org/jira/browse/ORC-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432910#comment-16432910
 ] 

Gang Wu commented on ORC-322:
-----------------------------

Thanks [~leftylev]! I will work on it.

> c++ writer should not adjust gmtOffset when writing timestamps
> --------------------------------------------------------------
>
>                 Key: ORC-322
>                 URL: https://issues.apache.org/jira/browse/ORC-322
>             Project: ORC
>          Issue Type: Bug
>          Components: C++
>            Reporter: Quanlong Huang
>            Assignee: Gang Wu
>            Priority: Major
>             Fix For: 1.5.0
>
>
> The c++ TimestampColumnWriter will adjust timestamp with gmtOffset:
> {code:java}
>   void TimestampColumnWriter::add(ColumnVectorBatch& rowBatch,
>                                   uint64_t offset,
>                                   uint64_t numValues) {
>     ......
>     int64_t *secs = tsBatch->data.data() + offset;
>     int64_t *nanos = tsBatch->nanoseconds.data() + offset;
>     ......
>     bool hasNull = false;
>     for (uint64_t i = 0; i < numValues; ++i) {
>       if (notNull == nullptr || notNull[i]) {
>         // TimestampVectorBatch already stores data in UTC
>         int64_t millsUTC = secs[i] * 1000 + nanos[i] / 1000000;
>         tsStats->increase(1);
>         tsStats->update(millsUTC);
>         secs[i] -= timezone.getVariant(secs[i]).gmtOffset;    <-- should not 
> adjust with gmtOffset here
>         secs[i] -= timezone.getEpoch();
>         nanos[i] = formatNano(nanos[i]);
>       } else if (!hasNull) {
>         hasNull = true;
>       }
>     }
>     tsStats->setHasNull(hasNull);
>     secRleEncoder->add(secs, numValues, notNull);
>     nanoRleEncoder->add(nanos, numValues, notNull);
>   }
> {code}
> The java reader doesn't adjust this:
> {code:java}
>   public void writeBatch(ColumnVector vector, int offset,
>                          int length) throws IOException {
>     super.writeBatch(vector, offset, length);
>     TimestampColumnVector vec = (TimestampColumnVector) vector;
>     if (vector.isRepeating) {
>       ......
>     } else {
>       for (int i = 0; i < length; ++i) {
>         if (vec.noNulls || !vec.isNull[i + offset]) {
>           // ignore the bottom three digits from the vec.time field
>           final long secs = vec.time[i + offset] / MILLIS_PER_SECOND;
>           final int newNanos = vec.nanos[i + offset];
>           // set the millis based on the top three digits of the nanos
>           long millis = secs * MILLIS_PER_SECOND + newNanos / 1_000_000;
>           if (millis < 0 && newNanos > 999_999) {
>             millis -= MILLIS_PER_SECOND;
>           }
>           long utc = SerializationUtils.convertToUtc(localTimezone, millis);
>           seconds.write(secs - baseEpochSecsLocalTz);   <-- only adjust with 
> ORC epoch
>           nanos.write(formatNanos(newNanos));
>           indexStatistics.updateTimestamp(utc);
>           if (createBloomFilter) {
>             if (bloomFilter != null) {
>               bloomFilter.addLong(millis);
>             }
>             bloomFilterUtf8.addLong(utc);
>           }
>         }
>       }
>     }
>   }
> {code}
> This is a follow-up of ORC-320. I think there's a wrong assumption in c++ 
> codes that timestamps given to the writer's TimestampVectorBatch are equal to 
> timestamps got from the reader's TimestampVectorBatch (see codes in 
> TestWriter::writeTimestamp). Noticed that there're gmtOffset processing 
> logics in the c++ TimestampColumnPrinter.
> To reveal the bug:
> {code:bash}
> $ cat /etc/timezone 
> America/Los_Angeles
> $ cat ts.csv    # prepare a csv file with two timestamps
> 1520762399
> 1520762400
> $ date -d @1520762399
> Sun Mar 11 01:59:59 PST 2018
> $ date -d @1520762400
> Sun Mar 11 03:00:00 PDT 2018
> $ build/tools/src/csv-import "struct<col0:timestamp>" ts.csv ts.orc        # 
> generate orc file by c++ writer
> [2018-03-14 21:41:58] Start importing Orc file...
> [2018-03-14 21:41:58] Finish importing Orc file.
> [2018-03-14 21:41:58] Total writer elasped time: 0.000475s.
> [2018-03-14 21:41:58] Total writer CPU time: 0.000474s.
> $ build/tools/src/orc-contents ts.orc         # read by c++ reader, results 
> are not we written
> {"col0": "2018-03-11 10:59:59.0"}
> {"col0": "2018-03-11 10:00:00.0"}
> $ java -jar java/tools/target/orc-tools-1.5.0-SNAPSHOT-uber.jar data ts.orc   
>  # read by java reader, results are the same as c++ reader
> Processing data file ts.orc [length: 257]
> {"col0":"2018-03-11 10:59:59.0"}
> {"col0":"2018-03-11 10:00:00.0"}
> {code}
> Since there're no tools importing timestamps as long, we can use java codes 
> to generate a correct orc file:
> {code:java}
> public class TimestampWriter {
>   public static void main(String[] args) throws IOException {
>     TypeDescription schema = 
> TypeDescription.fromString("struct<col0:timestamp>");
>     OrcFile.WriterOptions options = OrcFile.writerOptions(new 
> Configuration()).setSchema(schema);
>     Writer writer = OrcFile.createWriter(new Path("./ts_java.orc"), options);
>     VectorizedRowBatch batch = schema.createRowBatch();
>     TimestampColumnVector tsc = (TimestampColumnVector) batch.cols[0];
>     tsc.time[0] = 1520762399000L;  tsc.nanos[0] = 0;    // time array is 
> milliseconds
>     tsc.time[1] = 1520762400000L;  tsc.nanos[1] = 0;
>     batch.size = 2;
>     writer.addRowBatch(batch);
>     writer.close();
>   }
> }
> {code}
> Run the codes to generate `ts_java.orc` and verify this orc file:
> {code:bash}
> $ build/tools/src/orc-contents ts_java.orc                                    
>                                                                               
>                                                                           
> {"col0": "2018-03-11 01:59:59.0"}    # The correct results
> {"col0": "2018-03-11 03:00:00.0"}
> $ java -jar java/tools/target/orc-tools-1.5.0-SNAPSHOT-uber.jar data 
> ts_java.orc 
> Processing data file ts_java.orc [length: 251]
> {"col0":"2018-03-11 01:59:59.0"}    # The correct results
> {"col0":"2018-03-11 03:00:00.0"}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to