[GitHub] spark pull request: [SPARK-4987] [SQL] parquet timestamp type supp...

felixcheung Sun, 18 Jan 2015 22:13:19 -0800

Github user felixcheung commented on the pull request:

    https://github.com/apache/spark/pull/3820#issuecomment-70450284
  
    I've tested this PR, but the results look wrong.
    The Parquet file was generated from Hive, with the timestamp values set by
    from_utc_timestamp('1970-01-01 08:00:00','PST').
    
    What I see with this PR:
    scala> t.take(10).foreach(println(_))
    ...
    15/01/18 22:06:41 INFO NewHadoopRDD: Input split: ParquetInputSplit{part: file:/users/x/parquetwithtimestamp start: 0 end: 25448 length: 25448 hosts: [] requestedSchema: message root {
      optional binary code (UTF8);
      optional binary description (UTF8);
      optional int32 total_emp;
      optional int32 salary;
      optional int96 timestamp;
    }
     readSupportMetadata: 
{org.apache.spark.sql.parquet.row.metadata={"type":"struct","fields":[{"name":"code","type":"string","nullable":true,"metadata":{}},{"name":"description","type":"string","nullable":true,"metadata":{}},{"name":"total_emp","type":"integer","nullable":true,"metadata":{}},{"name":"salary","type":"integer","nullable":true,"metadata":{}},{"name":"timestamp","type":"timestamp","nullable":true,"metadata":{}}]},
 
org.apache.spark.sql.parquet.row.requested_schema={"type":"struct","fields":[{"name":"code","type":"string","nullable":true,"metadata":{}},{"name":"description","type":"string","nullable":true,"metadata":{}},{"name":"total_emp","type":"integer","nullable":true,"metadata":{}},{"name":"salary","type":"integer","nullable":true,"metadata":{}},{"name":"timestamp","type":"timestamp","nullable":true,"metadata":{}}]}}}
    15/01/18 22:06:41 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
    15/01/18 22:06:41 INFO InternalParquetRecordReader: RecordReader initialized will read a total of 823 records.
    15/01/18 22:06:41 INFO InternalParquetRecordReader: at row 0. reading next block
    15/01/18 22:06:41 INFO CodecPool: Got brand-new decompressor [.snappy]
    15/01/18 22:06:41 INFO InternalParquetRecordReader: block read in memory in 27 ms. row count = 823
    [00-0000,All Occupations,134354250,40690,1974-01-07 17:58:00.000008896]
    [11-0000,Management occupations,6003930,96150,1974-01-07 17:58:00.000008896]
    
    Expected: 1970-01-01 08:00:00
    
    Actual: 1974-01-07 17:58:00.000008896
    
    Any idea what might be causing this?
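
One way to cross-check a value like the one above is to decode the raw 12-byte int96 field by hand and compare it with what the reader returns. The sketch below assumes the layout commonly written by Hive and Impala (8 bytes of nanoseconds within the day, followed by a 4-byte Julian day number, both little-endian); whether the file in this report uses that layout, and how the PR itself decodes the field, are not confirmed here. The object name `Int96TimestampSketch` and its `decode`/`encode` helpers are hypothetical, and the code uses `java.time`, so it needs Java 8 or later.

```scala
import java.nio.{ByteBuffer, ByteOrder}
import java.time.{LocalDate, LocalDateTime, LocalTime}

// Hypothetical helper, not part of Spark or of this PR.
object Int96TimestampSketch {
  // Julian day number of 1970-01-01 (the Unix epoch).
  val JulianDayOfEpoch: Long = 2440588L

  // Decode a 12-byte INT96 value, ASSUMING the Hive/Impala layout:
  // bytes 0-7 = nanoseconds within the day, bytes 8-11 = Julian day,
  // both little-endian. No time zone adjustment is applied.
  def decode(bytes: Array[Byte]): LocalDateTime = {
    require(bytes.length == 12, "INT96 values are exactly 12 bytes")
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val nanosOfDay = buf.getLong()
    val julianDay = buf.getInt()
    val date = LocalDate.ofEpochDay(julianDay - JulianDayOfEpoch)
    LocalDateTime.of(date, LocalTime.ofNanoOfDay(nanosOfDay))
  }

  // Inverse of decode, used here only to build a sample value.
  def encode(ts: LocalDateTime): Array[Byte] = {
    val buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
    buf.putLong(ts.toLocalTime.toNanoOfDay)
    buf.putInt((ts.toLocalDate.toEpochDay + JulianDayOfEpoch).toInt)
    buf.array()
  }

  def main(args: Array[String]): Unit = {
    // Round-trip the timestamp the report expected to see.
    val expected = LocalDateTime.of(1970, 1, 1, 8, 0, 0)
    val roundTripped = decode(encode(expected))
    assert(roundTripped == expected)
    println(roundTripped)
  }
}
```

If the bytes are read in a different order, or the two parts are swapped, the decoded date and time will come out wrong, so comparing a hand-decoded value against the reader's output may help narrow down where a discrepancy is introduced.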




