Mark Litwintschik created ARROW-6815:
----------------------------------------
             Summary: Timestamps saved via Pandas and PyArrow unreadable in Hive and Presto
                 Key: ARROW-6815
                 URL: https://issues.apache.org/jira/browse/ARROW-6815
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.13.0
            Reporter: Mark Litwintschik


I'm unable to read timestamps saved as Parquet data via Pandas and PyArrow with either Hive or Presto. These are the versions of the various pieces of software I'm using:

* Pandas 0.24.1
* PyArrow 0.13.0
* Hadoop 3.0.3
* Hive 2.3.3
* Presto 0.215

The Hadoop setup steps I took can be found [here|https://tech.marksblogg.com/hadoop-3-single-node-install-guide.html].

This is what I did to generate the Parquet file:

{code:python}
from datetime import datetime
from io import BytesIO  # Parquet is binary data, so use a bytes buffer

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame([{'a': 'Test', 'b': datetime.utcnow()}])

# Convert the DataFrame into an Arrow table and write it out as Parquet.
parquet_buffer = BytesIO()
pa_table = pa.Table.from_pandas(df, preserve_index=False)

writer = pq.ParquetWriter(parquet_buffer, pa_table.schema)
writer.write_table(pa_table)
writer.close()

with open('example.pq', 'w+b') as f:
    f.write(parquet_buffer.getvalue())
{code}
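For what it's worth, PyArrow itself reads the file back without complaint. A quick sanity check (a minimal sketch; the commented output is what I expect to see, with the datetime64[ns] column coerced to microseconds on write):

{code:python}
import pyarrow.parquet as pq

# Read the file back with PyArrow to confirm it round-trips in Python.
table = pq.read_table('example.pq')
print(table.schema)     # expecting: a: binary, b: timestamp[us]
print(table.to_pandas())
{code}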
This is its raw contents:

{code:bash}
$ hexdump -C example.pq
{code}

{code:none}
00000000  50 41 52 31 15 04 15 10  15 14 4c 15 02 15 04 12  |PAR1......L.....|
00000010  00 00 08 1c 04 00 00 00  54 65 73 74 15 00 15 12  |........Test....|
00000020  15 16 2c 15 02 15 04 15  06 15 06 1c 18 04 54 65  |..,...........Te|
00000030  73 74 18 04 54 65 73 74  16 00 00 00 00 09 20 02  |st..Test...... .|
00000040  00 00 00 02 01 01 02 00  26 90 01 1c 15 0c 19 35  |........&......5|
00000050  04 00 06 19 18 01 61 15  02 16 02 16 80 01 16 88  |......a.........|
00000060  01 26 38 26 08 1c 36 00  28 04 54 65 73 74 18 04  |.&8&..6.(.Test..|
00000070  54 65 73 74 00 00 00 15  04 15 10 15 14 4c 15 02  |Test.........L..|
00000080  15 04 12 00 00 08 1c 10  4f 48 96 63 94 05 00 15  |........OH.c....|
00000090  00 15 12 15 16 2c 15 02  15 04 15 06 15 06 1c 18  |.....,..........|
000000a0  08 10 4f 48 96 63 94 05  00 18 08 10 4f 48 96 63  |..OH.c......OH.c|
000000b0  94 05 00 16 00 00 00 00  09 20 02 00 00 00 02 01  |......... ......|
000000c0  01 02 00 26 86 03 1c 15  04 19 35 04 00 06 19 18  |...&......5.....|
000000d0  01 62 15 02 16 02 16 90  01 16 98 01 26 9e 02 26  |.b..........&..&|
000000e0  ee 01 1c 18 08 10 4f 48  96 63 94 05 00 18 08 10  |......OH.c......|
000000f0  4f 48 96 63 94 05 00 16  00 28 08 10 4f 48 96 63  |OH.c.....(..OH.c|
00000100  94 05 00 18 08 10 4f 48  96 63 94 05 00 00 00 00  |......OH.c......|
00000110  15 02 19 3c 35 00 18 06  73 63 68 65 6d 61 15 04  |...<5...schema..|
00000120  00 15 0c 25 02 18 01 61  00 15 04 25 02 18 01 62  |...%...a...%...b|
00000130  25 14 00 16 02 19 1c 19  2c 26 90 01 1c 15 0c 19  |%.......,&......|
00000140  35 04 00 06 19 18 01 61  15 02 16 02 16 80 01 16  |5......a........|
00000150  88 01 26 38 26 08 1c 36  00 28 04 54 65 73 74 18  |..&8&..6.(.Test.|
00000160  04 54 65 73 74 00 00 00  26 86 03 1c 15 04 19 35  |.Test...&......5|
00000170  04 00 06 19 18 01 62 15  02 16 02 16 90 01 16 98  |......b.........|
00000180  01 26 9e 02 26 ee 01 1c  18 08 10 4f 48 96 63 94  |.&..&......OH.c.|
00000190  05 00 18 08 10 4f 48 96  63 94 05 00 16 00 28 08  |.....OH.c.....(.|
000001a0  10 4f 48 96 63 94 05 00  18 08 10 4f 48 96 63 94  |.OH.c......OH.c.|
000001b0  05 00 00 00 00 16 a0 02  16 02 00 19 1c 18 06 70  |...............p|
000001c0  61 6e 64 61 73 18 db 02  7b 22 63 72 65 61 74 6f  |andas...{"creato|
000001d0  72 22 3a 20 7b 22 76 65  72 73 69 6f 6e 22 3a 20  |r": {"version": |
000001e0  22 30 2e 31 33 2e 30 22  2c 20 22 6c 69 62 72 61  |"0.13.0", "libra|
000001f0  72 79 22 3a 20 22 70 79  61 72 72 6f 77 22 7d 2c  |ry": "pyarrow"},|
00000200  20 22 70 61 6e 64 61 73  5f 76 65 72 73 69 6f 6e  | "pandas_version|
00000210  22 3a 20 6e 75 6c 6c 2c  20 22 69 6e 64 65 78 5f  |": null, "index_|
00000220  63 6f 6c 75 6d 6e 73 22  3a 20 5b 5d 2c 20 22 63  |columns": [], "c|
00000230  6f 6c 75 6d 6e 73 22 3a  20 5b 7b 22 6d 65 74 61  |olumns": [{"meta|
00000240  64 61 74 61 22 3a 20 6e  75 6c 6c 2c 20 22 66 69  |data": null, "fi|
00000250  65 6c 64 5f 6e 61 6d 65  22 3a 20 22 61 22 2c 20  |eld_name": "a", |
00000260  22 6e 61 6d 65 22 3a 20  22 61 22 2c 20 22 6e 75  |"name": "a", "nu|
00000270  6d 70 79 5f 74 79 70 65  22 3a 20 22 6f 62 6a 65  |mpy_type": "obje|
00000280  63 74 22 2c 20 22 70 61  6e 64 61 73 5f 74 79 70  |ct", "pandas_typ|
00000290  65 22 3a 20 22 62 79 74  65 73 22 7d 2c 20 7b 22  |e": "bytes"}, {"|
000002a0  6d 65 74 61 64 61 74 61  22 3a 20 6e 75 6c 6c 2c  |metadata": null,|
000002b0  20 22 66 69 65 6c 64 5f  6e 61 6d 65 22 3a 20 22  | "field_name": "|
000002c0  62 22 2c 20 22 6e 61 6d  65 22 3a 20 22 62 22 2c  |b", "name": "b",|
000002d0  20 22 6e 75 6d 70 79 5f  74 79 70 65 22 3a 20 22  | "numpy_type": "|
000002e0  64 61 74 65 74 69 6d 65  36 34 5b 6e 73 5d 22 2c  |datetime64[ns]",|
000002f0  20 22 70 61 6e 64 61 73  5f 74 79 70 65 22 3a 20  | "pandas_type": |
00000300  22 64 61 74 65 74 69 6d  65 22 7d 5d 2c 20 22 63  |"datetime"}], "c|
00000310  6f 6c 75 6d 6e 5f 69 6e  64 65 78 65 73 22 3a 20  |olumn_indexes": |
00000320  5b 5d 7d 00 18 22 70 61  72 71 75 65 74 2d 63 70  |[]}.."parquet-cp|
00000330  70 20 76 65 72 73 69 6f  6e 20 31 2e 35 2e 31 2d  |p version 1.5.1-|
00000340  53 4e 41 50 53 48 4f 54  19 2c 1c 00 00 1c 00 00  |SNAPSHOT.,......|
00000350  00 41 02 00 00 50 41 52  31                       |.A...PAR1|
00000359
{code}
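Rather than eyeballing the hexdump, PyArrow can also print the low-level Parquet schema (a sketch; the comment is what I expect it to show, i.e. column b stored as a plain INT64 with a microsecond logical type):

{code:python}
import pyarrow.parquet as pq

# Dump the Parquet-level schema to see the physical/logical column types.
pf = pq.ParquetFile('example.pq')
print(pf.schema)
# expecting column b as physical INT64 with a TIMESTAMP_MICROS logical type
{code}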
These are the strings extracted from that file:

{code:bash}
$ strings example.pq
{code}

{code:none}
PAR1
Test
Test
Test
Test
Test
schema
Test
Test
pandas
{"creator": {"version": "0.13.0", "library": "pyarrow"}, "pandas_version": null, "index_columns": [], "columns": [{"metadata": null, "field_name": "a", "name": "a", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "b", "name": "b", "numpy_type": "datetime64[ns]", "pandas_type": "datetime"}], "column_indexes": []}
"parquet-cpp version 1.5.1-SNAPSHOT
PAR1
{code}

If I copy the file onto HDFS I'm unable to read the timestamp column; I've tried a variety of data types.

{code:bash}
$ hdfs dfs -mkdir /testing/
$ hdfs dfs -copyFromLocal example.pq /testing/
$ hdfs dfs -ls /testing/example.pq
{code}

{code:none}
-rw-r--r-- 1 ubuntu supergroup 857 2019-10-08 10:26 /testing/example.pq
{code}

{code:bash}
$ hive
{code}

It can't be read as a DATETIME field:

{code:sql}
CREATE EXTERNAL TABLE testing (
    a STRING,
    b DATETIME
) STORED AS parquet
LOCATION '/testing/';
{code}

{code:none}
FAILED: SemanticException [Error 10099]: DATETIME type isn't supported yet. Please use DATE or TIMESTAMP instead
{code}

It can't be read as a TIMESTAMP either:

{code:sql}
CREATE EXTERNAL TABLE testing (
    a STRING,
    b TIMESTAMP
) STORED AS parquet
LOCATION '/testing/';

SELECT * FROM testing;
{code}

{code:none}
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
{code}

I can extract the value as a BIGINT, but I'm unable to find a math formula that converts it to the correct time.

{code:sql}
DROP TABLE `testing`;

CREATE EXTERNAL TABLE `testing` (
    a STRING,
    b BIGINT
) STORED AS parquet
LOCATION '/testing/';

SELECT * FROM testing;
{code}

{code:none}
Test    1570530327547664
{code}

{code:sql}
SELECT CAST(b AS TIMESTAMP) FROM testing;
{code}

{code:none}
51738-02-15 08:19:077.664
{code}

{code:sql}
SELECT CAST(b / 1000 AS TIMESTAMP) FROM testing;
{code}

{code:none}
51738-02-15 08:19:077.664
{code}
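The raw BIGINT does at least look like microseconds since the Unix epoch, which Python confirms (a quick check; presumably both casts above land in the year 51738 because Hive reads a BIGINT as milliseconds but a DOUBLE as seconds, so `CAST(b / 1000000 AS TIMESTAMP)` might work as a crude workaround):

{code:python}
from datetime import datetime, timedelta

# The BIGINT that Hive returns, interpreted as microseconds since the epoch.
micros = 1570530327547664
print(datetime(1970, 1, 1) + timedelta(microseconds=micros))
# -> 2019-10-08 10:25:27.547664
{code}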
I'll generate the same content using Hive and show the contents of the Parquet file it produces:

{code:sql}
CREATE TABLE `testing2` (
    a STRING,
    b TIMESTAMP
) STORED AS parquet
LOCATION '/testing2/';

INSERT INTO testing2 (a, b)
VALUES ('Test', "2019-10-08 10:34:30.061");

SELECT * FROM testing2;
{code}

{code:none}
Test    2019-10-08 10:34:30.061
{code}

The timestamp is usable:

{code:sql}
SELECT YEAR(b) FROM testing2;
{code}

{code:none}
2019
{code}

{code:bash}
$ hdfs dfs -copyToLocal /testing2/000000_0 ./
$ hexdump -C 000000_0
{code}

{code:none}
00000000  50 41 52 31 15 00 15 1c  15 1c 2c 15 02 15 00 15  |PAR1......,.....|
00000010  06 15 08 1c 18 04 54 65  73 74 18 04 54 65 73 74  |......Test..Test|
00000020  16 00 00 00 00 02 00 00  00 03 01 04 00 00 00 54  |...............T|
00000030  65 73 74 15 04 15 18 15  18 4c 15 02 15 04 00 00  |est......L......|
00000040  40 65 76 e0 9f 22 00 00  8d 84 25 00 15 00 15 10  |@ev.."....%.....|
00000050  15 10 2c 15 02 15 04 15  06 15 08 1c 18 0c 40 65  |..,...........@e|
00000060  76 e0 9f 22 00 00 8d 84  25 00 18 0c 40 65 76 e0  |v.."....%...@ev.|
00000070  9f 22 00 00 8d 84 25 00  16 00 00 00 00 02 00 00  |."....%.........|
00000080  00 03 01 00 03 15 02 19  3c 48 0b 68 69 76 65 5f  |........<H.hive_|
00000090  73 63 68 65 6d 61 15 04  00 15 0c 25 02 18 01 61  |schema.....%...a|
000000a0  25 00 00 15 06 25 02 18  01 62 00 16 02 19 1c 19  |%....%...b......|
000000b0  2c 26 08 1c 15 0c 19 35  08 06 00 19 18 01 61 15  |,&.....5......a.|
000000c0  00 16 02 16 5e 16 5e 26  08 3c 18 04 54 65 73 74  |....^.^&.<..Test|
000000d0  18 04 54 65 73 74 16 00  00 00 00 26 66 1c 15 06  |..Test.....&f...|
000000e0  19 35 08 04 06 19 18 01  62 15 00 16 02 16 a4 01  |.5......b.......|
000000f0  16 a4 01 26 66 3c 18 0c  40 65 76 e0 9f 22 00 00  |...&f<..@ev.."..|
00000100  8d 84 25 00 18 0c 40 65  76 e0 9f 22 00 00 8d 84  |..%...@ev.."....|
00000110  25 00 16 00 00 00 00 16  82 02 16 02 00 28 49 70  |%............(Ip|
00000120  61 72 71 75 65 74 2d 6d  72 20 76 65 72 73 69 6f  |arquet-mr versio|
00000130  6e 20 31 2e 38 2e 31 20  28 62 75 69 6c 64 20 34  |n 1.8.1 (build 4|
00000140  61 62 61 34 64 61 65 37  62 62 30 64 34 65 64 62  |aba4dae7bb0d4edb|
00000150  63 66 37 39 32 33 61 65  31 33 33 39 66 32 38 66  |cf7923ae1339f28f|
00000160  64 33 66 37 66 63 66 29  00 e4 00 00 00 50 41 52  |d3f7fcf).....PAR|
00000170  31                                                |1|
00000171
{code}

{code:bash}
$ strings 000000_0
{code}

{code:none}
PAR1
Test
Test
Test
hive_schema
Test
Test
(Iparquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
PAR1
{code}
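If I'm reading the Parquet spec correctly, Hive stores the timestamp as a 12-byte INT96 value (little-endian nanoseconds-of-day followed by a Julian day number) rather than the INT64 microseconds PyArrow writes, which would explain the LongWritable-to-TimestampWritable cast failure. Decoding the 12 bytes visible in the dump above (my own sketch, not verified output):

{code:python}
import struct
from datetime import datetime, timedelta

# The 12-byte timestamp from Hive's file: 8 bytes of nanoseconds-of-day
# (little-endian), then a 4-byte Julian day number.
raw = b'\x40\x65\x76\xe0\x9f\x22\x00\x00\x8d\x84\x25\x00'
nanos_of_day, julian_day = struct.unpack('<qi', raw)

# Julian day 2440588 is the Unix epoch (1970-01-01).
ts = datetime(1970, 1, 1) + timedelta(days=julian_day - 2440588,
                                      microseconds=nanos_of_day // 1000)
print(ts)  # -> 2019-10-08 10:34:30.061000
{code}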
I'll do the same for Presto, as it has its own Parquet writer as well:

{code:bash}
$ presto --server localhost:8085 --schema default --catalog hive
{code}

{code:sql}
CREATE TABLE testing3
WITH (FORMAT='PARQUET') AS
SELECT 'Test' AS a,
       CAST('2019-10-08 10:34:30.061' AS TIMESTAMP) AS b;

SELECT * FROM testing3;
{code}

{code:none}
  a   |            b
------+-------------------------
 Test | 2019-10-08 10:34:30.061
{code}

{code:sql}
SELECT YEAR(b) FROM testing3;
{code}

{code:none}
 _col0
-------
  2019
{code}

{code:bash}
$ hdfs dfs -copyToLocal /user/hive/warehouse/testing3/20191008_105125_00077_trjf2_61453828-1d8f-45a7-88de-3876b84d0cb9 ./
$ hexdump -C 20191008_105125_00077_trjf2_61453828-1d8f-45a7-88de-3876b84d0cb9
{code}

{code:none}
00000000  50 41 52 31 15 00 15 1c  15 40 2c 15 02 15 00 15  |PAR1.....@,.....|
00000010  06 15 08 1c 18 04 54 65  73 74 18 04 54 65 73 74  |......Test..Test|
00000020  16 00 00 00 00 1f 8b 08  00 00 00 00 00 00 03 63  |...............c|
00000030  62 60 60 60 66 64 01 92  21 a9 c5 25 00 fe 12 da  |b```fd..!..%....|
00000040  11 0e 00 00 00 15 04 15  18 15 40 4c 15 02 15 04  |..........@L....|
00000050  00 00 1f 8b 08 00 00 00  00 00 00 03 73 48 2d 7b  |............sH-{|
00000060  30 5f 89 81 a1 b7 45 95  01 00 64 bc 2a 14 0c 00  |0_....E...d.*...|
00000070  00 00 15 00 15 10 15 38  2c 15 02 15 04 15 06 15  |.......8,.......|
00000080  08 1c 18 0c 40 65 76 e0  9f 22 00 00 8d 84 25 00  |....@ev.."....%.|
00000090  18 0c 40 65 76 e0 9f 22  00 00 8d 84 25 00 16 00  |..@ev.."....%...|
000000a0  00 00 00 1f 8b 08 00 00  00 00 00 00 03 63 62 60  |.............cb`|
000000b0  60 60 66 64 60 06 00 77  4c 79 ad 08 00 00 00 15  |``fd`..wLy......|
000000c0  02 19 3c 48 0b 68 69 76  65 5f 73 63 68 65 6d 61  |..<H.hive_schema|
000000d0  15 04 00 15 0c 25 02 18  01 61 25 00 00 15 06 25  |.....%...a%....%|
000000e0  02 18 01 62 00 16 02 19  1c 19 2c 26 08 1c 15 0c  |...b......,&....|
000000f0  19 35 08 00 06 19 18 01  61 15 04 16 02 16 5e 16  |.5......a.....^.|
00000100  82 01 26 08 3c 18 04 54  65 73 74 18 04 54 65 73  |..&.<..Test..Tes|
00000110  74 16 00 00 00 00 26 8a  01 1c 15 06 19 35 04 08  |t.....&......5..|
00000120  06 19 18 01 62 15 04 16  02 16 a4 01 16 f4 01 26  |....b..........&|
00000130  8a 01 3c 18 0c 40 65 76  e0 9f 22 00 00 8d 84 25  |..<..@ev.."....%|
00000140  00 18 0c 40 65 76 e0 9f  22 00 00 8d 84 25 00 16  |...@ev.."....%..|
00000150  00 00 00 00 16 82 02 16  02 00 28 0a 70 61 72 71  |..........(.parq|
00000160  75 65 74 2d 6d 72 00 a8  00 00 00 50 41 52 31     |uet-mr.....PAR1|
0000016f
{code}

{code:bash}
$ strings 20191008_105125_00077_trjf2_61453828-1d8f-45a7-88de-3876b84d0cb9
{code}

{code:none}
PAR1
Test
Test
cb```fd
sH-{0_
cb```fd`
hive_schema
Test
Test
parquet-mr
PAR1
{code}

Any idea how I can save timestamps with Pandas and have them readable by Hive and Presto?
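For what it's worth, ParquetWriter and pq.write_table expose some timestamp-related options that look relevant. A sketch of what I plan to try next (untested on the Hive side, so these are assumptions on my part):

{code:python}
from datetime import datetime

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame([{'a': 'Test', 'b': datetime.utcnow()}])
table = pa.Table.from_pandas(df, preserve_index=False)

# Idea 1: write Hive/Impala-style INT96 timestamps.
pq.write_table(table, 'example_int96.pq',
               use_deprecated_int96_timestamps=True)

# Idea 2: coerce the timestamps to millisecond-precision INT64.
pq.write_table(table, 'example_ms.pq',
               coerce_timestamps='ms', allow_truncated_timestamps=True)
{code}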