Mark Litwintschik created ARROW-6815:
----------------------------------------
             Summary: Timestamps saved via Pandas and PyArrow unreadable in Hive and Presto
                 Key: ARROW-6815
                 URL: https://issues.apache.org/jira/browse/ARROW-6815
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.13.0
            Reporter: Mark Litwintschik


I'm unable to read timestamps saved as Parquet data via Pandas and PyArrow with either Hive or Presto. These are the versions of the various pieces of software I'm using:

* Pandas 0.24.1
* PyArrow 0.13.0
* Hadoop 3.0.3
* Hive 2.3.3
* Presto 0.215

The Hadoop setup steps I took can be found [here|https://tech.marksblogg.com/hadoop-3-single-node-install-guide.html].

This is what I did to generate the Parquet file:

{code:python}
from datetime import datetime
from io import BytesIO  # Parquet is binary data, so use a bytes buffer

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame([{'a': 'Test', 'b': datetime.utcnow()}])

# Convert the DataFrame into an Arrow table and write it out as Parquet.
parquet_buffer = BytesIO()
pa_table = pa.Table.from_pandas(df, preserve_index=False)

writer = pq.ParquetWriter(parquet_buffer, pa_table.schema)
writer.write_table(pa_table)
writer.close()

with open('example.pq', 'w+b') as f:
    f.write(parquet_buffer.getvalue())
{code}
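For what it's worth, PyArrow itself reads the file back without complaint. A quick sanity check (a minimal sketch; the commented output is what I expect to see, with the datetime64[ns] column coerced to microseconds on write):

{code:python}
import pyarrow.parquet as pq

# Read the file back with PyArrow to confirm it round-trips in Python.
table = pq.read_table('example.pq')
print(table.schema)     # expecting: a: binary, b: timestamp[us]
print(table.to_pandas())
{code}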
This is its raw contents:

{code:bash}
$ hexdump -C example.pq
{code}

{code:none}
00000000  50 41 52 31 15 04 15 10  15 14 4c 15 02 15 04 12  |PAR1......L.....|
00000010  00 00 08 1c 04 00 00 00  54 65 73 74 15 00 15 12  |........Test....|
00000020  15 16 2c 15 02 15 04 15  06 15 06 1c 18 04 54 65  |..,...........Te|
00000030  73 74 18 04 54 65 73 74  16 00 00 00 00 09 20 02  |st..Test...... .|
00000040  00 00 00 02 01 01 02 00  26 90 01 1c 15 0c 19 35  |........&......5|
00000050  04 00 06 19 18 01 61 15  02 16 02 16 80 01 16 88  |......a.........|
00000060  01 26 38 26 08 1c 36 00  28 04 54 65 73 74 18 04  |.&8&..6.(.Test..|
00000070  54 65 73 74 00 00 00 15  04 15 10 15 14 4c 15 02  |Test.........L..|
00000080  15 04 12 00 00 08 1c 10  4f 48 96 63 94 05 00 15  |........OH.c....|
00000090  00 15 12 15 16 2c 15 02  15 04 15 06 15 06 1c 18  |.....,..........|
000000a0  08 10 4f 48 96 63 94 05  00 18 08 10 4f 48 96 63  |..OH.c......OH.c|
000000b0  94 05 00 16 00 00 00 00  09 20 02 00 00 00 02 01  |......... ......|
000000c0  01 02 00 26 86 03 1c 15  04 19 35 04 00 06 19 18  |...&......5.....|
000000d0  01 62 15 02 16 02 16 90  01 16 98 01 26 9e 02 26  |.b..........&..&|
000000e0  ee 01 1c 18 08 10 4f 48  96 63 94 05 00 18 08 10  |......OH.c......|
000000f0  4f 48 96 63 94 05 00 16  00 28 08 10 4f 48 96 63  |OH.c.....(..OH.c|
00000100  94 05 00 18 08 10 4f 48  96 63 94 05 00 00 00 00  |......OH.c......|
00000110  15 02 19 3c 35 00 18 06  73 63 68 65 6d 61 15 04  |...<5...schema..|
00000120  00 15 0c 25 02 18 01 61  00 15 04 25 02 18 01 62  |...%...a...%...b|
00000130  25 14 00 16 02 19 1c 19  2c 26 90 01 1c 15 0c 19  |%.......,&......|
00000140  35 04 00 06 19 18 01 61  15 02 16 02 16 80 01 16  |5......a........|
00000150  88 01 26 38 26 08 1c 36  00 28 04 54 65 73 74 18  |..&8&..6.(.Test.|
00000160  04 54 65 73 74 00 00 00  26 86 03 1c 15 04 19 35  |.Test...&......5|
00000170  04 00 06 19 18 01 62 15  02 16 02 16 90 01 16 98  |......b.........|
00000180  01 26 9e 02 26 ee 01 1c  18 08 10 4f 48 96 63 94  |.&..&......OH.c.|
00000190  05 00 18 08 10 4f 48 96  63 94 05 00 16 00 28 08  |.....OH.c.....(.|
000001a0  10 4f 48 96 63 94 05 00  18 08 10 4f 48 96 63 94  |.OH.c......OH.c.|
000001b0  05 00 00 00 00 16 a0 02  16 02 00 19 1c 18 06 70  |...............p|
000001c0  61 6e 64 61 73 18 db 02  7b 22 63 72 65 61 74 6f  |andas...{"creato|
000001d0  72 22 3a 20 7b 22 76 65  72 73 69 6f 6e 22 3a 20  |r": {"version": |
000001e0  22 30 2e 31 33 2e 30 22  2c 20 22 6c 69 62 72 61  |"0.13.0", "libra|
000001f0  72 79 22 3a 20 22 70 79  61 72 72 6f 77 22 7d 2c  |ry": "pyarrow"},|
00000200  20 22 70 61 6e 64 61 73  5f 76 65 72 73 69 6f 6e  | "pandas_version|
00000210  22 3a 20 6e 75 6c 6c 2c  20 22 69 6e 64 65 78 5f  |": null, "index_|
00000220  63 6f 6c 75 6d 6e 73 22  3a 20 5b 5d 2c 20 22 63  |columns": [], "c|
00000230  6f 6c 75 6d 6e 73 22 3a  20 5b 7b 22 6d 65 74 61  |olumns": [{"meta|
00000240  64 61 74 61 22 3a 20 6e  75 6c 6c 2c 20 22 66 69  |data": null, "fi|
00000250  65 6c 64 5f 6e 61 6d 65  22 3a 20 22 61 22 2c 20  |eld_name": "a", |
00000260  22 6e 61 6d 65 22 3a 20  22 61 22 2c 20 22 6e 75  |"name": "a", "nu|
00000270  6d 70 79 5f 74 79 70 65  22 3a 20 22 6f 62 6a 65  |mpy_type": "obje|
00000280  63 74 22 2c 20 22 70 61  6e 64 61 73 5f 74 79 70  |ct", "pandas_typ|
00000290  65 22 3a 20 22 62 79 74  65 73 22 7d 2c 20 7b 22  |e": "bytes"}, {"|
000002a0  6d 65 74 61 64 61 74 61  22 3a 20 6e 75 6c 6c 2c  |metadata": null,|
000002b0  20 22 66 69 65 6c 64 5f  6e 61 6d 65 22 3a 20 22  | "field_name": "|
000002c0  62 22 2c 20 22 6e 61 6d  65 22 3a 20 22 62 22 2c  |b", "name": "b",|
000002d0  20 22 6e 75 6d 70 79 5f  74 79 70 65 22 3a 20 22  | "numpy_type": "|
000002e0  64 61 74 65 74 69 6d 65  36 34 5b 6e 73 5d 22 2c  |datetime64[ns]",|
000002f0  20 22 70 61 6e 64 61 73  5f 74 79 70 65 22 3a 20  | "pandas_type": |
00000300  22 64 61 74 65 74 69 6d  65 22 7d 5d 2c 20 22 63  |"datetime"}], "c|
00000310  6f 6c 75 6d 6e 5f 69 6e  64 65 78 65 73 22 3a 20  |olumn_indexes": |
00000320  5b 5d 7d 00 18 22 70 61  72 71 75 65 74 2d 63 70  |[]}.."parquet-cp|
00000330  70 20 76 65 72 73 69 6f  6e 20 31 2e 35 2e 31 2d  |p version 1.5.1-|
00000340  53 4e 41 50 53 48 4f 54  19 2c 1c 00 00 1c 00 00  |SNAPSHOT.,......|
00000350  00 41 02 00 00 50 41 52  31                       |.A...PAR1|
00000359
{code}
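Rather than eyeballing the hexdump, PyArrow can also print the low-level Parquet schema (a sketch; the comment is what I expect it to show, i.e. column b stored as a plain INT64 with a microsecond logical type):

{code:python}
import pyarrow.parquet as pq

# Dump the Parquet-level schema to see the physical/logical column types.
pf = pq.ParquetFile('example.pq')
print(pf.schema)
# expecting column b as physical INT64 with a TIMESTAMP_MICROS logical type
{code}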
These are the strings extracted from that file:

{code:bash}
$ strings example.pq
{code}

{code:none}
PAR1
Test
Test
Test
Test
Test
schema
Test
Test
pandas
{"creator": {"version": "0.13.0", "library": "pyarrow"}, "pandas_version": null, "index_columns": [], "columns": [{"metadata": null, "field_name": "a", "name": "a", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "b", "name": "b", "numpy_type": "datetime64[ns]", "pandas_type": "datetime"}], "column_indexes": []}
"parquet-cpp version 1.5.1-SNAPSHOT
PAR1
{code}

If I copy the file onto HDFS I'm unable to read the timestamp column; I've tried a variety of data types.

{code:bash}
$ hdfs dfs -mkdir /testing/
$ hdfs dfs -copyFromLocal example.pq /testing/
$ hdfs dfs -ls /testing/example.pq
{code}

{code:none}
-rw-r--r-- 1 ubuntu supergroup 857 2019-10-08 10:26 /testing/example.pq
{code}

{code:bash}
$ hive
{code}

It can't be read as a DATETIME field:

{code:sql}
CREATE EXTERNAL TABLE testing (
    a STRING,
    b DATETIME
) STORED AS parquet
LOCATION '/testing/';
{code}

{code:none}
FAILED: SemanticException [Error 10099]: DATETIME type isn't supported yet. Please use DATE or TIMESTAMP instead
{code}

It can't be read as a TIMESTAMP either:

{code:sql}
CREATE EXTERNAL TABLE testing (
    a STRING,
    b TIMESTAMP
) STORED AS parquet
LOCATION '/testing/';

SELECT * FROM testing;
{code}

{code:none}
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
{code}

I can extract the value as a BIGINT, but I'm unable to find a math formula that converts it to the correct time.

{code:sql}
DROP TABLE `testing`;

CREATE EXTERNAL TABLE `testing` (
    a STRING,
    b BIGINT
) STORED AS parquet
LOCATION '/testing/';

SELECT * FROM testing;
{code}

{code:none}
Test    1570530327547664
{code}

{code:sql}
SELECT CAST(b AS TIMESTAMP) FROM testing;
{code}

{code:none}
51738-02-15 08:19:077.664
{code}

{code:sql}
SELECT CAST(b / 1000 AS TIMESTAMP) FROM testing;
{code}

{code:none}
51738-02-15 08:19:077.664
{code}
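The raw BIGINT does at least look like microseconds since the Unix epoch, which Python confirms (a quick check; presumably both casts above land in the year 51738 because Hive reads a BIGINT as milliseconds but a DOUBLE as seconds, so `CAST(b / 1000000 AS TIMESTAMP)` might work as a crude workaround):

{code:python}
from datetime import datetime, timedelta

# The BIGINT that Hive returns, interpreted as microseconds since the epoch.
micros = 1570530327547664
print(datetime(1970, 1, 1) + timedelta(microseconds=micros))
# -> 2019-10-08 10:25:27.547664
{code}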
I'll generate the same content using Hive and show the contents of the Parquet file it produces:

{code:sql}
CREATE TABLE `testing2` (
    a STRING,
    b TIMESTAMP
) STORED AS parquet
LOCATION '/testing2/';

INSERT INTO testing2 (a, b)
VALUES ('Test', "2019-10-08 10:34:30.061");

SELECT * FROM testing2;
{code}

{code:none}
Test    2019-10-08 10:34:30.061
{code}

The timestamp is usable:

{code:sql}
SELECT YEAR(b) FROM testing2;
{code}

{code:none}
2019
{code}

{code:bash}
$ hdfs dfs -copyToLocal /testing2/000000_0 ./
$ hexdump -C 000000_0
{code}

{code:none}
00000000  50 41 52 31 15 00 15 1c  15 1c 2c 15 02 15 00 15  |PAR1......,.....|
00000010  06 15 08 1c 18 04 54 65  73 74 18 04 54 65 73 74  |......Test..Test|
00000020  16 00 00 00 00 02 00 00  00 03 01 04 00 00 00 54  |...............T|
00000030  65 73 74 15 04 15 18 15  18 4c 15 02 15 04 00 00  |est......L......|
00000040  40 65 76 e0 9f 22 00 00  8d 84 25 00 15 00 15 10  |@ev.."....%.....|
00000050  15 10 2c 15 02 15 04 15  06 15 08 1c 18 0c 40 65  |..,...........@e|
00000060  76 e0 9f 22 00 00 8d 84  25 00 18 0c 40 65 76 e0  |v.."....%...@ev.|
00000070  9f 22 00 00 8d 84 25 00  16 00 00 00 00 02 00 00  |."....%.........|
00000080  00 03 01 00 03 15 02 19  3c 48 0b 68 69 76 65 5f  |........<H.hive_|
00000090  73 63 68 65 6d 61 15 04  00 15 0c 25 02 18 01 61  |schema.....%...a|
000000a0  25 00 00 15 06 25 02 18  01 62 00 16 02 19 1c 19  |%....%...b......|
000000b0  2c 26 08 1c 15 0c 19 35  08 06 00 19 18 01 61 15  |,&.....5......a.|
000000c0  00 16 02 16 5e 16 5e 26  08 3c 18 04 54 65 73 74  |....^.^&.<..Test|
000000d0  18 04 54 65 73 74 16 00  00 00 00 26 66 1c 15 06  |..Test.....&f...|
000000e0  19 35 08 04 06 19 18 01  62 15 00 16 02 16 a4 01  |.5......b.......|
000000f0  16 a4 01 26 66 3c 18 0c  40 65 76 e0 9f 22 00 00  |...&f<..@ev.."..|
00000100  8d 84 25 00 18 0c 40 65  76 e0 9f 22 00 00 8d 84  |..%...@ev.."....|
00000110  25 00 16 00 00 00 00 16  82 02 16 02 00 28 49 70  |%............(Ip|
00000120  61 72 71 75 65 74 2d 6d  72 20 76 65 72 73 69 6f  |arquet-mr versio|
00000130  6e 20 31 2e 38 2e 31 20  28 62 75 69 6c 64 20 34  |n 1.8.1 (build 4|
00000140  61 62 61 34 64 61 65 37  62 62 30 64 34 65 64 62  |aba4dae7bb0d4edb|
00000150  63 66 37 39 32 33 61 65  31 33 33 39 66 32 38 66  |cf7923ae1339f28f|
00000160  64 33 66 37 66 63 66 29  00 e4 00 00 00 50 41 52  |d3f7fcf).....PAR|
00000170  31                                                |1|
00000171
{code}

{code:bash}
$ strings 000000_0
{code}

{code:none}
PAR1
Test
Test
Test
hive_schema
Test
Test
(Iparquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
PAR1
{code}
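If I'm reading the Parquet spec correctly, Hive stores the timestamp as a 12-byte INT96 value (little-endian nanoseconds-of-day followed by a Julian day number) rather than the INT64 microseconds PyArrow writes, which would explain the LongWritable-to-TimestampWritable cast failure. Decoding the 12 bytes visible in the dump above (my own sketch, not verified output):

{code:python}
import struct
from datetime import datetime, timedelta

# The 12-byte timestamp from Hive's file: 8 bytes of nanoseconds-of-day
# (little-endian), then a 4-byte Julian day number.
raw = b'\x40\x65\x76\xe0\x9f\x22\x00\x00\x8d\x84\x25\x00'
nanos_of_day, julian_day = struct.unpack('<qi', raw)

# Julian day 2440588 is the Unix epoch (1970-01-01).
ts = datetime(1970, 1, 1) + timedelta(days=julian_day - 2440588,
                                      microseconds=nanos_of_day // 1000)
print(ts)  # -> 2019-10-08 10:34:30.061000
{code}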
I'll do the same for Presto, as it has its own Parquet writer as well:

{code:bash}
$ presto --server localhost:8085 --schema default --catalog hive
{code}

{code:sql}
CREATE TABLE testing3
WITH (FORMAT='PARQUET') AS
SELECT 'Test' AS a,
       CAST('2019-10-08 10:34:30.061' AS TIMESTAMP) AS b;

SELECT * FROM testing3;
{code}

{code:none}
  a   |            b
------+-------------------------
 Test | 2019-10-08 10:34:30.061
{code}

{code:sql}
SELECT YEAR(b) FROM testing3;
{code}

{code:none}
 _col0
-------
  2019
{code}

{code:bash}
$ hdfs dfs -copyToLocal /user/hive/warehouse/testing3/20191008_105125_00077_trjf2_61453828-1d8f-45a7-88de-3876b84d0cb9 ./
$ hexdump -C 20191008_105125_00077_trjf2_61453828-1d8f-45a7-88de-3876b84d0cb9
{code}

{code:none}
00000000  50 41 52 31 15 00 15 1c  15 40 2c 15 02 15 00 15  |PAR1.....@,.....|
00000010  06 15 08 1c 18 04 54 65  73 74 18 04 54 65 73 74  |......Test..Test|
00000020  16 00 00 00 00 1f 8b 08  00 00 00 00 00 00 03 63  |...............c|
00000030  62 60 60 60 66 64 01 92  21 a9 c5 25 00 fe 12 da  |b```fd..!..%....|
00000040  11 0e 00 00 00 15 04 15  18 15 40 4c 15 02 15 04  |..........@L....|
00000050  00 00 1f 8b 08 00 00 00  00 00 00 03 73 48 2d 7b  |............sH-{|
00000060  30 5f 89 81 a1 b7 45 95  01 00 64 bc 2a 14 0c 00  |0_....E...d.*...|
00000070  00 00 15 00 15 10 15 38  2c 15 02 15 04 15 06 15  |.......8,.......|
00000080  08 1c 18 0c 40 65 76 e0  9f 22 00 00 8d 84 25 00  |....@ev.."....%.|
00000090  18 0c 40 65 76 e0 9f 22  00 00 8d 84 25 00 16 00  |..@ev.."....%...|
000000a0  00 00 00 1f 8b 08 00 00  00 00 00 00 03 63 62 60  |.............cb`|
000000b0  60 60 66 64 60 06 00 77  4c 79 ad 08 00 00 00 15  |``fd`..wLy......|
000000c0  02 19 3c 48 0b 68 69 76  65 5f 73 63 68 65 6d 61  |..<H.hive_schema|
000000d0  15 04 00 15 0c 25 02 18  01 61 25 00 00 15 06 25  |.....%...a%....%|
000000e0  02 18 01 62 00 16 02 19  1c 19 2c 26 08 1c 15 0c  |...b......,&....|
000000f0  19 35 08 00 06 19 18 01  61 15 04 16 02 16 5e 16  |.5......a.....^.|
00000100  82 01 26 08 3c 18 04 54  65 73 74 18 04 54 65 73  |..&.<..Test..Tes|
00000110  74 16 00 00 00 00 26 8a  01 1c 15 06 19 35 04 08  |t.....&......5..|
00000120  06 19 18 01 62 15 04 16  02 16 a4 01 16 f4 01 26  |....b..........&|
00000130  8a 01 3c 18 0c 40 65 76  e0 9f 22 00 00 8d 84 25  |..<..@ev.."....%|
00000140  00 18 0c 40 65 76 e0 9f  22 00 00 8d 84 25 00 16  |...@ev.."....%..|
00000150  00 00 00 00 16 82 02 16  02 00 28 0a 70 61 72 71  |..........(.parq|
00000160  75 65 74 2d 6d 72 00 a8  00 00 00 50 41 52 31     |uet-mr.....PAR1|
0000016f
{code}

{code:bash}
$ strings 20191008_105125_00077_trjf2_61453828-1d8f-45a7-88de-3876b84d0cb9
{code}

{code:none}
PAR1
Test
Test
cb```fd
sH-{0_
cb```fd`
hive_schema
Test
Test
parquet-mr
PAR1
{code}

Any idea how I can save timestamps with Pandas and have them readable by Hive and Presto?
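For what it's worth, ParquetWriter and pq.write_table expose some timestamp-related options that look relevant. A sketch of what I plan to try next (untested on the Hive side, so these are assumptions on my part):

{code:python}
from datetime import datetime

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame([{'a': 'Test', 'b': datetime.utcnow()}])
table = pa.Table.from_pandas(df, preserve_index=False)

# Idea 1: write Hive/Impala-style INT96 timestamps.
pq.write_table(table, 'example_int96.pq',
               use_deprecated_int96_timestamps=True)

# Idea 2: coerce the timestamps to millisecond-precision INT64.
pq.write_table(table, 'example_ms.pq',
               coerce_timestamps='ms', allow_truncated_timestamps=True)
{code}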