Michael McCarthy created PARQUET-1235:
-----------------------------------------
Summary: Parquet-tools cat mangles strings created by other clients
Key: PARQUET-1235
URL: https://issues.apache.org/jira/browse/PARQUET-1235
Project: Parquet
Issue Type: Bug
Environment: {noformat}
uname -a
Linux myhost 4.4.0-63-generic #84-Ubuntu SMP Wed Feb 1 17:20:32 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
{noformat}
Reporter: Michael McCarthy
I have some Parquet files that are created by a Java MR process (which I do
not own). I can read these fields successfully in Pig and Spark, but for some
reason the String fields are mangled when I view the files with
parquet-tools (cat).
Here are the details on the file metadata using today's build of parquet-tools:
{noformat}
hadoop jar parquet-tools-1.9.1-SNAPSHOT.jar meta <hdfs>/parquet-r-00000
{noformat}
Output:
{noformat}
file: hdfs://<path>/parquet-r-00000
creator: parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
file schema: MY_DATA
--------------------------------------------------------------------------------
myfield: OPTIONAL BINARY R:0 D:1
row group 1: RC:37343 TS:32397576 OFFSET:4
--------------------------------------------------------------------------------
myfield: BINARY SNAPPY DO:0 FPO:4 SZ:273374/556406/2.04 VC:37343 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[no stats for this column]
{noformat}
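One detail in the metadata above may be relevant, though this is only a guess and not confirmed anywhere in this report: myfield is listed as OPTIONAL BINARY with no UTF8 annotation. Some readers can be told to treat such columns as strings, while a dump tool might print the raw bytes in an encoded form instead. The Python sketch below only illustrates that general effect. The sample value and both helper names are hypothetical, and it does not reproduce the actual output logic of parquet-tools.

```python
import base64

# Hypothetical sample value; the report does not show real field contents.
raw = "hello world".encode("utf-8")


def render_as_string(value: bytes) -> str:
    """Decode the bytes as UTF-8 text, as a string-aware reader might."""
    return value.decode("utf-8")


def render_as_opaque(value: bytes) -> str:
    """Show the bytes in Base64, one way a tool could print un-annotated binary."""
    return base64.b64encode(value).decode("ascii")


if __name__ == "__main__":
    print("as string:", render_as_string(raw))
    print("as opaque:", render_as_opaque(raw))
```

If the mangled output from cat turns back into the expected text after a Base64 decode, that would point toward a missing string annotation rather than corrupted data.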
Has anyone seen this before?