New Parquet CLI project

2016-10-26 Thread Ryan Blue
Hi everyone,

At the last Parquet sync-up, I mentioned that I've been working on a new
Parquet CLI tool (based on Cloudera's Kite CLI). I haven't had a chance to
move the build to Maven or get the licensing taken care of for an Apache
submission, but it is clean enough that people can start looking at it. I've
posted it here:

  https://github.com/rdblue/parquet-cli

The build uses Gradle and the jar is run with the hadoop command, like the
current tools. It is based on parquet-avro and can convert between Avro,
Parquet, CSV, and JSON. It has been a great tool for trying different
settings and for inspecting Parquet file metadata and dictionaries more
easily.

Please have a look. I'm interested to know whether anyone would like this
added to the Parquet project. Thanks!

rb

-- 
Ryan Blue
Software Engineer
Netflix


[jira] [Resolved] (PARQUET-723) parquet is not storing the type for the column.

2016-10-26 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PARQUET-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Peña resolved PARQUET-723.
-
Resolution: Won't Fix

I reported the issue in the Hive jira: 
https://issues.apache.org/jira/browse/HIVE-15079

> parquet is not storing the type for the column.
> ---
>
> Key: PARQUET-723
> URL: https://issues.apache.org/jira/browse/PARQUET-723
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Narasimha
>
> 1. Create Text file format table 
>   CREATE EXTERNAL TABLE IF NOT EXISTS emp(
>   id INT,
>   first_name STRING,
>   last_name STRING,
>   dateofBirth STRING,
>   join_date INT
>   )
>   COMMENT 'This is Employee Table Date Of Birth of type String'
>   ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY ','
>   LINES TERMINATED BY '\n'
>   STORED AS TEXTFILE
>   LOCATION '/user/employee/beforePartition';
> 2. Load the data into table
>   load data inpath '/user/somupoc_timestamp/employeeData_partitioned.csv' into table emp;
>   select * from emp;
> 3. Create Partitioned table with file format as Parquet (dateofBirth STRING)
>   create external table emp_afterpartition(
>   id int, first_name STRING, last_name STRING, dateofBirth STRING)
>   COMMENT 'Employee partitioned table with dateofBirth of type string'
>   partitioned by (join_date int)
>   STORED as parquet
>   LOCATION '/user/employee/afterpartition';
> 4. Fetch the data from Partitioned column
>   set hive.exec.dynamic.partition=true;
>   set hive.exec.dynamic.partition.mode=nonstrict;
>   insert overwrite table emp_afterpartition partition (join_date) select * from emp;
>   select * from emp_afterpartition;
> 5. Create Partitioned table with file format as Parquet (dateofBirth TIMESTAMP)
>   CREATE EXTERNAL TABLE IF NOT EXISTS employee_afterpartition_timestamp_parq(
>   id INT,first_name STRING,last_name STRING,dateofBirth TIMESTAMP)
>   COMMENT 'employee partitioned table with dateofBirth of type TIMESTAMP'
>   PARTITIONED BY (join_date INT)
>   STORED AS PARQUET
>   LOCATION '/user/employee/afterpartition';
>   select * from employee_afterpartition_timestamp_parq;
>   -- 0 records returned
>   impala ::   alter table employee_afterpartition_timestamp_parq RECOVER PARTITIONS;
>   Hive :: MSCK REPAIR TABLE employee_afterpartition_timestamp_parq;
>   -- MSCK works in Hive and RECOVER PARTITIONS works in Impala -- metastore check command with the repair table option:
>   select * from employee_afterpartition_timestamp_parq;
> Actual Result :: Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
> Expected Result :: Data should display
> Note: if file format is text file instead of Parquet then I am able to fetch the data.
> Observation : Two tables having different column type pointing to same location (HDFS).
> sample Data
> =
> 1,Joyce,Garza,2016-07-17 14:42:18,201607
> 2,Jerry,Ortiz,2016-08-17 21:36:54,201608
> 3,Steven,Ryan,2016-09-10 01:32:40,201609
> 4,Lisa,Black,2015-10-12 15:05:13,201610
> 5,Jose,Turner,2015-011-10 06:38:40,201611
> 6,Joyce,Garza,2016-08-02,201608
> 7,Jerry,Ortiz,2016-01-01,201601
> 8,Steven,Ryan,2016/08/20,201608
> 9,Lisa,Black,2016/09/12,201609
> 10,Jose,Turner,09/19/2016,201609
> 11,Jose,Turner,20160915,201609



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-723) parquet is not storing the type for the column.

2016-10-26 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PARQUET-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15609684#comment-15609684
 ] 

Sergio Peña commented on PARQUET-723:
-

The data is stored as STRING (or optional binary) in the first Parquet table.
Then, when we want to read it as a TIMESTAMP, Hive expects to use the Timestamp
inspector instead of a normal Text inspector.

This is not a Parquet bug, but a missing Hive feature, similar to the 'auto type
widening' we were working on for integer values.
See https://issues.apache.org/jira/browse/HIVE-14085
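
To make the mismatch above concrete, here is a minimal Python sketch. It is not Hive or Parquet code. The function names and the timestamp layout in ASSUMED_FORMAT are assumptions made for illustration, and Hive's real conversion rules are not modelled. It shows that a value stored as text cannot simply be treated as a timestamp, and that an explicit string-to-timestamp conversion only succeeds for some of the sample rows in the report.

```python
from datetime import datetime
from typing import Optional

# A few dateofBirth values copied from the sample data in the issue.
SAMPLE_VALUES = [
    "2016-07-17 14:42:18",
    "2015-011-10 06:38:40",
    "2016-08-02",
    "2016/08/20",
    "09/19/2016",
    "20160915",
]

# Assumed layout, for this sketch only; not Hive's actual parsing rules.
ASSUMED_FORMAT = "%Y-%m-%d %H:%M:%S"


def read_without_conversion(stored: object) -> datetime:
    """Mimic a reader that expects a timestamp and performs no conversion."""
    if not isinstance(stored, datetime):
        raise TypeError(f"{type(stored).__name__} cannot be cast to datetime")
    return stored


def read_with_conversion(stored: str) -> Optional[datetime]:
    """Explicitly convert stored text to a timestamp; None if it does not fit."""
    try:
        return datetime.strptime(stored, ASSUMED_FORMAT)
    except ValueError:
        return None


if __name__ == "__main__":
    for value in SAMPLE_VALUES:
        print(repr(value), "->", read_with_conversion(value))
```

Returning None for rows that do not match is a choice made for this sketch; a real reader could equally reject the file or raise an error.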



[jira] [Comment Edited] (PARQUET-757) Bring Parquet logical types to par with Arrow

2016-10-26 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15609566#comment-15609566
 ] 

Julien Le Dem edited comment on PARQUET-757 at 10/26/16 8:26 PM:
-

Those differences came up in https://github.com/apache/parquet-mr/pull/381


was (Author: julienledem):
Those difference came up in https://github.com/apache/parquet-mr/pull/381

> Bring Parquet logical types to par with Arrow
> -
>
> Key: PARQUET-757
> URL: https://issues.apache.org/jira/browse/PARQUET-757
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Julien Le Dem
> Assignee: Julien Le Dem
>
> Missing:
>  - Null
>  - Interval types
>  - Union
>  - half precision float





[jira] [Assigned] (PARQUET-675) Add INTERVAL_YEAR_MONTH and INTERVAL_DAY_TIME types

2016-10-26 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem reassigned PARQUET-675:
-

Assignee: Julien Le Dem

> Add INTERVAL_YEAR_MONTH and INTERVAL_DAY_TIME types
> ---
>
> Key: PARQUET-675
> URL: https://issues.apache.org/jira/browse/PARQUET-675
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Julien Le Dem
> Assignee: Julien Le Dem
>
> For completeness and compatibility with Arrow and SQL types.
> Those are related to the existing INTERVAL type.
> Some references:
>  - https://msdn.microsoft.com/en-us/library/ms716506(v=vs.85).aspx
>  - http://www.techrepublic.com/article/sql-basics-datetime-and-interval-data-types/
>  - https://www.postgresql.org/docs/9.3/static/datatype-datetime.html
>  - https://docs.oracle.com/html/E26088_01/sql_elements001.htm
>  - http://www.ibm.com/support/knowledgecenter/SSGU8G_12.1.0/com.ibm.sqlr.doc/ids_sqr_123.htm
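
As background for why SQL-style systems keep two interval kinds, here is a small Python sketch. The class names are hypothetical and this is not the Parquet specification. The layout follows Arrow's interval units: a year-month interval is a single count of months, and a day-time interval is a pair of days and milliseconds. The two are kept apart because a month has no fixed length in days.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class YearMonthInterval:
    """Hypothetical model: an interval stored as a total number of months."""
    months: int

    @classmethod
    def of(cls, years: int, months: int) -> "YearMonthInterval":
        # Years normalise exactly into months (1 year is always 12 months).
        return cls(years * 12 + months)


@dataclass(frozen=True)
class DayTimeInterval:
    """Hypothetical model: an interval stored as days plus milliseconds."""
    days: int
    millis: int

    @classmethod
    def of(cls, days: int, hours: int = 0, minutes: int = 0,
           seconds: int = 0) -> "DayTimeInterval":
        # Sub-day units normalise exactly into milliseconds.
        millis = ((hours * 60 + minutes) * 60 + seconds) * 1000
        return cls(days, millis)


if __name__ == "__main__":
    print(YearMonthInterval.of(1, 6))        # 18 months
    print(DayTimeInterval.of(2, hours=3))    # 2 days and 10800000 ms
```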





[jira] [Updated] (PARQUET-757) Bring Parquet logical types to par with Arrow

2016-10-26 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-757:
--
Description: 
Missing:
 - Null
 - Interval types
 - Union
 - half precision float


  was:
Missing:
 - Null
 - Interval types
 - Union
 - Short float





[jira] [Created] (PARQUET-757) Bring Parquet logical types to par with Arrow

2016-10-26 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-757:
-

 Summary: Bring Parquet logical types to par with Arrow
 Key: PARQUET-757
 URL: https://issues.apache.org/jira/browse/PARQUET-757
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Julien Le Dem
Assignee: Julien Le Dem


Missing:
 - Null
 - Interval types
 - Union
 - Short float






[jira] [Commented] (PARQUET-723) parquet is not storing the type for the column.

2016-10-26 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15609272#comment-15609272
 ] 

Julien Le Dem commented on PARQUET-723:
---

It looks like a bug/missing feature in Hive.
[~spena] What do you think?



[jira] [Created] (PARQUET-756) Add Union Logical type

2016-10-26 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-756:
-

 Summary: Add Union Logical type
 Key: PARQUET-756
 URL: https://issues.apache.org/jira/browse/PARQUET-756
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Julien Le Dem
Assignee: Julien Le Dem


Add a union type annotation for Group types that represent a Union rather than 
a struct.
Models like Avro or Arrow would make use of it.
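
To illustrate the difference between a group read as a struct and a group read as a union, here is a minimal Python sketch. It rests on one assumption: a union is encoded as a group of optional members of which exactly one is set. The issue does not specify the annotation's semantics, and the member names and the helper active_member are hypothetical.

```python
from typing import Any, Dict, Tuple


def active_member(union_group: Dict[str, Any]) -> Tuple[str, Any]:
    """Return the single non-null member of a union-style group.

    Assumption for this sketch: a group that represents a union has exactly
    one member set, whereas a struct may have any number of fields set.
    """
    set_members = [(name, value) for name, value in union_group.items()
                   if value is not None]
    if len(set_members) != 1:
        raise ValueError(
            f"expected exactly one member set, found {len(set_members)}")
    return set_members[0]


if __name__ == "__main__":
    # Read as a union: only one alternative carries a value.
    print(active_member({"member0": None, "member1": "abc"}))
    # Read as a struct, the same shape could legitimately have both set;
    # under the union reading that is an error.
    try:
        active_member({"member0": 1, "member1": "abc"})
    except ValueError as err:
        print("not a valid union value:", err)
```

Without an annotation, a reader has no way to tell which of the two readings the writer intended, which is the gap the issue describes.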





[jira] [Resolved] (PARQUET-753) GroupType.union() doesn't merge the original type

2016-10-26 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-753.
---
Resolution: Fixed

https://github.com/apache/parquet-mr/commit/e5cd652aeb3305ef2b82a7925cce3a132bf6f5ae

> GroupType.union() doesn't merge the original type
> -
>
> Key: PARQUET-753
> URL: https://issues.apache.org/jira/browse/PARQUET-753
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.8.1
> Reporter: Deneche A. Hakim
>
> When merging two GroupTypes, the union() method doesn't merge their original
> type, which is therefore lost after the union.
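
The behaviour described above can be pictured with a toy model in Python. This is not parquet-mr code: ToyGroup and both union functions are hypothetical, and raising an error on conflicting annotations is an assumption of this sketch rather than a description of the actual fix.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyGroup:
    """Hypothetical stand-in for a group type with an optional annotation."""
    name: str
    original_type: Optional[str] = None   # e.g. "LIST" or "MAP"
    fields: Dict[str, str] = field(default_factory=dict)


def union_dropping_annotation(a: ToyGroup, b: ToyGroup) -> ToyGroup:
    """Merge the fields only; the annotation is lost, as in the report."""
    return ToyGroup(a.name, None, {**a.fields, **b.fields})


def union_keeping_annotation(a: ToyGroup, b: ToyGroup) -> ToyGroup:
    """Merge the fields and carry the annotation over to the result."""
    if a.original_type != b.original_type:
        # Assumption: conflicting annotations are treated as an error here.
        raise ValueError(
            f"cannot merge {a.original_type} with {b.original_type}")
    return ToyGroup(a.name, a.original_type, {**a.fields, **b.fields})


if __name__ == "__main__":
    left = ToyGroup("tags", "LIST", {"a": "int32"})
    right = ToyGroup("tags", "LIST", {"b": "binary"})
    print(union_dropping_annotation(left, right).original_type)  # None
    print(union_keeping_annotation(left, right).original_type)   # LIST
```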


