New Parquet CLI project
Hi everyone,

At the last Parquet sync-up, I mentioned that I've been working on a new Parquet CLI tool (based on Cloudera's Kite CLI). I haven't had a chance to move the build to Maven or get the licensing taken care of for an Apache submission, but it is clean enough that people can start looking at it. I've posted it here:

https://github.com/rdblue/parquet-cli

The build uses Gradle and the jar is run with the hadoop command, like the current tools. It is based on parquet-avro and can convert between Avro, Parquet, CSV, and JSON. It has been a great tool for trying different settings and for inspecting Parquet file metadata and dictionaries.

Please have a look; I'm interested to know if anyone would like this added to the Parquet project.

Thanks!

rb

--
Ryan Blue
Software Engineer
Netflix
[jira] [Resolved] (PARQUET-723) parquet is not storing the type for the column.
[ https://issues.apache.org/jira/browse/PARQUET-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergio Peña resolved PARQUET-723.
---------------------------------
    Resolution: Won't Fix

I reported the issue in the Hive jira: https://issues.apache.org/jira/browse/HIVE-15079

> parquet is not storing the type for the column.
> -----------------------------------------------
>
>         Key: PARQUET-723
>         URL: https://issues.apache.org/jira/browse/PARQUET-723
>     Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>    Reporter: Narasimha
>
> 1. Create Text file format table
> CREATE EXTERNAL TABLE IF NOT EXISTS emp(
>   id INT,
>   first_name STRING,
>   last_name STRING,
>   dateofBirth STRING,
>   join_date INT
> )
> COMMENT 'This is Employee Table Date Of Birth of type String'
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE
> LOCATION '/user/employee/beforePartition';
> 2. Load the data into table
> load data inpath '/user/somupoc_timestamp/employeeData_partitioned.csv' into table emp;
> select * from emp;
> 3. Create Partitioned table with file format as Parquet (dateofBirth STRING)
> create external table emp_afterpartition(
>   id int, first_name STRING, last_name STRING, dateofBirth STRING)
> COMMENT 'Employee partitioned table with dateofBirth of type string'
> partitioned by (join_date int)
> STORED as parquet
> LOCATION '/user/employee/afterpartition';
> 4. Fetch the data from Partitioned column
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> insert overwrite table emp_afterpartition partition (join_date) select * from emp;
> select * from emp_afterpartition;
> 5. Create Partitioned table with file format as Parquet (dateofBirth TIMESTAMP)
> CREATE EXTERNAL TABLE IF NOT EXISTS employee_afterpartition_timestamp_parq(
>   id INT, first_name STRING, last_name STRING, dateofBirth TIMESTAMP)
> COMMENT 'employee partitioned table with dateofBirth of type TIMESTAMP'
> PARTITIONED BY (join_date INT)
> STORED AS PARQUET
> LOCATION '/user/employee/afterpartition';
> select * from employee_afterpartition_timestamp_parq;
> -- 0 records returned
> impala :: alter table employee_afterpartition_timestamp_parq RECOVER PARTITIONS;
> Hive :: MSCK REPAIR TABLE employee_afterpartition_timestamp_parq;
> -- MSCK works in Hive and RECOVER PARTITIONS works in Impala --
> metastore check command with the repair table option:
> select * from employee_afterpartition_timestamp_parq;
> Actual Result :: Failed with exception
> java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException:
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> org.apache.hadoop.hive.serde2.io.TimestampWritable
> Expected Result :: Data should display
> Note: if the file format is text file instead of Parquet, then I am able to fetch the data.
> Observation: Two tables having different column types pointing to the same location (HDFS).
> sample Data
> =
> 1,Joyce,Garza,2016-07-17 14:42:18,201607
> 2,Jerry,Ortiz,2016-08-17 21:36:54,201608
> 3,Steven,Ryan,2016-09-10 01:32:40,201609
> 4,Lisa,Black,2015-10-12 15:05:13,201610
> 5,Jose,Turner,2015-011-10 06:38:40,201611
> 6,Joyce,Garza,2016-08-02,201608
> 7,Jerry,Ortiz,2016-01-01,201601
> 8,Steven,Ryan,2016/08/20,201608
> 9,Lisa,Black,2016/09/12,201609
> 10,Jose,Turner,09/19/2016,201609
> 11,Jose,Turner,20160915,201609

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PARQUET-723) parquet is not storing the type for the column.
[ https://issues.apache.org/jira/browse/PARQUET-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15609684#comment-15609684 ]

Sergio Peña commented on PARQUET-723:
-------------------------------------

The data is stored as STRING (or optional binary) in the first Parquet table. Then we want to read a TIMESTAMP, and Hive expects to use the Timestamp inspector instead of a normal Text inspector. This is not a Parquet bug, but a missing Hive feature similar to the 'auto type widening' we were working on for integer values. See https://issues.apache.org/jira/browse/HIVE-14085

> parquet is not storing the type for the column.
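The widening behavior described in the comment above can be illustrated with a minimal sketch. This is a hypothetical `read_cell` helper in Python, not Hive's actual inspector code: the file physically stores a string, the second table declares a timestamp, and the missing feature would convert instead of cast-failing.

```python
from datetime import datetime

def read_cell(stored_value, declared_type):
    # Hypothetical reader: widen a physically stored STRING into the
    # declared type instead of failing with a cast error (the behavior
    # the comment says Hive lacks here).
    if declared_type == "timestamp" and isinstance(stored_value, str):
        return datetime.strptime(stored_value, "%Y-%m-%d %H:%M:%S")
    if declared_type == "string" and isinstance(stored_value, str):
        return stored_value
    raise TypeError(f"cannot read {type(stored_value).__name__} as {declared_type}")

# The first table physically stores an optional binary (string) ...
stored = "2016-07-17 14:42:18"
# ... while the second table declares the same column as TIMESTAMP.
widened = read_cell(stored, "timestamp")
```

Without the widening branch, this is exactly the Text-to-TimestampWritable cast failure from the stack trace.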
[jira] [Comment Edited] (PARQUET-757) Bring Parquet logical types to par with Arrow
[ https://issues.apache.org/jira/browse/PARQUET-757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15609566#comment-15609566 ]

Julien Le Dem edited comment on PARQUET-757 at 10/26/16 8:26 PM:
----------------------------------------------------------------

Those differences came up in https://github.com/apache/parquet-mr/pull/381

was (Author: julienledem):
Those difference came up in https://github.com/apache/parquet-mr/pull/381

> Bring Parquet logical types to par with Arrow
> ---------------------------------------------
>
>         Key: PARQUET-757
>         URL: https://issues.apache.org/jira/browse/PARQUET-757
>     Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>    Reporter: Julien Le Dem
>    Assignee: Julien Le Dem
>
> Missing:
> - Null
> - Interval types
> - Union
> - half precision float
[jira] [Assigned] (PARQUET-675) Add INTERVAL_YEAR_MONTH and INTERVAL_DAY_TIME types
[ https://issues.apache.org/jira/browse/PARQUET-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem reassigned PARQUET-675:
-------------------------------------

    Assignee: Julien Le Dem

> Add INTERVAL_YEAR_MONTH and INTERVAL_DAY_TIME types
> ---------------------------------------------------
>
>         Key: PARQUET-675
>         URL: https://issues.apache.org/jira/browse/PARQUET-675
>     Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>    Reporter: Julien Le Dem
>    Assignee: Julien Le Dem
>
> For completeness and compatibility with Arrow and SQL types.
> Those are related to the existing INTERVAL type.
> Some references:
> - https://msdn.microsoft.com/en-us/library/ms716506(v=vs.85).aspx
> - http://www.techrepublic.com/article/sql-basics-datetime-and-interval-data-types/
> - https://www.postgresql.org/docs/9.3/static/datatype-datetime.html
> - https://docs.oracle.com/html/E26088_01/sql_elements001.htm
> - http://www.ibm.com/support/knowledgecenter/SSGU8G_12.1.0/com.ibm.sqlr.doc/ids_sqr_123.htm
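For illustration, the reason SQL systems keep these two interval kinds separate can be sketched in Python (hypothetical class names, not the parquet-format representation): a year-month interval has no fixed length in seconds, so it must be carried as a month count, while a day-time interval is an exact duration.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class IntervalYearMonth:
    # INTERVAL YEAR TO MONTH: months vary in length, so the value is a
    # month count rather than a duration in seconds.
    months: int

@dataclass
class IntervalDayTime:
    # INTERVAL DAY TO SECOND: a fixed-length duration.
    delta: timedelta

two_years_three_months = IntervalYearMonth(months=2 * 12 + 3)
ninety_minutes = IntervalDayTime(delta=timedelta(minutes=90))
```

Collapsing both into a single duration type would lose the calendar semantics of the year-month case, which is presumably why Arrow and the SQL dialects linked above model them separately.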
[jira] [Updated] (PARQUET-757) Bring Parquet logical types to par with Arrow
[ https://issues.apache.org/jira/browse/PARQUET-757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem updated PARQUET-757:
----------------------------------

    Description:
Missing:
- Null
- Interval types
- Union
- half precision float

  was:
Missing:
- Null
- Interval types
- Union
- Short float
[jira] [Created] (PARQUET-757) Bring Parquet logical types to par with Arrow
Julien Le Dem created PARQUET-757:
-------------------------------------

    Summary: Bring Parquet logical types to par with Arrow
        Key: PARQUET-757
        URL: https://issues.apache.org/jira/browse/PARQUET-757
    Project: Parquet
 Issue Type: Improvement
 Components: parquet-format
   Reporter: Julien Le Dem
   Assignee: Julien Le Dem

Missing:
- Null
- Interval types
- Union
- Short float
[jira] [Commented] (PARQUET-723) parquet is not storing the type for the column.
[ https://issues.apache.org/jira/browse/PARQUET-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15609272#comment-15609272 ]

Julien Le Dem commented on PARQUET-723:
---------------------------------------

It looks like a bug/missing feature in Hive. [~spena] What do you think?

> parquet is not storing the type for the column.
[jira] [Created] (PARQUET-756) Add Union Logical type
Julien Le Dem created PARQUET-756:
-------------------------------------

    Summary: Add Union Logical type
        Key: PARQUET-756
        URL: https://issues.apache.org/jira/browse/PARQUET-756
    Project: Parquet
 Issue Type: Improvement
 Components: parquet-format
   Reporter: Julien Le Dem
   Assignee: Julien Le Dem

Add a union type annotation for Group types that represent a Union rather than a struct. Models like Avro or Arrow would make use of it.
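One way a union-annotated group might behave, sketched in Python with hypothetical `make_union`/`read_union` helpers (this is an illustration of the Avro-style "one branch populated" invariant, not the actual parquet-format design):

```python
# Hypothetical encoding of a union value as a group with one optional
# field per branch, plus the invariant that exactly one branch is set.
def make_union(branch, value, branches=("int_value", "string_value")):
    if branch not in branches:
        raise ValueError(f"unknown branch {branch!r}")
    record = {name: None for name in branches}
    record[branch] = value
    return record

def read_union(record):
    populated = [(name, v) for name, v in record.items() if v is not None]
    if len(populated) != 1:
        raise ValueError("a union must have exactly one populated branch")
    return populated[0]

tagged = read_union(make_union("int_value", 7))
```

Without an annotation, a reader cannot distinguish such a group from an ordinary struct whose fields happen to be mostly null, which is what the proposed logical type would make explicit.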
[jira] [Resolved] (PARQUET-753) GroupType.union() doesn't merge the original type
[ https://issues.apache.org/jira/browse/PARQUET-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem resolved PARQUET-753.
-----------------------------------
    Resolution: Fixed

https://github.com/apache/parquet-mr/commit/e5cd652aeb3305ef2b82a7925cce3a132bf6f5ae

> GroupType.union() doesn't merge the original type
> -------------------------------------------------
>
>             Key: PARQUET-753
>             URL: https://issues.apache.org/jira/browse/PARQUET-753
>         Project: Parquet
>      Issue Type: Bug
>      Components: parquet-mr
> Affects Versions: 1.8.1
>        Reporter: Deneche A. Hakim
>
> When merging two GroupTypes, the union() method doesn't merge their original type, which is lost after the union.
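A toy model of the fixed behavior, using a hypothetical `merge_fields` helper over plain dicts (not the real parquet-mr GroupType API): the bug was that a merge kept name and repetition but silently dropped the original-type annotation.

```python
# Toy model of the fix: merging two field descriptors must carry the
# original-type (logical) annotation along, not just name and repetition.
def merge_fields(a, b):
    if a["name"] != b["name"]:
        raise ValueError("cannot merge differently named fields")
    merged = {"name": a["name"], "repetition": a["repetition"]}
    # The fix: preserve the original type instead of dropping it.
    if a.get("original_type") == b.get("original_type"):
        merged["original_type"] = a.get("original_type")
    return merged

left = {"name": "tags", "repetition": "optional", "original_type": "UTF8"}
right = {"name": "tags", "repetition": "optional", "original_type": "UTF8"}
merged = merge_fields(left, right)
```

Before the linked commit, the equivalent of `merged` would have lacked the annotation, so a binary column annotated as UTF8 would come out of the union as a bare binary.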