[jira] [Resolved] (PARQUET-724) Test more advanced properties setting

2016-09-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-724.
--
   Resolution: Fixed
Fix Version/s: cpp-0.1

Issue resolved by pull request 166
[https://github.com/apache/parquet-cpp/pull/166]

> Test more advanced properties setting
> -
>
> Key: PARQUET-724
> URL: https://issues.apache.org/jira/browse/PARQUET-724
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
> Fix For: cpp-0.1
>
>
> Test that the handling of global and column-specific properties behaves
> correctly.
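
As a rough illustration of the behaviour such a test would exercise, here is a minimal Python sketch of a property lookup in which a column-specific setting overrides a global default. The class name, method names, and codec strings are hypothetical and are not taken from the parquet-cpp API.

```python
class WriterPropertiesSketch:
    """Hypothetical stand-in for a writer-properties object (not the parquet-cpp API)."""

    def __init__(self, default_compression="UNCOMPRESSED"):
        # Global setting used for every column without an override.
        self._default_compression = default_compression
        # Column-specific overrides, keyed by column path.
        self._column_compression = {}

    def set_compression(self, codec, column_path=None):
        """Set the global codec, or the codec for one column if a path is given."""
        if column_path is None:
            self._default_compression = codec
        else:
            self._column_compression[column_path] = codec
        return self  # allow chained calls

    def compression(self, column_path):
        """Return the column-specific codec, falling back to the global one."""
        return self._column_compression.get(column_path, self._default_compression)


if __name__ == "__main__":
    props = WriterPropertiesSketch().set_compression("SNAPPY")
    props.set_compression("GZIP", column_path="a.b")
    print(props.compression("a.b"), props.compression("c"))
```

A test along these lines would check both paths: a column with an override returns its own value, and every other column returns the global one.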



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (PARQUET-722) Building with JDK 8 fails over a maven bug

2016-09-21 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem closed PARQUET-722.
-

> Building with JDK 8 fails over a maven bug
> --
>
> Key: PARQUET-722
> URL: https://issues.apache.org/jira/browse/PARQUET-722
> Project: Parquet
>  Issue Type: Bug
>Reporter: Niels Basjes
>
> When I build parquet on my system I get this error during the build:
> {quote}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (default) 
> on project parquet-generator: Error rendering velocity resource. 
> NullPointerException -> [Help 1]
> {quote}
> About a year ago [~julienledem] responded that this is caused by a bug in
> Maven in combination with Java 8:
> At this page 
> http://stackoverflow.com/questions/31229445/build-failure-apache-parquet-mr-source-mvn-install-failure/33360512#33360512
>  
> Now this bug has been solved at the Maven end in maven-filtering 1.2
> https://issues.apache.org/jira/browse/MSHARED-319
> The problem is that this fix has not yet been integrated into the latest
> available Maven versions.
> I'll put up a pull request with a proposed fix for this.





[jira] [Commented] (PARQUET-722) Building with JDK 8 fails over a maven bug

2016-09-21 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511173#comment-15511173
 ] 

Julien Le Dem commented on PARQUET-722:
---

Thanks for spending the time!

> Building with JDK 8 fails over a maven bug
> --
>
> Key: PARQUET-722
> URL: https://issues.apache.org/jira/browse/PARQUET-722
> Project: Parquet
>  Issue Type: Bug
>Reporter: Niels Basjes
>
> When I build parquet on my system I get this error during the build:
> {quote}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (default) 
> on project parquet-generator: Error rendering velocity resource. 
> NullPointerException -> [Help 1]
> {quote}
> About a year ago [~julienledem] responded that this is caused by a bug in
> Maven in combination with Java 8:
> At this page 
> http://stackoverflow.com/questions/31229445/build-failure-apache-parquet-mr-source-mvn-install-failure/33360512#33360512
>  
> Now this bug has been solved at the Maven end in maven-filtering 1.2
> https://issues.apache.org/jira/browse/MSHARED-319
> The problem is that this fix has not yet been integrated into the latest
> available Maven versions.
> I'll put up a pull request with a proposed fix for this.





Re: Python Parquet package

2016-09-21 Thread Wes McKinney
Sure, I'm happy to do that. Do you want me to take care of refactoring
to account for the arrow::io API changes I just made? Then we can go
ahead and remove arrow/parquet from the Arrow project.

On Wed, Sep 21, 2016 at 3:47 PM, Uwe Korn  wrote:
> Sounds reasonable to me. I will then continue to implement the missing
> interfaces for Parquet in pyarrow.parquet.
>
> @wesm Can you make sure that we can easily depend on a pinned version of
> parquet-cpp in pyarrow’s Travis builds?
>
> Uwe
>
>> On 21.09.2016 at 20:07, Wes McKinney wrote:
>>
>> I don't agree with this approach right now. Here are my reasons:
>>
>> 1. The Parquet Python integration will need to depend both on PyArrow
>> and the Arrow C++ libraries, so these libraries would generally need
>> to be developed together
>>
>> 2. PyArrow would need to define and maintain a C++ or Cython API so
>> that the equivalent of the current pyarrow.parquet library can access
>> C-level data. For example:
>>
>> https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.pyx#L31
>>
>> Cython does permit cross-project C API access (we are already doing
>> cross-module Cython API access within pyarrow). This adds additional
>> complexity that I think we should avoid for now.
>>
>> 3. Maintaining a separate C++ build toolchain for a Python package
>> adds additional maintenance and packaging burden on us
>>
>> My inclination is to keep the code where it is and make the Parquet
>> extension optional.
>>
>> - Wes
>>
>> On Wed, Sep 21, 2016 at 10:16 AM, Uwe Korn  wrote:
>>> Hello,
>>>
>>> as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, we
>>> still have to decide on how we are going to proceed with the Arrow<->Parquet
>>> Python integration. For the moment, it seems that the best way to go ahead
>>> is to pull the pyarrow.parquet module out into a separate Python package.
>>> From an organisational point of view, I'm unclear how I should proceed here. Should
>>> we put this in a separate repo? If so, as part of the Apache organisation?
>>>
>>> Uwe
>


[jira] [Updated] (PARQUET-724) Test more advanced properties setting

2016-09-21 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-724:

Description: Test that the handling of global and column-specific properties
behaves correctly.

> Test more advanced properties setting
> -
>
> Key: PARQUET-724
> URL: https://issues.apache.org/jira/browse/PARQUET-724
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>
> Test that the handling of global and column-specific properties behaves
> correctly.





[jira] [Commented] (PARQUET-724) Test more advanced properties setting

2016-09-21 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511056#comment-15511056
 ] 

Uwe L. Korn commented on PARQUET-724:
-

https://github.com/apache/parquet-cpp/pull/166

> Test more advanced properties setting
> -
>
> Key: PARQUET-724
> URL: https://issues.apache.org/jira/browse/PARQUET-724
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>






[jira] [Created] (PARQUET-724) Test more advanced properties setting

2016-09-21 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created PARQUET-724:
---

 Summary: Test more advanced properties setting
 Key: PARQUET-724
 URL: https://issues.apache.org/jira/browse/PARQUET-724
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn








Re: Python Parquet package

2016-09-21 Thread Uwe Korn
Sounds reasonable to me. I will then continue to implement the missing
interfaces for Parquet in pyarrow.parquet.

@wesm Can you make sure that we can easily depend on a pinned version of
parquet-cpp in pyarrow’s Travis builds?

Uwe

> On 21.09.2016 at 20:07, Wes McKinney wrote:
> 
> I don't agree with this approach right now. Here are my reasons:
> 
> 1. The Parquet Python integration will need to depend both on PyArrow
> and the Arrow C++ libraries, so these libraries would generally need
> to be developed together
> 
> 2. PyArrow would need to define and maintain a C++ or Cython API so
> that the equivalent of the current pyarrow.parquet library can access
> C-level data. For example:
> 
> https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.pyx#L31
> 
> Cython does permit cross-project C API access (we are already doing
> cross-module Cython API access within pyarrow). This adds additional
> complexity that I think we should avoid for now.
> 
> 3. Maintaining a separate C++ build toolchain for a Python package
> adds additional maintenance and packaging burden on us
> 
> My inclination is to keep the code where it is and make the Parquet
> extension optional.
> 
> - Wes
> 
> On Wed, Sep 21, 2016 at 10:16 AM, Uwe Korn  wrote:
>> Hello,
>> 
>> as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, we
>> still have to decide on how we are going to proceed with the Arrow<->Parquet
>> Python integration. For the moment, it seems that the best way to go ahead
>> is to pull the pyarrow.parquet module out into a separate Python package.
>> From an organisational point of view, I'm unclear how I should proceed here. Should
>> we put this in a separate repo? If so, as part of the Apache organisation?
>> 
>> Uwe



Re: Python Parquet package

2016-09-21 Thread Wes McKinney
I don't agree with this approach right now. Here are my reasons:

1. The Parquet Python integration will need to depend both on PyArrow
and the Arrow C++ libraries, so these libraries would generally need
to be developed together

2. PyArrow would need to define and maintain a C++ or Cython API so
that the equivalent of the current pyarrow.parquet library can access
C-level data. For example:

https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.pyx#L31

Cython does permit cross-project C API access (we are already doing
cross-module Cython API access within pyarrow). This adds additional
complexity that I think we should avoid for now.

3. Maintaining a separate C++ build toolchain for a Python package
adds additional maintenance and packaging burden on us

My inclination is to keep the code where it is and make the Parquet
extension optional.

- Wes
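
For illustration only, one way the "optional extension" idea can look from the Python side is an import that degrades gracefully when the extension was not built. This is a minimal sketch, not the approach pyarrow actually takes: the helper names are hypothetical, and whether `pyarrow.parquet` is importable depends entirely on how pyarrow was built and installed.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Covers a missing package as well as a missing optional submodule.
        return None


def require(module_name, feature):
    """Import a module that backs an optional feature, with a clear error if absent."""
    module = optional_import(module_name)
    if module is None:
        raise ImportError(
            "{} support is not available: could not import {}".format(feature, module_name)
        )
    return module


if __name__ == "__main__":
    # "pyarrow.parquet" is the module named in this thread; it may be absent.
    parquet = optional_import("pyarrow.parquet")
    print("Parquet support available:", parquet is not None)
```

This only covers the import side. Making the extension optional at build time (for example, skipping the Cython module when parquet-cpp is not found) is a separate question that the sketch does not address.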

On Wed, Sep 21, 2016 at 10:16 AM, Uwe Korn  wrote:
> Hello,
>
> as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, we
> still have to decide on how we are going to proceed with the Arrow<->Parquet
> Python integration. For the moment, it seems that the best way to go ahead
> is to pull the pyarrow.parquet module out into a separate Python package.
> From an organisational point of view, I'm unclear how I should proceed here. Should
> we put this in a separate repo? If so, as part of the Apache organisation?
>
> Uwe


[jira] [Created] (PARQUET-723) parquet is not storing the type for the column.

2016-09-21 Thread Narasimha (JIRA)
Narasimha created PARQUET-723:
-

 Summary: parquet is not storing the type for the column.
 Key: PARQUET-723
 URL: https://issues.apache.org/jira/browse/PARQUET-723
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Reporter: Narasimha


1. Create a table with text file format
CREATE EXTERNAL TABLE IF NOT EXISTS emp(
id INT,
first_name STRING,
last_name STRING,
dateofBirth STRING,
join_date INT
)
COMMENT 'This is Employee Table Date Of Birth of type String'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/employee/beforePartition';

2. Load the data into the table
load data inpath '/user/somupoc_timestamp/employeeData_partitioned.csv' 
into table emp;
select * from emp;

3. Create a partitioned table with Parquet as the file format (dateofBirth STRING)

create external table emp_afterpartition(
id int, first_name STRING, last_name STRING, dateofBirth STRING)
COMMENT 'Employee partitioned table with dateofBirth of type string'
partitioned by (join_date int)
STORED as parquet
LOCATION '/user/employee/afterpartition';

4. Load the data into the partitioned table and fetch it

set hive.exec.dynamic.partition=true;  
set hive.exec.dynamic.partition.mode=nonstrict; 
insert overwrite table emp_afterpartition partition (join_date) select 
* from emp;
select * from emp_afterpartition;

5. Create a partitioned table with Parquet as the file format (dateofBirth TIMESTAMP)

CREATE EXTERNAL TABLE IF NOT EXISTS 
employee_afterpartition_timestamp_parq(
id INT,first_name STRING,last_name STRING,dateofBirth TIMESTAMP)
COMMENT 'employee partitioned table with dateofBirth of type TIMESTAMP'
PARTITIONED BY (join_date INT)
STORED AS PARQUET
LOCATION '/user/employee/afterpartition';

select * from employee_afterpartition_timestamp_parq;
-- 0 records returned

-- Recover the partitions. MSCK REPAIR TABLE (the metastore check command with
-- the repair table option) works in Hive; RECOVER PARTITIONS works in Impala.
Impala :: alter table employee_afterpartition_timestamp_parq RECOVER PARTITIONS;
Hive   :: MSCK REPAIR TABLE employee_afterpartition_timestamp_parq;

select * from employee_afterpartition_timestamp_parq;

Actual Result :: Failed with exception 
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to 
org.apache.hadoop.hive.serde2.io.TimestampWritable

Expected Result :: Data should display

Note: If the file format is text file instead of Parquet, I am able to fetch
the data.
Observation: Two tables with different column types point to the same HDFS
location.

sample Data
=

1,Joyce,Garza,2016-07-17 14:42:18,201607
2,Jerry,Ortiz,2016-08-17 21:36:54,201608
3,Steven,Ryan,2016-09-10 01:32:40,201609
4,Lisa,Black,2015-10-12 15:05:13,201610
5,Jose,Turner,2015-011-10 06:38:40,201611
6,Joyce,Garza,2016-08-02,201608
7,Jerry,Ortiz,2016-01-01,201601
8,Steven,Ryan,2016/08/20,201608
9,Lisa,Black,2016/09/12,201609
10,Jose,Turner,09/19/2016,201609
11,Jose,Turner,20160915,201609
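
As a side note on the sample data, the dateofBirth values use several different formats. The Python sketch below checks which rows match a full `yyyy-MM-dd HH:mm:ss` pattern. It is only an illustration: Python's `strptime` stands in for Hive's timestamp parsing, which may accept or reject slightly different inputs, and the function names are made up for this example.

```python
from datetime import datetime

# Rows copied verbatim from the sample data above:
# id, first_name, last_name, dateofBirth, join_date
SAMPLE_ROWS = """\
1,Joyce,Garza,2016-07-17 14:42:18,201607
2,Jerry,Ortiz,2016-08-17 21:36:54,201608
3,Steven,Ryan,2016-09-10 01:32:40,201609
4,Lisa,Black,2015-10-12 15:05:13,201610
5,Jose,Turner,2015-011-10 06:38:40,201611
6,Joyce,Garza,2016-08-02,201608
7,Jerry,Ortiz,2016-01-01,201601
8,Steven,Ryan,2016/08/20,201608
9,Lisa,Black,2016/09/12,201609
10,Jose,Turner,09/19/2016,201609
11,Jose,Turner,20160915,201609
"""


def parse_timestamp_or_none(text):
    """Parse 'yyyy-MM-dd HH:mm:ss'; return None when the text does not match."""
    try:
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None


def split_rows(raw):
    """Split the comma-delimited sample into lists of fields, skipping blank lines."""
    return [line.split(",") for line in raw.splitlines() if line.strip()]


def ids_with_full_timestamp(raw):
    """Return the ids of rows whose dateofBirth field parses as a full timestamp."""
    return [
        int(fields[0])
        for fields in split_rows(raw)
        if parse_timestamp_or_none(fields[3]) is not None
    ]


if __name__ == "__main__":
    print(ids_with_full_timestamp(SAMPLE_ROWS))
```

Note that this does not reproduce the ClassCastException itself. That error appears when files written with dateofBirth as a string are read through a table that declares the column as TIMESTAMP; a Parquet file records each column's type in its own metadata.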







Python Parquet package

2016-09-21 Thread Uwe Korn

Hello,

as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, 
we still have to decide on how we are going to proceed with the 
Arrow<->Parquet Python integration. For the moment, it seems that the 
best way to go ahead is to pull the pyarrow.parquet module out into a 
separate Python package. From an organisational point of view, I'm unclear how I
should proceed here. Should we put this in a separate repo? If so, as 
part of the Apache organisation?


Uwe


[jira] [Commented] (PARQUET-721) Performance benchmarks for reading into Arrow structures

2016-09-21 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509983#comment-15509983
 ] 

Uwe L. Korn commented on PARQUET-721:
-

PR: https://github.com/apache/parquet-cpp/pull/165

> Performance benchmarks for reading into Arrow structures
> 
>
> Key: PARQUET-721
> URL: https://issues.apache.org/jira/browse/PARQUET-721
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>
> Simple benchmarks that show per column and repetition type how fast we can 
> read into Arrow memory.





[jira] [Created] (PARQUET-722) Building with JDK 8 fails over a maven bug

2016-09-21 Thread Niels Basjes (JIRA)
Niels Basjes created PARQUET-722:


 Summary: Building with JDK 8 fails over a maven bug
 Key: PARQUET-722
 URL: https://issues.apache.org/jira/browse/PARQUET-722
 Project: Parquet
  Issue Type: Bug
Reporter: Niels Basjes


When I build parquet on my system I get this error during the build:
{quote}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (default) on 
project parquet-generator: Error rendering velocity resource. 
NullPointerException -> [Help 1]
{quote}

About a year ago [~julienledem] responded that this is caused by a bug in
Maven in combination with Java 8:

At this page 
http://stackoverflow.com/questions/31229445/build-failure-apache-parquet-mr-source-mvn-install-failure/33360512#33360512
 

Now this bug has been solved at the Maven end in maven-filtering 1.2
https://issues.apache.org/jira/browse/MSHARED-319

The problem is that this fix has not yet been integrated into the latest
available Maven versions.

I'll put up a pull request with a proposed fix for this.





