[jira] [Commented] (PARQUET-1830) Vectorized API to support Column Index in Apache Spark

2020-03-30 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071288#comment-17071288
 ] 

Felix Kizhakkel Jose commented on PARQUET-1830:
---

(y)

> Vectorized API to support Column Index in Apache Spark
> --
>
> Key: PARQUET-1830
> URL: https://issues.apache.org/jira/browse/PARQUET-1830
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> As per the comment on https://issues.apache.org/jira/browse/SPARK-26345, it 
> seems that Apache Spark doesn't support Column Index unless we disable the 
> vectorizedReader in Spark - which has other performance implications. As per 
> [~zi], parquet-mr should implement a Vectorized API. Is it already 
> implemented, or is there a pull request for the same?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1830) Vectorized API to support Column Index in Apache Spark

2020-03-30 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070975#comment-17070975
 ] 

Felix Kizhakkel Jose commented on PARQUET-1830:
---

Yes, for the short term, option 1. But this Jira is for the long-term solution, 
so I want it to stay open until we have a Vectorized API. 
I will update the Spark Jira SPARK-26345 with the short-term solution.
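
For reference, the short-term workaround (option 1) amounts to turning off 
Spark's vectorized Parquet reader so that the row-by-row parquet-mr read path, 
which can honor the column indexes, is used instead. A minimal sketch in Java, 
assuming a Spark 2.4+ session whose bundled parquet-mr has column-index 
support; the session setup and data path are illustrative only:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DisableVectorizedReader {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("column-index-workaround")
                .master("local[*]")
                .getOrCreate();

        // Fall back to the non-vectorized parquet-mr read path so that
        // page-level (column index) filtering can take effect.
        spark.conf().set("spark.sql.parquet.enableVectorizedReader", "false");

        Dataset<Row> df = spark.read().parquet("s3a://parquetpoc/data.parquet");
        df.filter("age = 20").show(); // selective filters benefit most
    }
}

The trade-off is exactly the one noted above: the non-vectorized path is slower 
per row, so whether this is a net win depends on how selective the filters are.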




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1830) Vectorized API to support Column Index in Apache Spark

2020-03-30 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070954#comment-17070954
 ] 

Felix Kizhakkel Jose commented on PARQUET-1830:
---

[~gszadovszky] This Jira is for the long-term solution. But do we have any Jira 
for a short-term solution, since that could benefit many who are using Parquet 
+ Spark?




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1830) Vectorized API to support Column Index in Apache Spark

2020-03-27 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068707#comment-17068707
 ] 

Felix Kizhakkel Jose commented on PARQUET-1830:
---

Thank you, [~gszadovszky].

IMHO, I prefer option 1 as a short-term workaround. It could benefit a lot of 
people through the significant performance improvement the Offset and Column 
Indexes bring.

For the long term, in addition to option 2, make it a Vectorized API (see the 
sketch below) and coordinate with the Spark team to integrate the work, 
avoiding - or at least minimizing - logic that spills outside the core library 
(parquet-mr).
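
To make the long-term request concrete, here is a purely hypothetical sketch of 
the shape a batch-oriented read API could take. None of these type names exist 
in parquet-mr; this is illustration only, not the proposal from this thread:

// Hypothetical types (VectorizedParquetReader and IntColumnVector are NOT
// real parquet-mr classes). "Vectorized" here means the reader fills whole
// column batches instead of assembling one record at a time, while still
// skipping pages that the column index rules out.
public interface VectorizedParquetReader extends AutoCloseable {
    // Fills 'out' with up to out.capacity() values from the next pages that
    // survive column-index filtering; returns the number of rows read, or 0
    // at end of input.
    int readBatch(IntColumnVector out) throws java.io.IOException;
}

final class IntColumnVector {
    final int[] values;
    final boolean[] isNull;

    IntColumnVector(int capacity) {
        values = new int[capacity];
        isNull = new boolean[capacity];
    }

    int capacity() {
        return values.length;
    }
}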




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1830) Vectorized API to support Column Index in Apache Spark

2020-03-26 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated PARQUET-1830:
--
  Component/s: parquet-mr
Affects Version/s: 1.11.0
  Description: As per the comment on 
https://issues.apache.org/jira/browse/SPARK-26345, it seems that Apache Spark 
doesn't support Column Index unless we disable the vectorizedReader in Spark - 
which has other performance implications. As per [~zi], parquet-mr should 
implement a Vectorized API. Is it already implemented, or is there a pull 
request for the same?
  Summary: Vectorized API to support Column Index in Apache Spark  
(was: Vectorized API to supprt)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1830) Vectorized API to supprt

2020-03-26 Thread Felix Kizhakkel Jose (Jira)
Felix Kizhakkel Jose created PARQUET-1830:
-

 Summary: Vectorized API to supprt
 Key: PARQUET-1830
 URL: https://issues.apache.org/jira/browse/PARQUET-1830
 Project: Parquet
  Issue Type: New Feature
Reporter: Felix Kizhakkel Jose






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-11-05 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967620#comment-16967620
 ] 

Felix Kizhakkel Jose commented on PARQUET-1679:
---

Hi [~q.xu],
Thank you so much. Do you know whether there is any converter (a Parquet 
converter) instead of the AvroConverter? I couldn't find one. Could you please 
provide some insights?
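
For anyone landing here from a search: the closest thing in parquet-mr to 
writing Parquet without going through Avro is the low-level Group API from the 
parquet-hadoop example module. A minimal sketch, with a hand-written schema 
standing in for the reflected one (file name and fields are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class GroupWriterSketch {
    public static void main(String[] args) throws Exception {
        // The schema is declared by hand instead of reflected from a POJO,
        // so the UUID column can simply be modeled as a string.
        MessageType schema = MessageTypeParser.parseMessageType(
                "message Employee { "
                        + "required binary id (UTF8); "
                        + "required binary name (UTF8); "
                        + "required int32 age; }");

        try (ParquetWriter<Group> writer =
                ExampleParquetWriter.builder(new Path("employees.parquet"))
                        .withType(schema)
                        .withConf(new Configuration())
                        .build()) {
            SimpleGroupFactory groups = new SimpleGroupFactory(schema);
            Group group = groups.newGroup()
                    .append("id", java.util.UUID.randomUUID().toString())
                    .append("name", "Test0")
                    .append("age", 20);
            writer.write(group);
        }
    }
}

The cost of this route is that the POJO-to-Group mapping has to be written by 
hand; there is no reflection-based converter on this path.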

> Invalid SchemaException for UUID while using AvroParquetWriter
> --
>
> Key: PARQUET-1679
> URL: https://issues.apache.org/jira/browse/PARQUET-1679
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.10.1
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hi,
> I am getting org.apache.parquet.schema.InvalidSchemaException: Cannot write a 
> schema with an empty group: optional group id {} when I include a UUID field 
> in my POJO. Without the UUID everything works fine. I have seen that Parquet 
> supports UUID as part of [#PR-71] in the 2.4 release, but I am still getting 
> the InvalidSchemaException for UUID. Is there anything I am missing, or is it 
> a known issue?
> *My setup details:*
> *gradle dependency:*
> dependencies {
>     compile group: 'org.springframework.boot', name: 'spring-boot-starter'
>     compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
>     compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
>     compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.1'
>     compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
>     compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
>     compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
>     compile group: 'joda-time', name: 'joda-time'
>     compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
>     compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
> }
> *Model used:*
> @Data
> public class Employee {
>     private UUID id;
>     private String name;
>     private int age;
>     private Address address;
> }
> @Data
> public class Address {
>     private String streetName;
>     private String city;
>     private Zip zip;
> }
> @Data
> public class Zip {
>     private int zip;
>     private int ext;
> }
> *My Serializer Code:*
> public void serialize(List<D> inputDataToSerialize, CompressionCodecName 
> compressionCodecName) throws IOException {
>     Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
>     Class<?> clazz = inputDataToSerialize.get(0).getClass();
>     try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path)
>             .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
>             .withDataModel(ReflectData.get())
>             .withConf(parquetConfiguration)
>             .withCompressionCodec(compressionCodecName)
>             .withWriteMode(OVERWRITE)
>             .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>             .build()) {
>         for (D input : inputDataToSerialize) {
>             writer.write(input);
>         }
>     }
> }
> private List<Employee> getInputDataToSerialize() {
>     Address address = new Address();
>     address.setStreetName("Murry Ridge Dr");
>     address.setCity("Murrysville");
>     Zip zip = new Zip();
>     zip.setZip(15668);
>     zip.setExt(1234);
>     address.setZip(zip);
>     List<Employee> employees = new ArrayList<>();
>     IntStream.range(0, 10).forEach(i -> {
>         Employee employee = new Employee();
>         // employee.setId(UUID.randomUUID());
>         employee.setAge(20);
>         employee.setName("Test" + i);
>         employee.setAddress(address);
>         employees.add(employee);
>     });
>     return employees;
> }
> _Where the generic type D is Employee._



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-28 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961429#comment-16961429
 ] 

Felix Kizhakkel Jose commented on PARQUET-1679:
---

[~q.xu] what do you think?




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-21 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956249#comment-16956249
 ] 

Felix Kizhakkel Jose edited comment on PARQUET-1679 at 10/21/19 4:39 PM:
-

When you mention "You could just use a `byte[]` instead of `UUID` for your 
`id` field", do you mean I have to change the type of the id field in the 
model to byte[] instead of UUID? For example:

@Data
public class Employee {
    private byte[] id;
    private String name;
    private int age;
    private Address address;
}

instead of:

@Data
public class Employee {
    private UUID id;
    private String name;
    private int age;
    private Address address;
}

If yes, then updating the model is not a viable solution for me, since this 
model is used by other consumers as well.


was (Author: felixkjose):
When you mention "You could just use a `byte[]` instead of `UUID` for your 
`id` field", do you mean I have to change the type of the id field in the 
model to byte[] instead of UUID? For example:

@Data
public class Employee {
    private *byte[] id*;
    private String name;
    private int age;
    private Address address;
}

instead of:

@Data
public class Employee {
    private *UUID id*;
    private String name;
    private int age;
    private Address address;
}
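
Since the shared model cannot change, one alternative (an editorial sketch, 
not something proposed in this thread) is to keep UUID in the shared POJO and 
convert only at the serialization boundary, e.g. in a writer-side DTO that 
mirrors Employee but declares byte[] id. The conversion itself is plain JDK:

import java.nio.ByteBuffer;
import java.util.UUID;

public final class UuidBytes {
    private UuidBytes() {}

    // Encodes a UUID as its 16-byte big-endian form.
    public static byte[] toBytes(UUID uuid) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(uuid.getMostSignificantBits());
        buf.putLong(uuid.getLeastSignificantBits());
        return buf.array();
    }

    public static UUID fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        return new UUID(buf.getLong(), buf.getLong());
    }
}

The writer-side DTO would call UuidBytes.toBytes(employee.getId()) when 
copying fields, and readers reverse it with fromBytes, so other consumers of 
the model are unaffected.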




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-21 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956249#comment-16956249
 ] 

Felix Kizhakkel Jose edited comment on PARQUET-1679 at 10/21/19 4:38 PM:
-

When you mention "You could just use a `byte[]` instead of `UUID` for your 
`id` field", do you mean I have to change the type of the id field in the 
model to byte[] instead of UUID? For example:

@Data
public class Employee {
    private *byte[] id*;
    private String name;
    private int age;
    private Address address;
}

instead of:

@Data
public class Employee {
    private *UUID id*;
    private String name;
    private int age;
    private Address address;
}


was (Author: felixkjose):
When you mention "You could just use a `byte[]` instead of `UUID` for your 
`id` field", do you mean I have to change the type of the id field in the 
model to byte[] instead of UUID? For example:

@Data
public class Employee {
    private *byte[] id*;
    private String name;
    private int age;
    private Address address;
}

instead of:

@Data
public class Employee {
    private _*UUID id*_;
    private String name;
    private int age;
    private Address address;
}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-21 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956249#comment-16956249
 ] 

Felix Kizhakkel Jose edited comment on PARQUET-1679 at 10/21/19 4:37 PM:
-

When you mention "You could just use a `byte[]` instead of `UUID` for your 
`id` field", do you mean I have to change the type of the id field in the 
model to byte[] instead of UUID? For example:

@Data
public class Employee {
    private *byte[] id*;
    private String name;
    private int age;
    private Address address;
}

instead of:

@Data
public class Employee {
    private _*UUID id*_;
    private String name;
    private int age;
    private Address address;
}


was (Author: felixkjose):
When you mention "You could just use a `byte[]` instead of `UUID` for your 
`id` field", do you mean I have to change the type of the id field in the 
model to byte[] instead of UUID? For example:

@Data
public class Employee {
    private *byte[]* id;
    private String name;
    private int age;
    private Address address;
}

instead of:

@Data
public class Employee {
    private *UUID* id;
    private String name;
    private int age;
    private Address address;
}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-21 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956249#comment-16956249
 ] 

Felix Kizhakkel Jose edited comment on PARQUET-1679 at 10/21/19 4:37 PM:
-

When you mention "You could just use a `byte[]` instead of `UUID` for your 
`id` field", do you mean I have to change the type of the id field in the 
model to byte[] instead of UUID? For example:

@Data
public class Employee {
    private *byte[]* id;
    private String name;
    private int age;
    private Address address;
}

instead of:

@Data
public class Employee {
    private *UUID* id;
    private String name;
    private int age;
    private Address address;
}


was (Author: felixkjose):
When you mention "You could just use a `byte[]` instead of `UUID` for your 
`id` field", do you mean I have to change the type of the id field in the 
model to byte[] instead of UUID? For example:

@Data
public class Employee {
    private *byte[] id*;
    private String name;
    private int age;
    private Address address;
}

instead of:

@Data
public class Employee {
    private *UUID id*;
    private String name;
    private int age;
    private Address address;
}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-21 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956249#comment-16956249
 ] 

Felix Kizhakkel Jose commented on PARQUET-1679:
---

When you mention "You could just use a `byte[]` instead of `UUID` for your 
`id` field", do you mean I have to change the type of the id field in the 
model to byte[] instead of UUID? For example:

@Data
public class Employee {
    private *byte[] id*;
    private String name;
    private int age;
    private Address address;
}

instead of:

@Data
public class Employee {
    private *UUID id*;
    private String name;
    private int age;
    private Address address;
}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1680) Parquet Java Serialization is very slow

2019-10-21 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated PARQUET-1680:
--
Component/s: parquet-mr

> Parquet Java Serialization is very slow
> 
>
> Key: PARQUET-1680
> URL: https://issues.apache.org/jira/browse/PARQUET-1680
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro, parquet-mr
>Affects Versions: 1.10.1
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hi,
> I am doing a POC to compare different data formats and their performance in 
> terms of serialization/deserialization speed, storage size, compatibility 
> between different languages, etc.
> When I try to serialize a simple Java object to a Parquet file, it takes 
> *6-7 seconds*, whereas serializing the same object to JSON takes 
> *100 milliseconds*.
> Could you help me to resolve this issue?
> *My Configuration and code snippet:*
> *Gradle dependencies*
> dependencies {
>     compile group: 'org.springframework.boot', name: 'spring-boot-starter'
>     compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
>     compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
>     compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
>     compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
>     compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
>     compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
>     compile group: 'joda-time', name: 'joda-time'
>     compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
>     compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
> }
> *Code snippet:*
> public void serialize(List<D> inputDataToSerialize, CompressionCodecName 
> compressionCodecName) throws IOException {
>     Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
>     Path path1 = new Path("/Downloads/data_" + compressionCodecName + ".parquet");
>     Class<?> clazz = inputDataToSerialize.get(0).getClass();
>     try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
>             .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
>             .withDataModel(ReflectData.get())
>             .withConf(parquetConfiguration)
>             .withCompressionCodec(compressionCodecName)
>             .withWriteMode(OVERWRITE)
>             .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>             .build()) {
>         for (D input : inputDataToSerialize) {
>             writer.write(input);
>         }
>     }
> }
> *Model Used:*
> @Data
> public class Employee {
>     // private UUID id;
>     private String name;
>     private int age;
>     private Address address;
> }
> @Data
> public class Address {
>     private String streetName;
>     private String city;
>     private Zip zip;
> }
> @Data
> public class Zip {
>     private int zip;
>     private int ext;
> }
> private List<Employee> getInputDataToSerialize() {
>     Address address = new Address();
>     address.setStreetName("Murry Ridge Dr");
>     address.setCity("Murrysville");
>     Zip zip = new Zip();
>     zip.setZip(15668);
>     zip.setExt(1234);
>     address.setZip(zip);
>     List<Employee> employees = new ArrayList<>();
>     IntStream.range(0, 10).forEach(i -> {
>         Employee employee = new Employee();
>         // employee.setId(UUID.randomUUID());
>         employee.setAge(20);
>         employee.setName("Test" + i);
>         employee.setAddress(address);
>         employees.add(employee);
>     });
>     return employees;
> }
> *Note:*
> *I have tried saving the data to the local file system as well as to AWS S3; 
> both give the same result - very slow.*
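
One thing worth checking before concluding that Parquet serialization itself 
is slow (an editorial suggestion, not from this thread): for 10 records, the 
one-time cost of constructing the writer (Hadoop Configuration and filesystem 
setup, schema reflection) can easily dwarf the actual write time. A sketch 
that times the two phases separately, reusing the Employee model above:

import java.util.List;
import org.apache.avro.reflect.ReflectData;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class WriteTiming {
    public static void time(List<Employee> employees) throws Exception {
        long t0 = System.nanoTime();
        ParquetWriter<Employee> writer = AvroParquetWriter.<Employee>builder(
                        new Path("/tmp/data_timing.parquet"))
                .withSchema(ReflectData.AllowNull.get().getSchema(Employee.class))
                .withDataModel(ReflectData.get())
                .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
                .build();
        long t1 = System.nanoTime();
        for (Employee e : employees) {
            writer.write(e);
        }
        writer.close(); // close() flushes the row group; count it as write time
        long t2 = System.nanoTime();
        System.out.printf("open: %d ms, write+close: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}

If the first number dominates, the 6-7 seconds is fixed setup overhead that 
amortizes away on realistic data volumes, not per-record serialization cost.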



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-18 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954944#comment-16954944
 ] 

Felix Kizhakkel Jose edited comment on PARQUET-1679 at 10/18/19 8:08 PM:
-

Hi [~q.xu],
Thank you for the quick response. Could you please give me a sample, or a 
snippet of what you mentioned?

PS: The model I have is already defined and used by other consumers, so I 
cannot modify it.


was (Author: felixkjose):
Hi [~q.xu],
Thank you for the quick response. Could you please give me a sample, or a 
snippet of what you mentioned?
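
As a guess at what such a sample might look like (an editorial sketch; whether 
this is what [~q.xu] had in mind is an assumption, and it is untested): Avro's 
reflect API can be told to treat the field as a string carrying the uuid 
logical type, with a registered conversion handling java.util.UUID. It does 
require one annotation on the shared model, which may rule it out here:

import java.util.UUID;
import org.apache.avro.Conversions;
import org.apache.avro.reflect.AvroSchema;
import org.apache.avro.reflect.ReflectData;

public class Employee {
    // Overrides the reflected schema for this one field: an Avro string with
    // the "uuid" logical type instead of an empty record.
    @AvroSchema("{\"type\": \"string\", \"logicalType\": \"uuid\"}")
    private UUID id;
    private String name;
    private int age;
}

class WriterSetup {
    static ReflectData uuidAwareModel() {
        ReflectData model = new ReflectData.AllowNull();
        // Converts UUID <-> CharSequence for the uuid logical type; pass the
        // model to the writer via .withDataModel(uuidAwareModel()).
        model.addLogicalTypeConversion(new Conversions.UUIDConversion());
        return model;
    }
}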




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-18 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954944#comment-16954944
 ] 

Felix Kizhakkel Jose commented on PARQUET-1679:
---

Hi [~q.xu],
Thank you for the quick response. Could you please give me a sample, or a 
snippet of what you mentioned?




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-18 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954922#comment-16954922
 ] 

Felix Kizhakkel Jose commented on PARQUET-1679:
---

org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: required group id {}
    at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27)
    at org.apache.parquet.schema.GroupType.accept(GroupType.java:226)
    at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:31)
    at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37)
    at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
    at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23)
    at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:233)
    at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:280)
    at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:535)
    at com.philips.felix.parquet.ParquetDataSerializer.serialize(ParquetDataSerializer.java:64)
    at com.philips.felix.parquet.Application.run(Application.java:62)
    at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:800)
    at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
    at org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:316)
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:1186)
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:1175)
    at com.philips.felix.parquet.Application.main(Application.java:37)
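
A quick way to see where the empty group comes from (editorial suggestion, 
assuming the Employee model from the issue description is on the classpath) is 
to print the schema that Avro reflection actually generates:

import java.util.UUID;
import org.apache.avro.reflect.ReflectData;

public class SchemaDump {
    public static void main(String[] args) {
        // Prints what the Parquet writer is fed. If the "id" field shows up
        // as an empty record, the reflect path - not the Parquet writer - is
        // what produces the empty group the exception complains about.
        System.out.println(ReflectData.AllowNull.get()
                .getSchema(Employee.class).toString(true));
        System.out.println(ReflectData.get()
                .getSchema(UUID.class).toString(true));
    }
}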


[jira] [Commented] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-18 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954634#comment-16954634
 ] 

Felix Kizhakkel Jose commented on PARQUET-1679:
---

Could someone please help me with this? I am totally blocked in my data-format 
comparison analysis, because UUID is a mandatory field in all my data models.

> Invalid SchemaException for UUID while using AvroParquetWriter
> --
>
> Key: PARQUET-1679
> URL: https://issues.apache.org/jira/browse/PARQUET-1679
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.10.1
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hi,
> I am getting org.apache.parquet.schema.InvalidSchemaException: Cannot write a 
> schema with an empty group: optional group id {} when I include a UUID field 
> in my POJO. Without the UUID everything works fine. I have seen that Parquet 
> supports UUID as part of [#PR-71] in the 2.4 release, yet I am still getting 
> InvalidSchemaException for UUID. Is there anything I am missing, or is this a 
> known issue?
> *My setup details:*
> *gradle dependency:*
> dependencies {
>  compile group: 'org.springframework.boot', name: 'spring-boot-starter'
>  compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
>  compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
>  compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.1'
>  compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
>  compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
>  compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
>  compile group: 'joda-time', name: 'joda-time'
>  compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
>  compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
> }
> *Model used:*
> @Data
> public class Employee {
>  private UUID id;
>  private String name;
>  private int age;
>  private Address address;
> }
> @Data
> public class Address {
>  private String streetName;
>  private String city;
>  private Zip zip;
> }
> @Data
> public class Zip {
>  private int zip;
>  private int ext;
> }
> +*My Serializer Code:*+
> public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {
>  Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
>  Class<?> clazz = inputDataToSerialize.get(0).getClass();
>  try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path)
>   .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
>   .withDataModel(ReflectData.get())
>   .withConf(parquetConfiguration)
>   .withCompressionCodec(compressionCodecName)
>   .withWriteMode(OVERWRITE)
>   .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>   .build()) {
>   for (D input : inputDataToSerialize) {
>    writer.write(input);
>   }
>  }
> }
> private List<Employee> getInputDataToSerialize() {
>  Address address = new Address();
>  address.setStreetName("Murry Ridge Dr");
>  address.setCity("Murrysville");
>  Zip zip = new Zip();
>  zip.setZip(15668);
>  zip.setExt(1234);
>  address.setZip(zip);
>  List<Employee> employees = new ArrayList<>();
>  IntStream.range(0, 10).forEach(i -> {
>   Employee employee = new Employee();
>   // employee.setId(UUID.randomUUID());
>   employee.setAge(20);
>   employee.setName("Test" + i);
>   employee.setAddress(address);
>   employees.add(employee);
>  });
>  return employees;
> }
> _Where generic type D is Employee_
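
The empty group named in the exception can be inspected directly by printing 
the schema that reflection derives for the POJO; a minimal sketch, assuming 
the Employee class quoted above:

import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

public class SchemaDump {
    public static void main(String[] args) {
        // ReflectData has no mapping for java.util.UUID, so the derived
        // schema contains an empty record ("optional group id {}") for it.
        Schema schema = ReflectData.AllowNull.get().getSchema(Employee.class);
        System.out.println(schema.toString(true));
    }
}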



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1680) Parquet Java Serialization is very slow

2019-10-18 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954631#comment-16954631
 ] 

Felix Kizhakkel Jose commented on PARQUET-1680:
---

Could someone please help me with this?

> Parquet Java Serialization is very slow
> 
>
> Key: PARQUET-1680
> URL: https://issues.apache.org/jira/browse/PARQUET-1680
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.10.1
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hi,
> I am doing a POC to compare different data formats and their performance in 
> terms of serialization/deserialization speed, storage size, compatibility 
> between different languages, etc.
> When I try to serialize a simple Java object to a Parquet file, it takes 
> _*6-7 seconds*_, whereas serializing the same object to JSON takes 
> *_100 milliseconds_*.
> Could you help me resolve this issue?
> +*My Configuration and code snippet:*+
> *Gradle dependencies*
> dependencies {
>  compile group: 'org.springframework.boot', name: 'spring-boot-starter'
>  compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
>  compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
>  compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
>  compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
>  compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
>  compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
>  compile group: 'joda-time', name: 'joda-time'
>  compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
>  compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
> }
> *Code snippet:*
> public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {
>  Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
>  Path path1 = new Path("/Downloads/data_" + compressionCodecName + ".parquet");
>  Class<?> clazz = inputDataToSerialize.get(0).getClass();
>  try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
>   .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
>   .withDataModel(ReflectData.get())
>   .withConf(parquetConfiguration)
>   .withCompressionCodec(compressionCodecName)
>   .withWriteMode(OVERWRITE)
>   .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>   .build()) {
>   for (D input : inputDataToSerialize) {
>    writer.write(input);
>   }
>  }
> }
> +*Model Used:*+
> @Data
> public class Employee {
>  // private UUID id;
>  private String name;
>  private int age;
>  private Address address;
> }
> @Data
> public class Address {
>  private String streetName;
>  private String city;
>  private Zip zip;
> }
> @Data
> public class Zip {
>  private int zip;
>  private int ext;
> }
> private List<Employee> getInputDataToSerialize() {
>  Address address = new Address();
>  address.setStreetName("Murry Ridge Dr");
>  address.setCity("Murrysville");
>  Zip zip = new Zip();
>  zip.setZip(15668);
>  zip.setExt(1234);
>  address.setZip(zip);
>  List<Employee> employees = new ArrayList<>();
>  IntStream.range(0, 10).forEach(i -> {
>   Employee employee = new Employee();
>   // employee.setId(UUID.randomUUID());
>   employee.setAge(20);
>   employee.setName("Test" + i);
>   employee.setAddress(address);
>   employees.add(employee);
>  });
>  return employees;
> }
> *Note:*
> *I have tried saving the data to the local file system as well as AWS S3; both give the same result - very slow.*
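
When reading numbers like these, it may help to time writer construction and 
the actual writes separately; for ten records, the one-time Hadoop and Parquet 
setup cost can dominate. A sketch under that assumption, reusing the Employee 
model quoted above:

import java.io.IOException;
import java.util.List;
import org.apache.avro.reflect.ReflectData;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class WriteTiming {
    // Assumes the Employee POJO and test data from the report above.
    static void timedWrite(Path path, List<Employee> employees) throws IOException {
        long t0 = System.nanoTime();
        ParquetWriter<Employee> writer = AvroParquetWriter.<Employee>builder(path)
                .withSchema(ReflectData.AllowNull.get().getSchema(Employee.class))
                .withDataModel(ReflectData.get())
                .build();
        long t1 = System.nanoTime();
        for (Employee e : employees) {
            writer.write(e);
        }
        writer.close(); // rows are buffered per row group; close() flushes data and footer
        long t2 = System.nanoTime();
        System.out.printf("builder+open: %d ms, write+close: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}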



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-17 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated PARQUET-1679:
--
Description: 
Hi,

I am getting org.apache.parquet.schema.InvalidSchemaException: Cannot write a 
schema with an empty group: optional group id {} when I include a UUID field 
in my POJO. Without the UUID everything works fine. I have seen that Parquet 
supports UUID as part of [#PR-71] in the 2.4 release, yet I am still getting 
InvalidSchemaException for UUID. Is there anything I am missing, or is this a 
known issue?

*My setup details:*

*gradle dependency:*

dependencies {
 compile group: 'org.springframework.boot', name: 'spring-boot-starter'
 compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
 compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
 compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
 compile group: 'joda-time', name: 'joda-time'
 compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
 compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}

*Model used:*

@Data
public class Employee {
 private UUID id;
 private String name;
 private int age;
 private Address address;
}

@Data
public class Address {
 private String streetName;
 private String city;
 private Zip zip;
}

@Data
public class Zip {
 private int zip;
 private int ext;
}

+*My Serializer Code:*+

public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {

 Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
 Class<?> clazz = inputDataToSerialize.get(0).getClass();

 try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path)
  .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
  .withDataModel(ReflectData.get())
  .withConf(parquetConfiguration)
  .withCompressionCodec(compressionCodecName)
  .withWriteMode(OVERWRITE)
  .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
  .build()) {

  for (D input : inputDataToSerialize) {
   writer.write(input);
  }
 }
}

private List<Employee> getInputDataToSerialize() {
 Address address = new Address();
 address.setStreetName("Murry Ridge Dr");
 address.setCity("Murrysville");

 Zip zip = new Zip();
 zip.setZip(15668);
 zip.setExt(1234);
 address.setZip(zip);

 List<Employee> employees = new ArrayList<>();
 IntStream.range(0, 10).forEach(i -> {
  Employee employee = new Employee();
  // employee.setId(UUID.randomUUID());
  employee.setAge(20);
  employee.setName("Test" + i);
  employee.setAddress(address);
  employees.add(employee);
 });
 return employees;
}

_Where generic type D is Employee_
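
If the logical-type route from the comment above works out, the builder would 
only change in two places: the schema would come from, and records would be 
written with, a ReflectData instance that has the UUID conversion registered. 
A sketch, assuming such a model (here called uuidAwareModel) is available:

ReflectData model = SchemaSetup.uuidAwareModel(); // hypothetical helper: ReflectData + Conversions.UUIDConversion

try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path)
 .withSchema(model.getSchema(clazz)) // schema now maps UUID to the "uuid" logical type
 .withDataModel(model)               // the writer uses the registered conversion
 .withConf(parquetConfiguration)
 .withCompressionCodec(compressionCodecName)
 .withWriteMode(OVERWRITE)
 .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
 .build()) {
 for (D input : inputDataToSerialize) {
  writer.write(input);
 }
}

Note that the nullable-field behavior of ReflectData.AllowNull would need the 
conversion registered on an AllowNull instance instead.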

  was:
Hi,

I am getting org.apache.parquet.schema.InvalidSchemaException: Cannot write a 
schema with an empty group: optional group id {} when I include a UUID field 
in my POJO. Without the UUID everything works fine. I have seen that Parquet 
supports UUID as part of [#PR-71] in the 2.4 release, yet I am still getting 
InvalidSchemaException for UUID. Is there anything I am missing, or is this a 
known issue?

*My setup details:*

*gradle dependency:*

dependencies {
 compile group: 'org.springframework.boot', name: 'spring-boot-starter'
 compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
 compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
 compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
 compile group: 'joda-time', name: 'joda-time'
 compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
 compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}

*Model used:*

@Data
public class Employee {
 private UUID id;
 private String name;
 private int age;
 private Address address;
}

@Data
public class Address {
 private String streetName;
 private String city;
 private Zip zip;
}

@Data
public class Zip {
 private int zip;
 private int ext;
}

+*My Serializer Code:*+

public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {

 Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
 Class<?> clazz = inputDataToSerialize.get(0).getClass();

 try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path)
  .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields

[jira] [Updated] (PARQUET-1680) Parquet Java Serialization is very slow

2019-10-17 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated PARQUET-1680:
--
Description: 
Hi,

I am doing a POC to compare different data formats and their performance in 
terms of serialization/deserialization speed, storage size, compatibility 
between different languages, etc.

When I try to serialize a simple Java object to a Parquet file, it takes 
_*6-7 seconds*_, whereas serializing the same object to JSON takes 
*_100 milliseconds_*.

Could you help me resolve this issue?

+*My Configuration and code snippet:*+

*Gradle dependencies*

dependencies {
 compile group: 'org.springframework.boot', name: 'spring-boot-starter'
 compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
 compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
 compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
 compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
 compile group: 'joda-time', name: 'joda-time'
 compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
 compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}

*Code snippet:*

public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {

 Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
 Path path1 = new Path("/Downloads/data_" + compressionCodecName + ".parquet");
 Class<?> clazz = inputDataToSerialize.get(0).getClass();

 try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
  .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
  .withDataModel(ReflectData.get())
  .withConf(parquetConfiguration)
  .withCompressionCodec(compressionCodecName)
  .withWriteMode(OVERWRITE)
  .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
  .build()) {

  for (D input : inputDataToSerialize) {
   writer.write(input);
  }
 }
}

+*Model Used:*+

@Data
public class Employee {
 // private UUID id;
 private String name;
 private int age;
 private Address address;
}

@Data
public class Address {
 private String streetName;
 private String city;
 private Zip zip;
}

@Data
public class Zip {
 private int zip;
 private int ext;
}

private List<Employee> getInputDataToSerialize() {
 Address address = new Address();
 address.setStreetName("Murry Ridge Dr");
 address.setCity("Murrysville");

 Zip zip = new Zip();
 zip.setZip(15668);
 zip.setExt(1234);
 address.setZip(zip);

 List<Employee> employees = new ArrayList<>();
 IntStream.range(0, 10).forEach(i -> {
  Employee employee = new Employee();
  // employee.setId(UUID.randomUUID());
  employee.setAge(20);
  employee.setName("Test" + i);
  employee.setAddress(address);
  employees.add(employee);
 });
 return employees;
}

*Note:*
*I have tried saving the data to the local file system as well as AWS S3; both give the same result - very slow.*
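
A quick way to see how much of the 6-7 seconds is one-time initialization is 
to write two files in the same JVM and time each call; if the second write is 
fast, the cost is Hadoop/filesystem/codec startup rather than per-record 
Parquet serialization. A sketch, assuming the serialize() and 
getInputDataToSerialize() methods above:

List<Employee> data = getInputDataToSerialize();

long t0 = System.currentTimeMillis();
serialize(data, CompressionCodecName.UNCOMPRESSED); // pays one-time setup cost
long t1 = System.currentTimeMillis();
serialize(data, CompressionCodecName.UNCOMPRESSED); // warm JVM, same data
long t2 = System.currentTimeMillis();

System.out.println("first write:  " + (t1 - t0) + " ms");
System.out.println("second write: " + (t2 - t1) + " ms");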

  was:
Hi,

I am doing a POC to compare different data formats and their performance in 
terms of serialization/deserialization speed, storage size, compatibility 
between different languages, etc.

When I try to serialize a simple Java object to a Parquet file, it takes 
_*6-7 seconds*_, whereas serializing the same object to JSON takes 
*_100 milliseconds_*.

Could you help me resolve this issue?

+*My Configuration and code snippet:*+

*Gradle dependencies*

dependencies {
 compile group: 'org.springframework.boot', name: 'spring-boot-starter'
 compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
 compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
 compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
 compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
 compile group: 'joda-time', name: 'joda-time'
 compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
 compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}

*Code snippet:*

public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {

 Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
 Path path1 = new Path("/Downloads/data_" + compressionCodecName + ".parquet");
 Class<?> clazz = inputDataToSerialize.get(0).getClass();

 try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
  .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
  .withDataModel(ReflectData.get())

[jira] [Updated] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-16 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated PARQUET-1679:
--
Description: 
Hi,

I am getting org.apache.parquet.schema.InvalidSchemaException: Cannot write a 
schema with an empty group: optional group id {} when I include a UUID field 
in my POJO. Without the UUID everything works fine. I have seen that Parquet 
supports UUID as part of [#PR-71] in the 2.4 release, yet I am still getting 
InvalidSchemaException for UUID. Is there anything I am missing, or is this a 
known issue?

*My setup details:*

*gradle dependency:*

dependencies {
 compile group: 'org.springframework.boot', name: 'spring-boot-starter'
 compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
 compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
 compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
 compile group: 'joda-time', name: 'joda-time'
 compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
 compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}

*Model used:*

@Data
public class Employee {
 private UUID id;
 private String name;
 private int age;
 private Address address;
}

@Data
public class Address {
 private String streetName;
 private String city;
 private Zip zip;
}

@Data
public class Zip {
 private int zip;
 private int ext;
}

+*My Serializer Code:*+

public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {

 Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
 Class<?> clazz = inputDataToSerialize.get(0).getClass();

 try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path)
  .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
  .withDataModel(ReflectData.get())
  .withConf(parquetConfiguration)
  .withCompressionCodec(compressionCodecName)
  .withWriteMode(OVERWRITE)
  .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
  .build()) {

  for (D input : inputDataToSerialize) {
   writer.write(input);
  }
 }
}

_Where generic type D is Employee_

  was:
Hi,

I am getting org.apache.parquet.schema.InvalidSchemaException: Cannot write a 
schema with an empty group: optional group id {} when I include a UUID field 
in my POJO. Without the UUID everything works fine. I have seen that Parquet 
supports UUID as part of [#PR-71] in the 2.4 release, yet I am still getting 
InvalidSchemaException for UUID. Is there anything I am missing, or is this a 
known issue?

*My setup details:*

*gradle dependency:*

dependencies {
 compile group: 'org.springframework.boot', name: 'spring-boot-starter'
 compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
 compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
 compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
 compile group: 'joda-time', name: 'joda-time'
 compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
 compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}

*Model used:*

@Data
public class Employee {
 private UUID id;
 private String name;
 private int age;
 private Address address;
}

@Data
public class Address {
 private String streetName;
 private String city;
 private Zip zip;
}

@Data
public class Zip {
 private int zip;
 private int ext;
}

+*My Serializer Code:*+

public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {

 Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
 Class<?> clazz = inputDataToSerialize.get(0).getClass();

 try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path)
  .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
  .withDataModel(ReflectData.get())
  .withConf(parquetConfiguration)
  .withCompressionCodec(compressionCodecName)
  .withWriteMode(OVERWRITE)
  .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
  .build()) {

  for (D input : inputDataToSerialize) {
   writer.write(input);
  }
 }
}

_Where generic type D is Employee_


> Invalid SchemaException for UUID while using AvroParquetWriter
> --
>
> Key: PARQUET-1679
> URL: 

[jira] [Updated] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-16 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated PARQUET-1679:
--
Description: 
Hi,

I am getting org.apache.parquet.schema.InvalidSchemaException: Cannot write a 
schema with an empty group: optional group id {} when I include a UUID field 
in my POJO. Without the UUID everything works fine. I have seen that Parquet 
supports UUID as part of [#PR-71] in the 2.4 release, yet I am still getting 
InvalidSchemaException for UUID. Is there anything I am missing, or is this a 
known issue?

*My setup details:*

*gradle dependency:*

dependencies {
 compile group: 'org.springframework.boot', name: 'spring-boot-starter'
 compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
 compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
 compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
 compile group: 'joda-time', name: 'joda-time'
 compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
 compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}

*Model used:*

@Data
public class Employee {
 private UUID id;
 private String name;
 private int age;
 private Address address;
}

@Data
public class Address {
 private String streetName;
 private String city;
 private Zip zip;
}

@Data
public class Zip {
 private int zip;
 private int ext;
}

+*My Serializer Code:*+

public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {

 Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
 Class<?> clazz = inputDataToSerialize.get(0).getClass();

 try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path)
  .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
  .withDataModel(ReflectData.get())
  .withConf(parquetConfiguration)
  .withCompressionCodec(compressionCodecName)
  .withWriteMode(OVERWRITE)
  .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
  .build()) {

  for (D input : inputDataToSerialize) {
   writer.write(input);
  }
 }
}

_Where generic type D is Employee_

  was:
Hi,

I am getting org.apache.parquet.schema.InvalidSchemaException: Cannot write a 
schema with an empty group: optional group id {} when I include a UUID field 
in my POJO. Without the UUID everything works fine. I have seen that Parquet 
supports UUID as part of [#PR-71] in the 2.4 release, yet I am still getting 
InvalidSchemaException for UUID. Is there anything I am missing, or is this a 
known issue?

*My setup details:*

*gradle dependency :*

dependencies {
 compile group: 'org.springframework.boot', name: 'spring-boot-starter'
 compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'

 compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: 
'1.11.271'
 compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
 compile group: 'joda-time', name: 'joda-time'
 compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', 
version: '2.6.5'
 compile group: 'com.fasterxml.jackson.datatype', name: 
'jackson-datatype-joda', version: '2.6.5'
}

*Model used:*

@Data
public class Employee {
 private UUID id;
 private String name;
 private int age;
 private Address address;
}

+*My Serializer Code:*+



public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {

 Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
 Class<?> clazz = inputDataToSerialize.get(0).getClass();

 try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path)
  .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
  .withDataModel(ReflectData.get())
  .withConf(parquetConfiguration)
  .withCompressionCodec(compressionCodecName)
  .withWriteMode(OVERWRITE)
  .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
  .build()) {

  for (D input : inputDataToSerialize) {
   writer.write(input);
  }
 }
}

_Where generic type D is Employee_


> Invalid SchemaException for UUID while using AvroParquetWriter
> --
>
> Key: PARQUET-1679
> URL: https://issues.apache.org/jira/browse/PARQUET-1679
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects 

[jira] [Updated] (PARQUET-1680) Parquet Java Serialization is very slow

2019-10-16 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated PARQUET-1680:
--
Description: 
Hi,

I am doing a POC to compare different data formats and their performance in 
terms of serialization/deserialization speed, storage size, compatibility 
between different languages, etc.

When I try to serialize a simple Java object to a Parquet file, it takes 
_*6-7 seconds*_, whereas serializing the same object to JSON takes 
*_100 milliseconds_*.

Could you help me resolve this issue?

+*My Configuration and code snippet:*+

*Gradle dependencies*

dependencies {
 compile group: 'org.springframework.boot', name: 'spring-boot-starter'
 compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
 compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
 compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
 compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
 compile group: 'joda-time', name: 'joda-time'
 compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
 compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}

*Code snippet:*

public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {

 Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
 Path path1 = new Path("/Downloads/data_" + compressionCodecName + ".parquet");
 Class<?> clazz = inputDataToSerialize.get(0).getClass();

 try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
  .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
  .withDataModel(ReflectData.get())
  .withConf(parquetConfiguration)
  .withCompressionCodec(compressionCodecName)
  .withWriteMode(OVERWRITE)
  .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
  .build()) {

  for (D input : inputDataToSerialize) {
   writer.write(input);
  }
 }
}

+*Model Used:*+

@Data
public class Employee {
 // private UUID id;
 private String name;
 private int age;
 private Address address;
}

@Data
public class Address {
 private String streetName;
 private String city;
 private Zip zip;
}

@Data
public class Zip {
 private int zip;
 private int ext;
}

*Note:*
*I have tried saving the data to the local file system as well as AWS S3; both give the same result - very slow.*

  was:
Hi,

I am doing a POC to compare different data formats and their performance in 
terms of serialization/deserialization speed, storage size, compatibility 
between different languages, etc.

When I try to serialize a simple Java object to a Parquet file, it takes 
_*6-7 seconds*_, whereas serializing the same object to JSON takes 
*_100 milliseconds_*.

Could you help me resolve this issue?

+*My Configuration and code snippet:*+

*Gradle dependencies*

dependencies {
 compile group: 'org.springframework.boot', name: 'spring-boot-starter'
 compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
 compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
 compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
 compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
 compile group: 'joda-time', name: 'joda-time'
 compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
 compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}

*Code snippet:*

public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {

 Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
 Path path1 = new Path("/Users/felixkizhakkeljose/Downloads/data_" + compressionCodecName + ".parquet");
 Class<?> clazz = inputDataToSerialize.get(0).getClass();

 try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
  .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
  .withDataModel(ReflectData.get())
  .withConf(parquetConfiguration)
  .withCompressionCodec(compressionCodecName)
  .withWriteMode(OVERWRITE)
  .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
  .build()) {

  for (D input : inputDataToSerialize) {
   writer.write(input);
  }
 }
}

+*Model Used:*+

@Data
public class Employee {
 // private UUID id;
 private String name;
 private int age;
 private Address address;
}

@Data
public class Address {
 private String streetName;
 private String city;
 private Zip zip;
}

@Data
public class Zip {

[jira] [Updated] (PARQUET-1680) Parquet Java Serialization is very slow

2019-10-16 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated PARQUET-1680:
--
Description: 
Hi,

I am doing a POC to compare different data formats and their performance in 
terms of serialization/deserialization speed, storage size, compatibility 
between different languages, etc.

When I try to serialize a simple Java object to a Parquet file, it takes 
_*6-7 seconds*_, whereas serializing the same object to JSON takes 
*_100 milliseconds_*.

Could you help me resolve this issue?

+*My Configuration and code snippet:*+

*Gradle dependencies*

dependencies {
 compile group: 'org.springframework.boot', name: 'spring-boot-starter'
 compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
 compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
 compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
 compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
 compile group: 'joda-time', name: 'joda-time'
 compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
 compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}

*Code snippet:*

public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {

 Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
 Path path1 = new Path("/Users/felixkizhakkeljose/Downloads/data_" + compressionCodecName + ".parquet");
 Class<?> clazz = inputDataToSerialize.get(0).getClass();

 try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
  .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
  .withDataModel(ReflectData.get())
  .withConf(parquetConfiguration)
  .withCompressionCodec(compressionCodecName)
  .withWriteMode(OVERWRITE)
  .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
  .build()) {

  for (D input : inputDataToSerialize) {
   writer.write(input);
  }
 }
}

+*Model Used:*+

@Data
public class Employee {
 // private UUID id;
 private String name;
 private int age;
 private Address address;
}

@Data
public class Address {
 private String streetName;
 private String city;
 private Zip zip;
}

@Data
public class Zip {
 private int zip;
 private int ext;
}

*Note:*
*I have tried saving the data to the local file system as well as AWS S3; both give the same result - very slow.*

  was:
Hi,

I am doing a POC to compare different data formats and their performance in 
terms of serialization/deserialization speed, storage size, compatibility 
between different languages, etc.

When I try to serialize a simple Java object to a Parquet file, it takes 6-7 
seconds, whereas serializing the same object to JSON takes 100 milliseconds.

Could you help me resolve this issue?

+*My Configuration and code snippet:*+

*Gradle dependencies*

dependencies {
 compile group: 'org.springframework.boot', name: 'spring-boot-starter'
 compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
 compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
 compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
 compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
 compile group: 'joda-time', name: 'joda-time'
 compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
 compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}

*Code snippet:*

public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {

 Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
 Path path1 = new Path("/Users/felixkizhakkeljose/Downloads/data_" + compressionCodecName + ".parquet");
 Class<?> clazz = inputDataToSerialize.get(0).getClass();

 try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
  .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
  .withDataModel(ReflectData.get())
  .withConf(parquetConfiguration)
  .withCompressionCodec(compressionCodecName)
  .withWriteMode(OVERWRITE)
  .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
  .build()) {

  for (D input : inputDataToSerialize) {
   writer.write(input);
  }
 }
}

+*Model Used:*+

@Data
public class Employee {
 // private UUID id;
 private String name;
 private int age;
 private Address address;
}

@Data
public class Address {
 private String streetName;
 private String city;
 private Zip zip;
}

@Data
public

[jira] [Updated] (PARQUET-1680) Parquet Java Serialization is very slow

2019-10-16 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated PARQUET-1680:
--
Description: 
Hi,

I am doing a POC to compare different data formats and their performance in 
terms of serialization/deserialization speed, storage size, compatibility 
between different languages, etc.

When I try to serialize a simple Java object to a Parquet file, it takes 6-7 
seconds, whereas serializing the same object to JSON takes 100 milliseconds.

Could you help me resolve this issue?

+*My Configuration and code snippet:*+

*Gradle dependencies*

dependencies {
 compile group: 'org.springframework.boot', name: 'spring-boot-starter'
 compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
 compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
 compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
 compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
 compile group: 'joda-time', name: 'joda-time'
 compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
 compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}

*Code snippet:*

public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {

 Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
 Path path1 = new Path("/Users/felixkizhakkeljose/Downloads/data_" + compressionCodecName + ".parquet");
 Class<?> clazz = inputDataToSerialize.get(0).getClass();

 try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
  .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
  .withDataModel(ReflectData.get())
  .withConf(parquetConfiguration)
  .withCompressionCodec(compressionCodecName)
  .withWriteMode(OVERWRITE)
  .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
  .build()) {

  for (D input : inputDataToSerialize) {
   writer.write(input);
  }
 }
}

+*Model Used:*+

@Data
public class Employee {
 // private UUID id;
 private String name;
 private int age;
 private Address address;
}

@Data
public class Address {
 private String streetName;
 private String city;
 private Zip zip;
}

@Data
public class Zip {
 private int zip;
 private int ext;
}

*Note:*
*I have tried saving the data to the local file system as well as AWS S3; both give the same result - very slow.*

  was:
Hi,

I am doing a POC to compare different data formats and their performance in 
terms of serialization/deserialization speed, storage size, compatibility 
between different languages, etc.

When I try to serialize a simple Java object to a Parquet file, it takes 6-7 
seconds, whereas serializing the same object to JSON takes 100 milliseconds.

Could you help me resolve this issue?

+*My Configuration and code snippet:*+
*Gradle dependencies*
dependencies {
 compile group: 'org.springframework.boot', name: 'spring-boot-starter'
 compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'

 compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: 
'1.11.271'
 compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
 compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
 compile group: 'joda-time', name: 'joda-time'
 compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', 
version: '2.6.5'
 compile group: 'com.fasterxml.jackson.datatype', name: 
'jackson-datatype-joda', version: '2.6.5'
}

*Code snippet:*

public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {

 Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
 Path path1 = new Path("/Users/felixkizhakkeljose/Downloads/data_" + compressionCodecName + ".parquet");
 Class<?> clazz = inputDataToSerialize.get(0).getClass();

 try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
  .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
  .withDataModel(ReflectData.get())
  .withConf(parquetConfiguration)
  .withCompressionCodec(compressionCodecName)
  .withWriteMode(OVERWRITE)
  .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
  .build()) {

  for (D input : inputDataToSerialize) {
   writer.write(input);
  }
 }
}

*Note:*
*I have tried saving the data to the local file system as well as AWS S3; both give the same result - very slow.*


> Parquet Java Serialization is very slow
> 
>
> Key: PARQUET-1680
>  

[jira] [Created] (PARQUET-1680) Parquet Java Serialization is very slow

2019-10-16 Thread Felix Kizhakkel Jose (Jira)
Felix Kizhakkel Jose created PARQUET-1680:
-

 Summary: Parquet Java Serialization is very slow
 Key: PARQUET-1680
 URL: https://issues.apache.org/jira/browse/PARQUET-1680
 Project: Parquet
  Issue Type: Bug
  Components: parquet-avro
Affects Versions: 1.10.1
Reporter: Felix Kizhakkel Jose


Hi,

I am doing a POC to compare different data formats and their performance in 
terms of serialization/deserialization speed, storage size, compatibility 
between different languages, etc.

When I try to serialize a simple Java object to a Parquet file, it takes 6-7 
seconds, whereas serializing the same object to JSON takes 100 milliseconds.

Could you help me resolve this issue?

+*My Configuration and code snippet:*+
*Gradle dependencies*
dependencies {
 compile group: 'org.springframework.boot', name: 'spring-boot-starter'
 compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'

 compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: 
'1.11.271'
 compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
 compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
 compile group: 'joda-time', name: 'joda-time'
 compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', 
version: '2.6.5'
 compile group: 'com.fasterxml.jackson.datatype', name: 
'jackson-datatype-joda', version: '2.6.5'
}

*Code snippet:*

public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {

 Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
 Path path1 = new Path("/Users/felixkizhakkeljose/Downloads/data_" + compressionCodecName + ".parquet");
 Class<?> clazz = inputDataToSerialize.get(0).getClass();

 try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
  .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
  .withDataModel(ReflectData.get())
  .withConf(parquetConfiguration)
  .withCompressionCodec(compressionCodecName)
  .withWriteMode(OVERWRITE)
  .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
  .build()) {

  for (D input : inputDataToSerialize) {
   writer.write(input);
  }
 }
}

*Note:*
*I have tried saving the data to the local file system as well as AWS S3; both give the same result - very slow.*
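
For a like-for-like comparison it may also be worth timing the JSON path with 
the same harness; a sketch, assuming Jackson's ObjectMapper (already on the 
classpath above) as the JSON writer:

import java.io.File;
import java.io.IOException;
import java.util.List;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonTiming {
    // Assumes the same List<Employee> test data used for the Parquet writer.
    static void timedJsonWrite(List<Employee> employees) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        long t0 = System.nanoTime();
        mapper.writeValue(new File("data.json"), employees);
        long t1 = System.nanoTime();
        System.out.println("JSON write: " + (t1 - t0) / 1_000_000 + " ms");
    }
}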



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-16 Thread Felix Kizhakkel Jose (Jira)
Felix Kizhakkel Jose created PARQUET-1679:
-

 Summary: Invalid SchemaException for UUID while using 
AvroParquetWriter
 Key: PARQUET-1679
 URL: https://issues.apache.org/jira/browse/PARQUET-1679
 Project: Parquet
  Issue Type: Bug
  Components: parquet-avro
Affects Versions: 1.10.1
Reporter: Felix Kizhakkel Jose


Hi,

I am getting org.apache.parquet.schema.InvalidSchemaException: Cannot write a 
schema with an empty group: optional group id {} when I include a UUID field 
in my POJO. Without the UUID everything works fine. I have seen that Parquet 
supports UUID as part of [#PR-71] in the 2.4 release, yet I am still getting 
InvalidSchemaException for UUID. Is there anything I am missing, or is this a 
known issue?

*My setup details:*

*gradle dependency :*

dependencies {
 compile group: 'org.springframework.boot', name: 'spring-boot-starter'
 compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'

 compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: 
'1.11.271'
 compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
 compile group: 'joda-time', name: 'joda-time'
 compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', 
version: '2.6.5'
 compile group: 'com.fasterxml.jackson.datatype', name: 
'jackson-datatype-joda', version: '2.6.5'
}

*Model used:*

@Data
public class Employee {
 private UUID id;
 private String name;
 private int age;
 private Address address;
}

+*My Serializer Code:*+



public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {

 Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
 Class<?> clazz = inputDataToSerialize.get(0).getClass();

 try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path)
  .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
  .withDataModel(ReflectData.get())
  .withConf(parquetConfiguration)
  .withCompressionCodec(compressionCodecName)
  .withWriteMode(OVERWRITE)
  .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
  .build()) {

  for (D input : inputDataToSerialize) {
   writer.write(input);
  }
 }
}

_Where generic type D is Employee_



--
This message was sent by Atlassian Jira
(v8.3.4#803005)