[jira] [Commented] (PARQUET-1830) Vectorized API to support Column Index in Apache Spark
[ https://issues.apache.org/jira/browse/PARQUET-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071288#comment-17071288 ]

Felix Kizhakkel Jose commented on PARQUET-1830:
-----------------------------------------------

(y)

> Vectorized API to support Column Index in Apache Spark
> ------------------------------------------------------
>
> Key: PARQUET-1830
> URL: https://issues.apache.org/jira/browse/PARQUET-1830
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Affects Versions: 1.11.0
> Reporter: Felix Kizhakkel Jose
> Priority: Major
>
> As per the comment on https://issues.apache.org/jira/browse/SPARK-26345, it
> seems Apache Spark does not support Column Index unless the vectorized
> reader is disabled in Spark, which has other performance implications. As
> per [~zi], parquet-mr should implement a Vectorized API. Is it already
> implemented, or is there a pull request for it?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (PARQUET-1830) Vectorized API to support Column Index in Apache Spark
[ https://issues.apache.org/jira/browse/PARQUET-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070975#comment-17070975 ]

Felix Kizhakkel Jose commented on PARQUET-1830:
-----------------------------------------------

Yes, for the short term, option 1. But this Jira is for the long-term solution, so I want it to stay open until we have a Vectorized API. I will update the Spark Jira SPARK-26345 with the short-term solution.
[jira] [Commented] (PARQUET-1830) Vectorized API to support Column Index in Apache Spark
[ https://issues.apache.org/jira/browse/PARQUET-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070954#comment-17070954 ]

Felix Kizhakkel Jose commented on PARQUET-1830:
-----------------------------------------------

[~gszadovszky] This Jira is for the long-term solution. But do we have any Jira for a short-term solution? That could benefit many people who are using Parquet with Spark.
[jira] [Commented] (PARQUET-1830) Vectorized API to support Column Index in Apache Spark
[ https://issues.apache.org/jira/browse/PARQUET-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068707#comment-17068707 ]

Felix Kizhakkel Jose commented on PARQUET-1830:
-----------------------------------------------

Thank you [~gszadovszky]. IMHO, I prefer option 1 as a short-term workaround; the great performance improvement from the Offset and Column Indexes could benefit a lot of people. For the long term, in addition to option 2, make it a Vectorized API and coordinate with the Spark team to integrate the work, avoiding (or at least minimizing) any spilling of logic outside the core Parquet library.
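For reference, the short-term workaround discussed in this thread (option 1) is, as I understand it, to disable Spark's vectorized Parquet reader so that reads fall back to the row-oriented parquet-mr reader, which can apply column-index filtering. A minimal sketch of the configuration, assuming a Spark deployment where this key is honored (the performance trade-off on wide scans noted above still applies):

```
# spark-defaults.conf: fall back to parquet-mr's non-vectorized reader,
# trading vectorized-scan speed for column/offset index support.
spark.sql.parquet.enableVectorizedReader  false
```

The same key can be set per session, e.g. spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false").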
[jira] [Updated] (PARQUET-1830) Vectorized API to support Column Index in Apache Spark
[ https://issues.apache.org/jira/browse/PARQUET-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Kizhakkel Jose updated PARQUET-1830:
------------------------------------------
Component/s: parquet-mr
Affects Version/s: 1.11.0
Description:
As per the comment on https://issues.apache.org/jira/browse/SPARK-26345, it seems Apache Spark does not support Column Index unless the vectorized reader is disabled in Spark, which has other performance implications. As per [~zi], parquet-mr should implement a Vectorized API. Is it already implemented, or is there a pull request for it?
Summary: Vectorized API to support Column Index in Apache Spark (was: Vectorized API to supprt)
[jira] [Created] (PARQUET-1830) Vectorized API to supprt
Felix Kizhakkel Jose created PARQUET-1830:
------------------------------------------
Summary: Vectorized API to supprt
Key: PARQUET-1830
URL: https://issues.apache.org/jira/browse/PARQUET-1830
Project: Parquet
Issue Type: New Feature
Reporter: Felix Kizhakkel Jose
[jira] [Commented] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter
[ https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967620#comment-16967620 ]

Felix Kizhakkel Jose commented on PARQUET-1679:
-----------------------------------------------

Hi [~q.xu], thank you so much. Do you know whether there is any converter [a Parquet converter] instead of the AvroConverter? I couldn't find one. Could you please provide some insights?

> Invalid SchemaException for UUID while using AvroParquetWriter
> --------------------------------------------------------------
>
> Key: PARQUET-1679
> URL: https://issues.apache.org/jira/browse/PARQUET-1679
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro
> Affects Versions: 1.10.1
> Reporter: Felix Kizhakkel Jose
> Priority: Major
>
> Hi,
> I am getting org.apache.parquet.schema.InvalidSchemaException: Cannot write a
> schema with an empty group: optional group id {} when I include a UUID field
> on my POJO. Without the UUID everything works fine. I have seen that Parquet
> supports UUID as part of [#PR-71] on the 2.4 release, yet I am getting
> InvalidSchemaException for UUID. Is there anything that I am missing, or is
> it a known issue?
>
> *My setup details:*
> *Gradle dependencies:*
> dependencies {
>     compile group: 'org.springframework.boot', name: 'spring-boot-starter'
>     compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
>     compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
>     compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.1'
>     compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
>     compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
>     compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
>     compile group: 'joda-time', name: 'joda-time'
>     compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
>     compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
> }
>
> *Model used:*
> @Data
> public class Employee {
>     private UUID id;
>     private String name;
>     private int age;
>     private Address address;
> }
> @Data
> public class Address {
>     private String streetName;
>     private String city;
>     private Zip zip;
> }
> @Data
> public class Zip {
>     private int zip;
>     private int ext;
> }
>
> +*My serializer code:*+
> public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {
>     Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
>     Class<?> clazz = inputDataToSerialize.get(0).getClass();
>     try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path)
>             .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
>             .withDataModel(ReflectData.get())
>             .withConf(parquetConfiguration)
>             .withCompressionCodec(compressionCodecName)
>             .withWriteMode(OVERWRITE)
>             .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>             .build()) {
>         for (D input : inputDataToSerialize) {
>             writer.write(input);
>         }
>     }
> }
>
> private List<Employee> getInputDataToSerialize() {
>     Address address = new Address();
>     address.setStreetName("Murry Ridge Dr");
>     address.setCity("Murrysville");
>     Zip zip = new Zip();
>     zip.setZip(15668);
>     zip.setExt(1234);
>     address.setZip(zip);
>     List<Employee> employees = new ArrayList<>();
>     IntStream.range(0, 10).forEach(i -> {
>         Employee employee = new Employee();
>         // employee.setId(UUID.randomUUID());
>         employee.setAge(20);
>         employee.setName("Test" + i);
>         employee.setAddress(address);
>         employees.add(employee);
>     });
>     return employees;
> }
> _Where the generic type D is Employee._
[jira] [Commented] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter
[ https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961429#comment-16961429 ]

Felix Kizhakkel Jose commented on PARQUET-1679:
-----------------------------------------------

[~q.xu] what do you think?
[jira] [Comment Edited] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter
[ https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956249#comment-16956249 ]

Felix Kizhakkel Jose edited comment on PARQUET-1679 at 10/21/19 4:39 PM:
-------------------------------------------------------------------------

When you mention _"You could just use a `byte[]` instead of `UUID` for your `id` field."_, do you mean I have to change the type of the id field in the model from UUID to byte[]? For example:

@Data
public class Employee { private *byte[] id*; private String name; private int age; private Address address; }

instead of:

@Data
public class Employee { private *UUID id*; private String name; private int age; private Address address; }

If yes, then updating the model is not a viable solution for me, since this model is used by other consumers as well.
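If the model did expose the identifier as bytes, the 128 bits of a UUID pack losslessly into a 16-byte array, so the on-disk representation loses nothing. A minimal pure-JDK sketch of that round trip (the class and method names here are mine for illustration, not from the thread):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidBytes {

    // Pack a UUID's 128 bits into a 16-byte array (most-significant half first).
    public static byte[] toBytes(UUID id) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(id.getMostSignificantBits());
        buf.putLong(id.getLeastSignificantBits());
        return buf.array();
    }

    // Rebuild the UUID from the same 16-byte layout.
    public static UUID fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        return new UUID(buf.getLong(), buf.getLong());
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        byte[] raw = toBytes(id);
        assert raw.length == 16;
        assert fromBytes(raw).equals(id);
    }
}
```

A byte[] field avoids the empty-group schema that reflection generates for java.util.UUID, at the cost of converting at the application boundary.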
[jira] [Commented] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter
[ https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956249#comment-16956249 ] Felix Kizhakkel Jose commented on PARQUET-1679: --- When you mention _"You could just use a `byte[]` instead of `UUID` for your `id` field."_, do you mean I have to change the type of the id field in the model to byte[] instead of UUID? For example:
@Data
public class Employee { private *byte[] id*; private String name; private int age; private Address address; }
instead of:
@Data
public class Employee { private *UUID id*; private String name; private int age; private Address address; }
> Invalid SchemaException for UUID while using AvroParquetWriter
> --
>
> Key: PARQUET-1679
> URL: https://issues.apache.org/jira/browse/PARQUET-1679
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro
> Affects Versions: 1.10.1
> Reporter: Felix Kizhakkel Jose
> Priority: Major
>
> Hi,
> I am getting org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: optional group id {} when I include a UUID field in my POJO. Without the UUID everything works fine. I have seen that Parquet supports UUID as part of [#PR-71] in the 2.4 release.
> But I am getting InvalidSchemaException on UUID. Is there anything that I am missing, or is it a known issue?
-- This message was sent by Atlassian Jira (v8.3.4#803005)
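If the id field is changed to byte[] as suggested, the UUID still has to be packed into raw bytes somewhere at the application boundary. A minimal sketch of such a conversion, assuming the suggestion means storing the 16 raw UUID bytes in a BYTE_ARRAY column (the `UuidBytes` helper is hypothetical, not part of parquet-avro):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidBytes {
    // Pack a UUID into a 16-byte array: 8 bytes for the most-significant
    // half, 8 for the least-significant half.
    public static byte[] toBytes(UUID id) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(id.getMostSignificantBits());
        buf.putLong(id.getLeastSignificantBits());
        return buf.array();
    }

    // Reverse the packing on read.
    public static UUID fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        return new UUID(buf.getLong(), buf.getLong());
    }
}
```

Callers that cannot change the shared model could instead keep the UUID field on the public model and map to a byte[]-typed DTO only for the Parquet writer.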
[jira] [Updated] (PARQUET-1680) Parquet Java Serialization is very slow
[ https://issues.apache.org/jira/browse/PARQUET-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Kizhakkel Jose updated PARQUET-1680: -- Component/s: parquet-mr
> Parquet Java Serialization is very slow
> --
>
> Key: PARQUET-1680
> URL: https://issues.apache.org/jira/browse/PARQUET-1680
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro, parquet-mr
> Affects Versions: 1.10.1
> Reporter: Felix Kizhakkel Jose
> Priority: Major
>
> Hi,
> I am doing a POC to compare different data formats and their performance in terms of serialization/deserialization speed, storage size, compatibility between languages, etc.
> When I try to serialize a simple Java object to a Parquet file, it takes _*6-7 seconds*_, while serializing the same object to JSON takes *_100 milliseconds_*.
> Could you help me to resolve this issue?
> +*My Configuration and code snippet:*+
> *Gradle dependencies*
> dependencies {
>   compile group: 'org.springframework.boot', name: 'spring-boot-starter'
>   compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
>   compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
>   compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
>   compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
>   compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
>   compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
>   compile group: 'joda-time', name: 'joda-time'
>   compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
>   compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
> }
> *Code snippet:*
> public <D> void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {
>   Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
>   Path path1 = new Path("/Downloads/data_" + compressionCodecName + ".parquet");
>   Class<?> clazz = inputDataToSerialize.get(0).getClass();
>   try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
>       .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
>       .withDataModel(ReflectData.get())
>       .withConf(parquetConfiguration)
>       .withCompressionCodec(compressionCodecName)
>       .withWriteMode(OVERWRITE)
>       .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>       .build()) {
>     for (D input : inputDataToSerialize) { writer.write(input); }
>   }
> }
> +*Model Used:*+
> @Data
> public class Employee { //private UUID id; private String name; private int age; private Address address; }
> @Data
> public class Address { private String streetName; private String city; private Zip zip; }
> @Data
> public class Zip { private int zip; private int ext; }
>
> private List<Employee> getInputDataToSerialize() {
>   Address address = new Address();
>   address.setStreetName("Murry Ridge Dr");
>   address.setCity("Murrysville");
>   Zip zip = new Zip();
>   zip.setZip(15668);
>   zip.setExt(1234);
>   address.setZip(zip);
>   List<Employee> employees = new ArrayList<>();
>   IntStream.range(0, 10).forEach(i -> {
>     Employee employee = new Employee();
>     // employee.setId(UUID.randomUUID());
>     employee.setAge(20);
>     employee.setName("Test" + i);
>     employee.setAddress(address);
>     employees.add(employee);
>   });
>   return employees;
> }
> *Note:*
> *I have tried to save the data to the local file system as well as AWS S3, but both give the same result - very slow.*
-- This message was sent by Atlassian Jira (v8.3.4#803005)
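One way to check whether the reported 6-7 seconds reflects per-record write cost or one-time setup cost (class loading, Hadoop FileSystem initialization, codec setup, which Hadoop-based writers typically pay on first use) is to time the same serialize call twice in one JVM and compare. A minimal, hypothetical timing helper sketch (`WarmVsCold` and `timeMillis` are names introduced here, not from the report; `task` stands in for the serialize(...) call):

```java
import java.util.concurrent.TimeUnit;

public class WarmVsCold {
    // Times a single run of a task in milliseconds. Run it once "cold"
    // (first invocation in the JVM) and once "warm" (second invocation);
    // if the warm run is far faster, the slowness is mostly one-time
    // initialization rather than per-record serialization cost.
    public static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }
}
```

For 10 small records, amortizing setup over a larger batch (or reusing one writer/JVM across many batches) would make the comparison with JSON fairer.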
[jira] [Comment Edited] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter
[ https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954944#comment-16954944 ] Felix Kizhakkel Jose edited comment on PARQUET-1679 at 10/18/19 8:08 PM: - Hi [~q.xu], Thank you for the quick response. Could you please give me a sample or a snippet of what you mentioned? PS: The model I have is already defined and used by other consumers, so I cannot modify the model. was (Author: felixkjose): Hi [~q.xu], Thank you for the quick response. Could you please give me a sample or a snippet of what you mentioned?
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter
[ https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954944#comment-16954944 ] Felix Kizhakkel Jose commented on PARQUET-1679: --- Hi [~q.xu], Thank you for the quick response. Could you please give me a sample or a snippet of what you mentioned?
-- This message was sent by Atlassian Jira (v8.3.4#803005)
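Since the POJO cannot be modified, another avenue is to hand the writer an explicit Avro schema instead of the reflection-derived one, declaring `id` as a string-backed field. Whether the `uuid` logical type is honored end-to-end depends on the Avro and parquet-avro versions in use, so treat this as an assumption to verify; the nested Address record is omitted for brevity. A sketch of such a schema:

```json
{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id", "type": {"type": "string", "logicalType": "uuid"}},
    {"name": "name", "type": ["null", "string"], "default": null},
    {"name": "age", "type": "int"}
  ]
}
```

The schema would be passed via `.withSchema(...)` in place of `ReflectData.AllowNull.get().getSchema(clazz)`, which avoids reflection producing an empty group for the UUID type.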
[jira] [Commented] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter
[ https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954922#comment-16954922 ] Felix Kizhakkel Jose commented on PARQUET-1679: ---
org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: required group id {}
 at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27)
 at org.apache.parquet.schema.GroupType.accept(GroupType.java:226)
 at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:31)
 at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37)
 at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
 at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23)
 at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:233)
 at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:280)
 at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:535)
 at com.philips.felix.parquet.ParquetDataSerializer.serialize(ParquetDataSerializer.java:64)
 at com.philips.felix.parquet.Application.run(Application.java:62)
 at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:800)
 at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
 at org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
 at org.springframework.boot.SpringApplication.run(SpringApplication.java:316)
 at org.springframework.boot.SpringApplication.run(SpringApplication.java:1186)
 at org.springframework.boot.SpringApplication.run(SpringApplication.java:1175)
 at com.philips.felix.parquet.Application.main(Application.java:37)
[jira] [Commented] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter
[ https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954634#comment-16954634 ] Felix Kizhakkel Jose commented on PARQUET-1679: --- Could someone please help me with this? I am completely blocked in my data-format comparison analysis because UUID is a mandatory field in all my data models.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1680) Parquet Java Serialization is very slow
[ https://issues.apache.org/jira/browse/PARQUET-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954631#comment-16954631 ] Felix Kizhakkel Jose commented on PARQUET-1680: --- Could someone please help me on this?
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter
[ https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Kizhakkel Jose updated PARQUET-1679: -- Description: Hi, I am getting org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: optional group id {} when I include a UUID field in my POJO. Without the UUID everything works fine. I have seen that Parquet supports UUID as part of [#PR-71] in the 2.4 release. But I am getting InvalidSchemaException on UUID. Is there anything that I am missing, or is it a known issue?
_Where generic type D is Employee_
[jira] [Updated] (PARQUET-1680) Parquet Java Serialization is very slow
[ https://issues.apache.org/jira/browse/PARQUET-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Kizhakkel Jose updated PARQUET-1680:
--
Description:
Hi, I am doing a POC to compare different data formats and their performance in terms of serialization/deserialization speed, storage size, compatibility between languages, etc. When I serialize a simple Java object to a Parquet file, it takes _*6-7 seconds*_, whereas serializing the same object to JSON takes *_100 milliseconds_*. Could you help me resolve this issue?

+*My Configuration and code snippet:*+

*Gradle dependencies*
dependencies {
    compile group: 'org.springframework.boot', name: 'spring-boot-starter'
    compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
    compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
    compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
    compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
    compile group: 'joda-time', name: 'joda-time'
    compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
    compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}

*Code snippet:*
public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {
    Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
    Path path1 = new Path("/Downloads/data_" + compressionCodecName + ".parquet");
    Class clazz = inputDataToSerialize.get(0).getClass();
    try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
            .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
            .withDataModel(ReflectData.get())
            .withConf(parquetConfiguration)
            .withCompressionCodec(compressionCodecName)
            .withWriteMode(OVERWRITE)
            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
            .build()) {
        for (D input : inputDataToSerialize) {
            writer.write(input);
        }
    }
}

+*Model Used:*+
@Data
public class Employee {
    //private UUID id;
    private String name;
    private int age;
    private Address address;
}

@Data
public class Address {
    private String streetName;
    private String city;
    private Zip zip;
}

@Data
public class Zip {
    private int zip;
    private int ext;
}

private List<Employee> getInputDataToSerialize() {
    Address address = new Address();
    address.setStreetName("Murry Ridge Dr");
    address.setCity("Murrysville");
    Zip zip = new Zip();
    zip.setZip(15668);
    zip.setExt(1234);
    address.setZip(zip);
    List<Employee> employees = new ArrayList<>();
    IntStream.range(0, 10).forEach(i -> {
        Employee employee = new Employee();
        // employee.setId(UUID.randomUUID());
        employee.setAge(20);
        employee.setName("Test" + i);
        employee.setAddress(address);
        employees.add(employee);
    });
    return employees;
}

*Note:* *I have tried saving the data to the local file system as well as to AWS S3; both give the same result - very slow.*
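In one-shot benchmarks like the snippet above, much of the multi-second cost typically comes from fixed setup (Hadoop Configuration/FileSystem initialization and reflective Avro schema generation) rather than from per-record encoding, so timing setup and writes separately makes the comparison with JSON fairer. A minimal, dependency-free sketch of that measurement pattern (the Runnable/IntConsumer bodies are hypothetical stand-ins for the real writer setup and per-record write):

```java
import java.util.concurrent.TimeUnit;
import java.util.function.IntConsumer;

public class WriteTiming {
    // Times a one-time setup step and a per-record loop separately,
    // so fixed initialization cost is not attributed to serialization.
    static long[] measure(Runnable setup, IntConsumer writeOne, int records) {
        long t0 = System.nanoTime();
        setup.run();                 // stand-in for building the ParquetWriter
        long setupNanos = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        for (int i = 0; i < records; i++) {
            writeOne.accept(i);      // stand-in for writer.write(record)
        }
        long writeNanos = System.nanoTime() - t1;
        return new long[] { setupNanos, writeNanos };
    }

    static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Simulated expensive setup vs. cheap per-record work.
        long[] r = measure(() -> sleepQuietly(50), i -> { }, 10);
        System.out.println("setup ms: " + TimeUnit.NANOSECONDS.toMillis(r[0]));
        System.out.println("write ms: " + TimeUnit.NANOSECONDS.toMillis(r[1]));
    }
}
```

If the setup bucket dominates, amortizing it (reusing one writer for many records, or keeping the process warm) should close most of the gap with JSON.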
[jira] [Updated] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter
[ https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Kizhakkel Jose updated PARQUET-1679:
--
Description:
Hi, I am getting org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: optional group id {} when I include a UUID field in my POJO. Without the UUID everything works fine. I have seen that Parquet supports UUID as part of [#PR-71] in the 2.4 release, but I am still getting InvalidSchemaException for UUID. Is there something I am missing, or is this a known issue?

*My setup details:*

*Gradle dependency:*
dependencies {
    compile group: 'org.springframework.boot', name: 'spring-boot-starter'
    compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
    compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
    compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
    compile group: 'joda-time', name: 'joda-time'
    compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
    compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}

*Model used:*
@Data
public class Employee {
    private UUID id;
    private String name;
    private int age;
    private Address address;
}

@Data
public class Address {
    private String streetName;
    private String city;
    private Zip zip;
}

@Data
public class Zip {
    private int zip;
    private int ext;
}

+*My Serializer Code:*+
public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {
    Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
    Class clazz = inputDataToSerialize.get(0).getClass();
    try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path)
            .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
            .withDataModel(ReflectData.get())
            .withConf(parquetConfiguration)
            .withCompressionCodec(compressionCodecName)
            .withWriteMode(OVERWRITE)
            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
            .build()) {
        for (D input : inputDataToSerialize) {
            writer.write(input);
        }
    }
}

_Where generic type D is Employee_

> Invalid SchemaException for UUID while using AvroParquetWriter
> --
>
> Key: PARQUET-1679
> URL: https://issues.apache.org/jira/browse/PARQUET-1679
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro
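The empty group arises because ReflectData has no reflective mapping for java.util.UUID, so it emits an empty record for the field. Until that is supported, a common workaround (an assumption on my part, not an official fix) is to expose the id as a String on the POJO so ReflectData sees a plain Avro string; this hypothetical variant of the Employee model converts at the accessor boundary:

```java
import java.util.UUID;

// Hypothetical variant of the reporter's Employee model: the UUID is
// stored as its canonical string form so ReflectData maps the field to
// an Avro "string" instead of an empty group.
public class Employee {
    private String id;   // UUID held as a string for schema generation
    private String name;
    private int age;

    public void setId(UUID uuid) { this.id = uuid == null ? null : uuid.toString(); }
    public UUID getId() { return id == null ? null : UUID.fromString(id); }

    public void setName(String name) { this.name = name; }
    public String getName() { return name; }

    public void setAge(int age) { this.age = age; }
    public int getAge() { return age; }
}
```

The round trip is lossless because UUID.toString and UUID.fromString are exact inverses for canonical UUID strings; the cost is 36 bytes per value instead of 16.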
[jira] [Created] (PARQUET-1680) Parquet Java Serialization is very slow
Felix Kizhakkel Jose created PARQUET-1680:
-
Summary: Parquet Java Serialization is very slow
Key: PARQUET-1680
URL: https://issues.apache.org/jira/browse/PARQUET-1680
Project: Parquet
Issue Type: Bug
Components: parquet-avro
Affects Versions: 1.10.1
Reporter: Felix Kizhakkel Jose

Hi, I am doing a POC to compare different data formats and their performance in terms of serialization/deserialization speed, storage size, compatibility between languages, etc. When I serialize a simple Java object to a Parquet file, it takes 6-7 seconds, whereas serializing the same object to JSON takes 100 milliseconds. Could you help me resolve this issue?

+*My Configuration and code snippet:*+

*Gradle dependencies*
dependencies {
    compile group: 'org.springframework.boot', name: 'spring-boot-starter'
    compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
    compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
    compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
    compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
    compile group: 'joda-time', name: 'joda-time'
    compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
    compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}

*Code snippet:*
public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {
    Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
    Path path1 = new Path("/Users/felixkizhakkeljose/Downloads/data_" + compressionCodecName + ".parquet");
    Class clazz = inputDataToSerialize.get(0).getClass();
    try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
            .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
            .withDataModel(ReflectData.get())
            .withConf(parquetConfiguration)
            .withCompressionCodec(compressionCodecName)
            .withWriteMode(OVERWRITE)
            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
            .build()) {
        for (D input : inputDataToSerialize) {
            writer.write(input);
        }
    }
}

*Note:* *I have tried saving the data to the local file system as well as to AWS S3; both give the same result - very slow.*

--
This message was sent by Atlassian Jira (v8.3.4#803005)
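One easily amortized cost in the snippet above is recomputing the Avro schema via reflection on every serialize() call. A dependency-free sketch of a per-class memoizer for that work (the ReflectData call in the comment is how it would plug in; the generic value type keeps this sketch runnable without parquet-avro on the classpath):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Generic per-class memoizer: compute an expensive per-class value once
// (e.g. ReflectData.AllowNull.get().getSchema(clazz)) and reuse it on
// every subsequent serialize() call for the same class.
public class SchemaCache<V> {
    private final Map<Class<?>, V> cache = new ConcurrentHashMap<>();
    private final Function<Class<?>, V> compute;

    public SchemaCache(Function<Class<?>, V> compute) {
        this.compute = compute;
    }

    public V get(Class<?> clazz) {
        // computeIfAbsent runs the function at most once per key.
        return cache.computeIfAbsent(clazz, compute);
    }
}
```

Usage would look like `new SchemaCache<>(c -> ReflectData.AllowNull.get().getSchema(c))` held as a field, with `serialize()` calling `cache.get(clazz)` instead of regenerating the schema; this removes the reflection cost from every call after the first.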
[jira] [Created] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter
Felix Kizhakkel Jose created PARQUET-1679:
-
Summary: Invalid SchemaException for UUID while using AvroParquetWriter
Key: PARQUET-1679
URL: https://issues.apache.org/jira/browse/PARQUET-1679
Project: Parquet
Issue Type: Bug
Components: parquet-avro
Affects Versions: 1.10.1
Reporter: Felix Kizhakkel Jose

Hi, I am getting org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: optional group id {} when I include a UUID field in my POJO. Without the UUID everything works fine. I have seen that Parquet supports UUID as part of [#PR-71] in the 2.4 release, but I am still getting InvalidSchemaException for UUID. Is there something I am missing, or is this a known issue?

*My setup details:*

*Gradle dependency:*
dependencies {
    compile group: 'org.springframework.boot', name: 'spring-boot-starter'
    compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
    compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
    compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
    compile group: 'joda-time', name: 'joda-time'
    compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
    compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}

*Model used:*
@Data
public class Employee {
    private UUID id;
    private String name;
    private int age;
    private Address address;
}

+*My Serializer Code:*+
public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {
    Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
    Class clazz = inputDataToSerialize.get(0).getClass();
    try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path)
            .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
            .withDataModel(ReflectData.get())
            .withConf(parquetConfiguration)
            .withCompressionCodec(compressionCodecName)
            .withWriteMode(OVERWRITE)
            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
            .build()) {
        for (D input : inputDataToSerialize) {
            writer.write(input);
        }
    }
}

_Where generic type D is Employee_