Felix Kizhakkel Jose updated PARQUET-1680:
------------------------------------------
Description:

Hi,
I am doing a POC to compare different data formats and their performance in terms of serialization/deserialization speed, storage size, cross-language compatibility, etc.
When I serialize a simple Java object to a Parquet file, it takes _*6-7 seconds*_, while serializing the same object to JSON takes *_100 milliseconds_*.
Could you help me resolve this issue?

+*My configuration and code snippet:*+

*Gradle dependencies*
{code:groovy}
dependencies {
    compile group: 'org.springframework.boot', name: 'spring-boot-starter'
    compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
    compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
    compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
    compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
    compile group: 'joda-time', name: 'joda-time'
    compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
    compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}
{code}

*Code snippet:*
{code:java}
public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {
    Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
    Path path1 = new Path("/Downloads/data_" + compressionCodecName + ".parquet");
    Class<?> clazz = inputDataToSerialize.get(0).getClass();

    try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
            .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
            .withDataModel(ReflectData.get())
            .withConf(parquetConfiguration)
            .withCompressionCodec(compressionCodecName)
            .withWriteMode(OVERWRITE)
            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
            .build()) {
        for (D input : inputDataToSerialize) {
            writer.write(input);
        }
    }
}
{code}

+*Model used:*+
{code:java}
@Data
public class Employee {
    //private UUID id;
    private String name;
    private int age;
    private Address address;
}

@Data
public class Address {
    private String streetName;
    private String city;
    private Zip zip;
}

@Data
public class Zip {
    private int zip;
    private int ext;
}
{code}

{code:java}
private List<Employee> getInputDataToSerialize() {
    Address address = new Address();
    address.setStreetName("Murry Ridge Dr");
    address.setCity("Murrysville");

    Zip zip = new Zip();
    zip.setZip(15668);
    zip.setExt(1234);
    address.setZip(zip);

    List<Employee> employees = new ArrayList<>();
    IntStream.range(0, 100000).forEach(i -> {
        Employee employee = new Employee();
        // employee.setId(UUID.randomUUID());
        employee.setAge(20);
        employee.setName("Test" + i);
        employee.setAddress(address);
        employees.add(employee);
    });
    return employees;
}
{code}

*Note:* *I have tried saving the data to the local file system as well as to AWS S3; both show the same result - very slow.*
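The 6-7 s vs. 100 ms comparison is easier to interpret if one-time setup cost is separated from steady-state write cost: much of a first Parquet write typically goes to Hadoop {{FileSystem}} initialization and reflection-based Avro schema generation, not to writing rows. Below is a minimal, dependency-free timing sketch (class and method names are mine, not from the issue, and the Lombok classes are rewritten as plain POJOs so it compiles stand-alone); the commented slots are where the {{serialize()}} calls from the snippet above would go:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

public class SerializationTimer {
    // Plain-POJO versions of the issue's Lombok @Data classes, so this sketch compiles stand-alone.
    static class Zip { int zip, ext; }
    static class Address { String streetName; String city; Zip zip; }
    static class Employee { String name; int age; Address address; }

    // Same test data as getInputDataToSerialize(): 100,000 employees sharing one Address instance.
    static List<Employee> buildInput() {
        Address address = new Address();
        address.streetName = "Murry Ridge Dr";
        address.city = "Murrysville";
        Zip zip = new Zip();
        zip.zip = 15668;
        zip.ext = 1234;
        address.zip = zip;

        List<Employee> employees = new ArrayList<>();
        IntStream.range(0, 100_000).forEach(i -> {
            Employee e = new Employee();
            e.age = 20;
            e.name = "Test" + i;
            e.address = address;
            employees.add(e);
        });
        return employees;
    }

    // Times a task in milliseconds. Running the same task twice separates one-time
    // setup cost (FileSystem init, schema generation) from steady-state write cost.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        List<Employee> input = buildInput();
        long first = timeMillis(() -> { /* serialize(input, codec) -- first call, includes setup */ });
        long second = timeMillis(() -> { /* serialize(input, codec) -- warmed-up call */ });
        System.out.println(input.size() + " records; first=" + first + " ms, second=" + second + " ms");
    }
}
```

If the second call is dramatically cheaper than the first, the 6-7 s figure is mostly setup overhead rather than per-record Parquet cost.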
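Separately, if steady-state writes remain slow, note that {{ParquetWriter}} buffers an entire row group in memory before flushing, so per-record write calls are cheap but the final flush is not. The builder exposes the relevant knobs; a sketch of the same builder with explicit sizing (the values shown are the parquet 1.10 library defaults, shown here only to make the buffering behavior visible, not as a tuning recommendation):

```java
// Illustrative configuration fragment: same builder as in the issue, with
// explicit row-group/page sizing. 128 MB row groups and 1 MB pages are the
// parquet 1.10 defaults (ParquetWriter.DEFAULT_BLOCK_SIZE / DEFAULT_PAGE_SIZE).
try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
        .withSchema(ReflectData.AllowNull.get().getSchema(clazz))
        .withDataModel(ReflectData.get())
        .withConf(parquetConfiguration)
        .withCompressionCodec(compressionCodecName)
        .withWriteMode(OVERWRITE)
        .withRowGroupSize(ParquetWriter.DEFAULT_BLOCK_SIZE) // 128 MB buffered before flush
        .withPageSize(ParquetWriter.DEFAULT_PAGE_SIZE)      // 1 MB pages within a row group
        .withDictionaryEncoding(true)
        .build()) {
    for (D input : inputDataToSerialize) {
        writer.write(input);
    }
}
```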
> Parquet Java Serialization is very slow
> ----------------------------------------
>
>                 Key: PARQUET-1680
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1680
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.10.1
>            Reporter: Felix Kizhakkel Jose
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)