[
https://issues.apache.org/jira/browse/PARQUET-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Felix Kizhakkel Jose updated PARQUET-1680:
------------------------------------------
Description:
Hi,
I am doing a POC to compare different data formats and their performance in
terms of serialization/deserialization speed, storage size, and compatibility
between different languages.
When I try to serialize a simple Java object to a Parquet file, it takes _*6-7
seconds*_, whereas serializing the same object to JSON takes *_100 milliseconds_*.
Could you help me resolve this issue?
*My Configuration and code snippet:*
*Gradle dependencies*
dependencies {
    compile group: 'org.springframework.boot', name: 'spring-boot-starter'
    compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
    compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
    compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
    compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
    compile group: 'joda-time', name: 'joda-time'
    compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
    compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}
*Code snippet:*
public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {
    Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
    Path path1 = new Path("/Downloads/data_" + compressionCodecName + ".parquet");
    Class clazz = inputDataToSerialize.get(0).getClass();
    try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
            .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
            .withDataModel(ReflectData.get())
            .withConf(parquetConfiguration)
            .withCompressionCodec(compressionCodecName)
            .withWriteMode(OVERWRITE)
            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
            .build()) {
        for (D input : inputDataToSerialize) {
            writer.write(input);
        }
    }
}
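Not part of the original report, but one way to narrow this down: much of a one-shot 6-7 s figure can come from one-time setup (loading the Hadoop Configuration, initializing the native compression codec, reflective schema generation) rather than from the 100,000 `writer.write` calls. A minimal, stdlib-only timing sketch, assuming the two phases are factored into `Runnable`s (the `TimeSections` class below is a hypothetical harness, not part of the Parquet API):

```java
// Diagnostic sketch: time writer construction separately from the write
// loop, to see which phase dominates. The Runnable arguments are
// hypothetical stand-ins for the AvroParquetWriter builder call and the
// for-loop of writer.write(input) in the snippet above.
public final class TimeSections {

    /** Returns {setupMillis, writeLoopMillis} for the two phases. */
    public static long[] run(Runnable setup, Runnable writeLoop) {
        long t0 = System.nanoTime();
        setup.run();                         // one-time cost: builder, config, codec
        long t1 = System.nanoTime();
        writeLoop.run();                     // per-record cost: the write loop
        long t2 = System.nanoTime();
        return new long[] {(t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000};
    }

    public static void main(String[] args) {
        long[] d = run(
                () -> { /* build the AvroParquetWriter here */ },
                () -> { /* run the writer.write(input) loop here */ });
        System.out.println("setup=" + d[0] + "ms, writes=" + d[1] + "ms");
    }
}
```

If the first number dominates, the cost is amortized over larger batches and is not a per-record encoding problem.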
*Model Used:*
@Data
public class Employee {
    //private UUID id;
    private String name;
    private int age;
    private Address address;
}

@Data
public class Address {
    private String streetName;
    private String city;
    private Zip zip;
}

@Data
public class Zip {
    private int zip;
    private int ext;
}
private List<Employee> getInputDataToSerialize() {
    Address address = new Address();
    address.setStreetName("Murry Ridge Dr");
    address.setCity("Murrysville");

    Zip zip = new Zip();
    zip.setZip(15668);
    zip.setExt(1234);
    address.setZip(zip);

    List<Employee> employees = new ArrayList<>();
    IntStream.range(0, 100000).forEach(i -> {
        Employee employee = new Employee();
        // employee.setId(UUID.randomUUID());
        employee.setAge(20);
        employee.setName("Test" + i);
        employee.setAddress(address);
        employees.add(employee);
    });
    return employees;
}
*Note:*
*I have tried saving the data to the local file system as well as to AWS S3; both give the same result - the write is very slow.*
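A second diagnostic worth noting here: a single cold run also pays JVM class-loading and JIT warm-up costs, which would affect both the local-filesystem and the S3 path equally. Invoking the serializer twice in the same process and comparing the two durations separates that one-time cost from the steady-state cost. A stdlib-only sketch, assuming the whole write is wrapped in a `Runnable` (the `WarmupCheck` class is hypothetical):

```java
// Runs the same task twice and reports both durations. A large gap between
// the cold and warm run points at one-time setup (class loading, Hadoop
// config, native codec load, JIT compilation) rather than per-record cost.
public final class WarmupCheck {

    /** Returns {coldMillis, warmMillis} for two consecutive runs. */
    public static long[] coldVsWarm(Runnable task) {
        long[] out = new long[2];
        for (int i = 0; i < 2; i++) {
            long t0 = System.nanoTime();
            task.run();
            out[i] = (System.nanoTime() - t0) / 1_000_000;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] d = coldVsWarm(() -> { /* call serialize(...) here */ });
        System.out.println("cold=" + d[0] + "ms, warm=" + d[1] + "ms");
    }
}
```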
> Parquet Java Serialization is very slow
> ----------------------------------------
>
> Key: PARQUET-1680
> URL: https://issues.apache.org/jira/browse/PARQUET-1680
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro
> Affects Versions: 1.10.1
> Reporter: Felix Kizhakkel Jose
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)