Felix Kizhakkel Jose created PARQUET-1680:
---------------------------------------------

             Summary: Parquet Java Serialization is very slow
                 Key: PARQUET-1680
                 URL: https://issues.apache.org/jira/browse/PARQUET-1680
             Project: Parquet
          Issue Type: Bug
          Components: parquet-avro
    Affects Versions: 1.10.1
            Reporter: Felix Kizhakkel Jose


Hi,
I am doing a POC to compare different data formats and their performance in 
terms of serialization/deserialization speed, storage size, compatibility 
across different languages, etc. 
When I serialize a simple Java object to a Parquet file, it takes 6-7 seconds, 
whereas serializing the same object to JSON takes 100 milliseconds.

Could you help me to resolve this issue?

*My Configuration and code snippet:*

*Gradle dependencies*
dependencies {
 compile group: 'org.springframework.boot', name: 'spring-boot-starter'
 compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'

 compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
 compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
 compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
 compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
 compile group: 'joda-time', name: 'joda-time'
 compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
 compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}

*Code snippet:*

public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {

    Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
    Path path1 = new Path("/Users/felixkizhakkeljose/Downloads/data_" + compressionCodecName + ".parquet");
    Class clazz = inputDataToSerialize.get(0).getClass();

    try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
            .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
            .withDataModel(ReflectData.get())
            .withConf(parquetConfiguration)
            .withCompressionCodec(compressionCodecName)
            .withWriteMode(OVERWRITE)
            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
            .build()) {

        for (D input : inputDataToSerialize) {
            writer.write(input);
        }
    }
}

*Note:*
*I have tried saving the data to the local file system as well as to AWS S3, 
and both give the same result: very slow.*
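
A minimal sketch of how the two timings quoted above could be taken in a comparable way. It assumes a hypothetical helper class PhaseTimer, which is not part of Parquet, Avro, or Hadoop. The helper runs a task a few times untimed before measuring, so that one-off costs such as class loading and JIT warm-up can be told apart from the steady-state serialization cost. The workload in main is only a placeholder; in the POC it would be the serialize(...) call from the snippet or the JSON equivalent. Whether warm-up explains any of the 6-7 seconds here is not established.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical timing helper for illustration only.
class PhaseTimer {

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    static double timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000.0;
    }

    /** Runs `warmups` untimed rounds, then returns the timings of `rounds` measured rounds. */
    static List<Double> measure(Runnable task, int warmups, int rounds) {
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        List<Double> timings = new ArrayList<>();
        for (int i = 0; i < rounds; i++) {
            timings.add(timeMillis(task));
        }
        return timings;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the Parquet or JSON serialization call.
        Runnable work = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        };

        // The first, cold run includes one-off setup costs.
        System.out.printf("cold run: %.2f ms%n", timeMillis(work));

        // Later runs approximate the steady-state cost.
        for (double ms : measure(work, 3, 5)) {
            System.out.printf("warm run: %.2f ms%n", ms);
        }
    }
}
```

Reporting the cold run and the warm runs separately makes it easier to see whether the gap between Parquet and JSON comes from writer setup or from the per-record write path.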



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
