[
https://issues.apache.org/jira/browse/PARQUET-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Felix Kizhakkel Jose updated PARQUET-1680:
------------------------------------------
Description:
Hi,
I am doing a POC to compare different data formats and their performance in
terms of serialization/deserialization speed, storage size, and compatibility
between different languages.
When I try to serialize a simple Java object to a Parquet file, it takes _*6-7
seconds*_, whereas serializing the same object to JSON takes *_100 milliseconds_*.
Could you help me resolve this issue?
*My Configuration and code snippet:*
*Gradle dependencies*
dependencies {
    compile group: 'org.springframework.boot', name: 'spring-boot-starter'
    compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
    compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
    compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
    compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
    compile group: 'joda-time', name: 'joda-time'
    compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
    compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
}
*Code snippet:*
public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {
    Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
    Path path1 = new Path("/Downloads/data_" + compressionCodecName + ".parquet");
    Class clazz = inputDataToSerialize.get(0).getClass();
    try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
            .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
            .withDataModel(ReflectData.get())
            .withConf(parquetConfiguration)
            .withCompressionCodec(compressionCodecName)
            .withWriteMode(OVERWRITE)
            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
            .build()) {
        for (D input : inputDataToSerialize) {
            writer.write(input);
        }
    }
}
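Not part of the original report, but one way to narrow this down: much of a one-shot 6-7 s figure can come from one-time setup (loading the Hadoop Configuration, initializing the native compression codec, reflective schema generation) rather than from the 100,000 `writer.write` calls. A minimal, stdlib-only timing sketch, assuming the two phases are factored into `Runnable`s (the `TimeSections` class below is a hypothetical harness, not part of the Parquet API):

```java
// Diagnostic sketch: time writer construction separately from the write
// loop, to see which phase dominates. The Runnable arguments are
// hypothetical stand-ins for the AvroParquetWriter builder call and the
// for-loop of writer.write(input) in the snippet above.
public final class TimeSections {

    /** Returns {setupMillis, writeLoopMillis} for the two phases. */
    public static long[] run(Runnable setup, Runnable writeLoop) {
        long t0 = System.nanoTime();
        setup.run();                         // one-time cost: builder, config, codec
        long t1 = System.nanoTime();
        writeLoop.run();                     // per-record cost: the write loop
        long t2 = System.nanoTime();
        return new long[] {(t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000};
    }

    public static void main(String[] args) {
        long[] d = run(
                () -> { /* build the AvroParquetWriter here */ },
                () -> { /* run the writer.write(input) loop here */ });
        System.out.println("setup=" + d[0] + "ms, writes=" + d[1] + "ms");
    }
}
```

If the first number dominates, the cost is amortized over larger batches and is not a per-record encoding problem.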
*Model Used:*
@Data
public class Employee {
    //private UUID id;
    private String name;
    private int age;
    private Address address;
}

@Data
public class Address {
    private String streetName;
    private String city;
    private Zip zip;
}

@Data
public class Zip {
    private int zip;
    private int ext;
}
private List<Employee> getInputDataToSerialize() {
    Address address = new Address();
    address.setStreetName("Murry Ridge Dr");
    address.setCity("Murrysville");

    Zip zip = new Zip();
    zip.setZip(15668);
    zip.setExt(1234);
    address.setZip(zip);

    List<Employee> employees = new ArrayList<>();
    IntStream.range(0, 100000).forEach(i -> {
        Employee employee = new Employee();
        // employee.setId(UUID.randomUUID());
        employee.setAge(20);
        employee.setName("Test" + i);
        employee.setAddress(address);
        employees.add(employee);
    });
    return employees;
}
*Note:*
*I have tried saving the data to the local file system as well as to AWS S3; both give the same result - the write is very slow.*
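A second diagnostic worth noting here: a single cold run also pays JVM class-loading and JIT warm-up costs, which would affect both the local-filesystem and the S3 path equally. Invoking the serializer twice in the same process and comparing the two durations separates that one-time cost from the steady-state cost. A stdlib-only sketch, assuming the whole write is wrapped in a `Runnable` (the `WarmupCheck` class is hypothetical):

```java
// Runs the same task twice and reports both durations. A large gap between
// the cold and warm run points at one-time setup (class loading, Hadoop
// config, native codec load, JIT compilation) rather than per-record cost.
public final class WarmupCheck {

    /** Returns {coldMillis, warmMillis} for two consecutive runs. */
    public static long[] coldVsWarm(Runnable task) {
        long[] out = new long[2];
        for (int i = 0; i < 2; i++) {
            long t0 = System.nanoTime();
            task.run();
            out[i] = (System.nanoTime() - t0) / 1_000_000;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] d = coldVsWarm(() -> { /* call serialize(...) here */ });
        System.out.println("cold=" + d[0] + "ms, warm=" + d[1] + "ms");
    }
}
```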
> Parquet Java Serialization is very slow
> ----------------------------------------
>
> Key: PARQUET-1680
> URL: https://issues.apache.org/jira/browse/PARQUET-1680
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro
> Affects Versions: 1.10.1
> Reporter: Felix Kizhakkel Jose
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)