[jira] [Updated] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

2019-10-18 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-1681:
-
Description: 
When the Avro schema below is used to write a parquet(1.8.1) file and the file 
is then read back with parquet 1.10.1 without passing any schema, the read 
throws an exception "XXX is not a group". Reading with parquet 1.8.1 is fine. 

{
  "name": "phones",
  "type": [
    "null",
    {
      "type": "array",
      "items": {
        "type": "record",
        "name": "phones_items",
        "fields": [
          {
            "name": "phone_number",
            "type": ["null", "string"],
            "default": null
          }
        ]
      }
    }
  ],
  "default": null
}

The code used to read is as below:

val reader = AvroParquetReader.builder[SomeRecordType](parquetPath).withConf(new Configuration).build()

reader.read()

PARQUET-651 changed the method isElementType() to rely on Avro's 
checkReaderWriterCompatibility() for the compatibility check. However, 
checkReaderWriterCompatibility() considers the Parquet schema and the Avro 
schema (converted from the file schema) incompatible, because the element 
record is named 'phones_items' in the Avro schema but 'array' in the Parquet 
schema. isElementType() therefore returns false, which causes the 
"phone_number" field in the schema above to be treated as a group type, which 
it is not, and the call to .asGroupType() then throws the exception. 
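
For illustration, a minimal, self-contained sketch of the mismatch (the schema 
strings are reconstructed from this description, not taken from the original 
file):

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class NameMismatchDemo {
  public static void main(String[] args) {
    // Parquet-side element record: the converted file schema names the
    // list element group "array".
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"array\",\"fields\":["
        + "{\"name\":\"phone_number\",\"type\":[\"null\",\"string\"],\"default\":null}]}");
    // Avro-side element record from the schema above, named "phones_items".
    Schema reader = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"phones_items\",\"fields\":["
        + "{\"name\":\"phone_number\",\"type\":[\"null\",\"string\"],\"default\":null}]}");
    // Prints INCOMPATIBLE: the record names differ even though the fields match.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(reader, writer).getType());
  }
}
{code}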

I haven't verified whether writing with parquet 1.10.1 reproduces the same 
problem, but it could, because the translation from Avro schema to Parquet 
schema has not changed (not verified yet). 

 I hesitate to revert PARQUET-651 because it solved several problems. I would 
like to hear the community's thoughts on it. 
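
A possible workaround is to pass the original Avro schema to the reader so 
parquet-avro does not have to guess the element type. A minimal sketch (the 
schema file path is hypothetical, and I have not verified this against such a 
file):

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class ReadWithExplicitSchema {
  public static void main(String[] args) throws Exception {
    // "/phones.avsc" is a placeholder for the writer's Avro schema.
    Schema avroSchema = new Schema.Parser().parse(
        ReadWithExplicitSchema.class.getResourceAsStream("/phones.avsc"));
    Configuration conf = new Configuration();
    // Tell parquet-avro the exact read schema instead of letting it guess.
    AvroReadSupport.setAvroReadSchema(conf, avroSchema);
    try (ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(new Path(args[0]))
        .withConf(conf)
        .build()) {
      System.out.println(reader.read());
    }
  }
}
{code}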

> Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
> -
>
> Key: PARQUET-1681
> URL: https://issues.apache.org/jira/browse/PARQUET-1681
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.10.0, 1.9.1, 1.11.0
>Reporter: Xinli Shang
>Priority: Critical
> Fix For: 1.11.0
>
>

[jira] [Comment Edited] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-18 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954944#comment-16954944
 ] 

Felix Kizhakkel Jose edited comment on PARQUET-1679 at 10/18/19 8:08 PM:
-

Hi [~q.xu], 
 Thank you for the quick response. Could you please give me a sample, or a 
snippet of what you mentioned?

PS: The model I have is already defined and used by other consumers, so I 
cannot modify it.


was (Author: felixkjose):
Hi [~q.xu], 
Thank you for the quick response. Could you please give me a sample or could 
you give me snippet on what you mentioned?

> Invalid SchemaException for UUID while using AvroParquetWriter
> --
>
> Key: PARQUET-1679
> URL: https://issues.apache.org/jira/browse/PARQUET-1679
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.10.1
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hi,
> I am getting org.apache.parquet.schema.InvalidSchemaException: Cannot write a 
> schema with an empty group: optional group id {} when I include a UUID field 
> in my POJO. Without the UUID everything works fine. I have seen that Parquet 
> supports UUID as part of [#PR-71] in the 2.4 release, but I am getting 
> InvalidSchemaException for UUID. Is there anything I am missing, or is it a 
> known issue?
> *My setup details:*
> *gradle dependency:*
> dependencies {
>     compile group: 'org.springframework.boot', name: 'spring-boot-starter'
>     compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
>     compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
>     compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.1'
>     compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
>     compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
>     compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
>     compile group: 'joda-time', name: 'joda-time'
>     compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
>     compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
> }
> *Model used:*
> @Data
> public class Employee {
>     private UUID id;
>     private String name;
>     private int age;
>     private Address address;
> }
> @Data
> public class Address {
>     private String streetName;
>     private String city;
>     private Zip zip;
> }
> @Data
> public class Zip {
>     private int zip;
>     private int ext;
> }
>  
> *My Serializer Code:*
> public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {
>     Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
>     Class clazz = inputDataToSerialize.get(0).getClass();
>     try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path)
>             .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
>             .withDataModel(ReflectData.get())
>             .withConf(parquetConfiguration)
>             .withCompressionCodec(compressionCodecName)
>             .withWriteMode(OVERWRITE)
>             .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>             .build()) {
>         for (D input : inputDataToSerialize) {
>             writer.write(input);
>         }
>     }
> }
> private List<Employee> getInputDataToSerialize() {
>     Address address = new Address();
>     address.setStreetName("Murry Ridge Dr");
>     address.setCity("Murrysville");
>     Zip zip = new Zip();
>     zip.setZip(15668);
>     zip.setExt(1234);
>     address.setZip(zip);
>     List<Employee> employees = new ArrayList<>();
>     IntStream.range(0, 10).forEach(i -> {
>         Employee employee = new Employee();
>         // employee.setId(UUID.randomUUID());
>         employee.setAge(20);
>         employee.setName("Test" + i);
>         employee.setAddress(address);
>         employees.add(employee);
>     });
>     return employees;
> }
> _Where the generic type D is Employee_



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-18 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954944#comment-16954944
 ] 

Felix Kizhakkel Jose commented on PARQUET-1679:
---

Hi [~q.xu], 
Thank you for the quick response. Could you please give me a sample, or a 
snippet of what you mentioned?




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-18 Thread Qinghui Xu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954937#comment-16954937
 ] 

Qinghui Xu commented on PARQUET-1679:
-

It seems that your schema builder (`ReflectData`) takes `UUID` as an ordinary 
class and tries to use its fields as part of the schema. Maybe you should 
use a byte array for the uuid field.
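
A sketch of that workaround (the field change and conversion are illustrative 
and not verified against ReflectData's exact mapping): store the UUID as 16 
raw bytes so ReflectData maps the field to Avro "bytes" instead of recursing 
into java.util.UUID, which (per the error above) yields an empty group:

{code:java}
import java.nio.ByteBuffer;
import java.util.UUID;

public class Employee {
    private byte[] id; // was: private UUID id;
    private String name;
    private int age;

    // Pack the UUID's two longs into 16 bytes for the Avro/Parquet schema.
    public void setId(UUID uuid) {
        this.id = ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }
}
{code}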




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-18 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954922#comment-16954922
 ] 

Felix Kizhakkel Jose commented on PARQUET-1679:
---

org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an 
empty group: required group id {}
 at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27)
 at org.apache.parquet.schema.GroupType.accept(GroupType.java:226)
 at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:31)
 at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37)
 at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
 at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23)
 at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:233)
 at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:280)
 at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:535)
 at com.philips.felix.parquet.ParquetDataSerializer.serialize(ParquetDataSerializer.java:64)
 at com.philips.felix.parquet.Application.run(Application.java:62)
 at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:800)
 at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
 at org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
 at org.springframework.boot.SpringApplication.run(SpringApplication.java:316)
 at org.springframework.boot.SpringApplication.run(SpringApplication.java:1186)
 at org.springframework.boot.SpringApplication.run(SpringApplication.java:1175)
 at com.philips.felix.parquet.Application.main(Application.java:37)


[jira] [Created] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

2019-10-18 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-1681:


 Summary: Avro's isElementType() change breaks the reading of some 
parquet(1.8.1) files
 Key: PARQUET-1681
 URL: https://issues.apache.org/jira/browse/PARQUET-1681
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-avro
Affects Versions: 1.10.0, 1.9.1, 1.11.0
Reporter: Xinli Shang
 Fix For: 1.11.0





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-18 Thread Qinghui Xu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954915#comment-16954915
 ] 

Qinghui Xu commented on PARQUET-1679:
-

Do you have a stacktrace or something?

> Invalid SchemaException for UUID while using AvroParquetWriter
> --
>
> Key: PARQUET-1679
> URL: https://issues.apache.org/jira/browse/PARQUET-1679
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.10.1
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hi,
> I am getting org.apache.parquet.schema.InvalidSchemaException: Cannot write a 
> schema with an empty group: optional group id {} while I include a UUID field 
> on my POJO object. Without UUID everything worked fine. I have seen Parquet 
> suports UUID as part of [#PR-71] on 2.4 release. 
>  But I am getting InvalidSchemaException on UUID. Is there anything that I am 
> missing or its a known issue?
> *My setup details:*
> *gradle dependency :*
> dependencies
> { compile group: 'org.springframework.boot', name: 'spring-boot-starter' 
> compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6' compile 
> group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271' 
> compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.1' 
> compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1' 
> compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1' 
> compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1' 
> compile group: 'joda-time', name: 'joda-time' compile group: 
> 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5' 
> compile group: 'com.fasterxml.jackson.datatype', name: 
> 'jackson-datatype-joda', version: '2.6.5' }
> *Model used:*
> @Data
>  public class Employee
> { private UUID id; private String name; private int age; private Address 
> address; }
> @Data
>  public class Address
> { private String streetName; private String city; private Zip zip; }
> @Data
>  public class Zip
> { private int zip; private int ext; }
>  
> +*My Serializer Code:*+
> public void serialize(List inputDataToSerialize, CompressionCodecName 
> compressionCodecName) throws IOException {
> Path path = new 
> Path("s3a://parquetpoc/data_"++compressionCodecName++".parquet");
>  Class clazz = inputDataToSerialize.get(0).getClass();
> try (ParquetWriter writer = AvroParquetWriter.builder(path)
>  .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate 
> nullable fields
>  .withDataModel(ReflectData.get())
>  .withConf(parquetConfiguration)
>  .withCompressionCodec(compressionCodecName)
>  .withWriteMode(OVERWRITE)
>  .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>  .build()) {
> for (D input : inputDataToSerialize)
> { writer.write(input); }
> }
>  }
> private List *getInputDataToSerialize*(){
> Address address = new Address();
> address.setStreetName("Murry Ridge Dr");
> address.setCity("Murrysville");
> Zip zip = new Zip();
> zip.setZip(15668);
> zip.setExt(1234);
> address.setZip(zip);
> List employees = new ArrayList<>();
> IntStream.range(0, 10).forEach(i->
> { Employee employee = new Employee(); // employee.setId(UUID.randomUUID()); 
> employee.setAge(20); employee.setName("Test"+i); 
> employee.setAddress(address); employees.add(employee); }
> );
> return employees;
> }
> _**Where generic Type D is Employee_



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Help on Parquet Write Slowness and UUID support

2019-10-18 Thread Kizhakkel Jose, Felix
Hello,

I am from the Philips Architecture team, where I am working on a POC to 
compare different data formats [Parquet/Avro/JSON]. I see that Parquet is very 
slow when writing [POJO to Parquet file].

I have created two issues in the Parquet project. One is regarding the 
slowness of ParquetWriter compared to the JSON and Avro writers:

Avro Serialization Stats: StopWatch 'AvroSerializer': running time (millis) = 
387
JSON serialization Stats: StopWatch 'JsonSerializer': running time (millis) = 
103
Parquet Serialization Stats: StopWatch 'ParquetSerializer': running time 
(millis) = 8346

https://issues.apache.org/jira/browse/PARQUET-1680

The second issue is that I was not able to serialize a Java object to Parquet 
when the POJO has a UUID field; Parquet throws an exception.
https://issues.apache.org/jira/browse/PARQUET-1679

Could you please tell me what I am doing wrong, or give me some insight into 
resolving these issues?

Regards,
Felix K Jose


The information contained in this message may be confidential and legally 
protected under applicable law. The message is intended solely for the 
addressee(s). If you are not the intended recipient, you are hereby notified 
that any use, forwarding, dissemination, or reproduction of this message is 
strictly prohibited and may be unlawful. If you are not the intended recipient, 
please contact the sender by return e-mail and destroy all copies of the 
original message.


Re: Updating parquet web site

2019-10-18 Thread Driesprong, Fokko
Great work!

On Fri, Oct 18, 2019 at 17:53, Ryan Blue wrote:

> Sounds good to me! Thanks for taking care of this.
>
> On Fri, Oct 18, 2019 at 1:44 AM Gabor Szadovszky  wrote:
>
> > Hi Uwe,
> >
> > parquet-site sounds good to me.
> >
> > Cheers,
> > Gabor
> >
> > On Fri, Oct 18, 2019 at 10:19 AM Uwe L. Korn  wrote:
> >
> > > Hello Gabor,
> > >
> > > can we call this for clarity  https://github.com/apache/parquet-site ?
> > >
> > > Thanks
> > > Uwe
> > >
> > > On Fri, Oct 18, 2019, at 9:46 AM, Gabor Szadovszky wrote:
> > > > Dear All,
> > > >
> > > > There is some content on our web site that has been ready for an
> > > > update (for a while). To spin up the process it would be great if we
> > > > could follow the same git PR process we already have for our existing
> > > > git repos. Jim has already created PARQUET-1675
> > > >  for moving the
> > > > existing svn repo to git.
> > > >
> > > > If there are no objections I will create an infra ticket to move the
> > > > svn repo https://svn.apache.org/repos/asf/parquet to the new git
> > > > repository https://github.com/apache/parquet.
> > > >
> > > > Regards,
> > > > Gabor
> > > >
> > >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


[jira] [Updated] (PARQUET-1678) [C++] Provide classes for reading/writing using input/output operators

2019-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1678:

Labels: pull-request-available  (was: )

> [C++] Provide classes for reading/writing using input/output operators
> --
>
> Key: PARQUET-1678
> URL: https://issues.apache.org/jira/browse/PARQUET-1678
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Gawain BOLTON
>Priority: Major
>  Labels: pull-request-available
>
> The current Parquet APIs allow for reading/writing data using either:
>  # A high-level API whereby all data for each column is given to an 
> arrow::*Builder class.
>  # Or a low-level API using parquet::*Writer classes, which allows a column 
> to be selected and data items added to it as needed.
> Using the low-level approach gives great flexibility but makes for cumbersome 
> code and requires casting each column to the required type.
> I propose offering StreamReader and StreamWriter classes with C++ 
> input/output operators allowing for data to be written like this:
> {code:java}
> // N.B. schema has 3 columns of type std::string, std::int32_t and float.
> auto file_writer{ parquet::ParquetFileWriter::Open(...) };
> StreamWriter sw{ file_writer };
> // Write to output file using output operator.
> sw << "A string" << 3 << 4.5f;
> sw.nextRow();
> ...{code}
>  
> Similarly, reading would be done as follows:
> {code:java}
> auto file_reader{ parquet::ParquetFileReader::Open(...) };
> StreamReader sr{ file_reader };
> std::string s; std::int32_t i; float f;
> sr >> s >> i >> f;
> sr.nextRow();{code}
> I have written such classes and an example file which shows how to use them.
> I think that they allow for a simpler and more natural API since:
>  * No casting is needed.
>  * Code is simple and easy to read.
>  * User-defined types are easily accommodated by having the user provide the 
> input/output operator for the type.
>  * Row groups can be created "automatically" when a given amount of user data 
> has been written, or explicitly via a StreamWriter method such as 
> "createNewRowGroup()".
> I have created this ticket because where I work (www.cfm.fr) we are very 
> interested in using Parquet, but our users have requested a stream-like API. 
> We think others might also be interested in this functionality.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Updating parquet web site

2019-10-18 Thread Ryan Blue
Sounds good to me! Thanks for taking care of this.

On Fri, Oct 18, 2019 at 1:44 AM Gabor Szadovszky  wrote:

> Hi Uwe,
>
> parquet-site sounds good to me.
>
> Cheers,
> Gabor
>
> On Fri, Oct 18, 2019 at 10:19 AM Uwe L. Korn  wrote:
>
> > Hello Gabor,
> >
> > can we call this for clarity  https://github.com/apache/parquet-site ?
> >
> > Thanks
> > Uwe
> >
> > On Fri, Oct 18, 2019, at 9:46 AM, Gabor Szadovszky wrote:
> > > Dear All,
> > >
> > > There is some content on our web site that has been ready for an update
> > > (for a while). To spin up the process it would be great if we could
> > > follow the same git PR process we already have for our existing git
> > > repos. Jim has already created PARQUET-1675
> > >  for moving the
> > > existing svn repo to git.
> > >
> > > If there are no objections I will create an infra ticket to move the
> > > svn repo https://svn.apache.org/repos/asf/parquet to the new git
> > > repository https://github.com/apache/parquet.
> > >
> > > Regards,
> > > Gabor
> > >
> >
>


-- 
Ryan Blue
Software Engineer
Netflix


[jira] [Commented] (PARQUET-1679) Invalid SchemaException for UUID while using AvroParquetWriter

2019-10-18 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954634#comment-16954634
 ] 

Felix Kizhakkel Jose commented on PARQUET-1679:
---

Could someone please help me with this? I am totally blocked in my analysis of 
data format comparison because UUID is a mandatory field in all my data models.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1680) Parquet Java Serialization is very slow

2019-10-18 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954631#comment-16954631
 ] 

Felix Kizhakkel Jose commented on PARQUET-1680:
---

Could someone please help me with this?

> Parquet Java Serialization is  very slow
> 
>
> Key: PARQUET-1680
> URL: https://issues.apache.org/jira/browse/PARQUET-1680
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.10.1
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hi,
>  I am doing a POC to compare different data formats and their performance in 
> terms of serialization/deserialization speed, storage size, compatibility 
> between different languages, etc. 
>  When I try to serialize a simple Java object to a parquet file, it takes 
> _*6-7 seconds*_, whereas serializing the same object to JSON takes _*100 
> milliseconds*_.
> Could you help me resolve this issue?
> *My Configuration and code snippet:*
> *Gradle dependencies*
> dependencies {
>     compile group: 'org.springframework.boot', name: 'spring-boot-starter'
>     compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
>     compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'
>     compile group: 'org.apache.parquet', name: 'parquet-avro', version: '1.10.0'
>     compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1'
>     compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1'
>     compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.1.1'
>     compile group: 'joda-time', name: 'joda-time'
>     compile group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.5'
>     compile group: 'com.fasterxml.jackson.datatype', name: 'jackson-datatype-joda', version: '2.6.5'
> }
> *Code snippet:*
> public void serialize(List<D> inputDataToSerialize, CompressionCodecName compressionCodecName) throws IOException {
>     Path path = new Path("s3a://parquetpoc/data_" + compressionCodecName + ".parquet");
>     Path path1 = new Path("/Downloads/data_" + compressionCodecName + ".parquet");
>     Class clazz = inputDataToSerialize.get(0).getClass();
>     try (ParquetWriter<D> writer = AvroParquetWriter.<D>builder(path1)
>             .withSchema(ReflectData.AllowNull.get().getSchema(clazz)) // generate nullable fields
>             .withDataModel(ReflectData.get())
>             .withConf(parquetConfiguration)
>             .withCompressionCodec(compressionCodecName)
>             .withWriteMode(OVERWRITE)
>             .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>             .build()) {
>         for (D input : inputDataToSerialize) {
>             writer.write(input);
>         }
>     }
> }
> *Model Used:*
> @Data
> public class Employee {
>     // private UUID id;
>     private String name;
>     private int age;
>     private Address address;
> }
> @Data
> public class Address {
>     private String streetName;
>     private String city;
>     private Zip zip;
> }
> @Data
> public class Zip {
>     private int zip;
>     private int ext;
> }
>  
> private List<Employee> getInputDataToSerialize() {
>  Address address = new Address();
>  address.setStreetName("Murry Ridge Dr");
>  address.setCity("Murrysville");
>  Zip zip = new Zip();
>  zip.setZip(15668);
>  zip.setExt(1234);
>  address.setZip(zip);
>  List employees = new ArrayList<>();
>  IntStream.range(0, 10).forEach(i->{
>  Employee employee = new Employee();
>  // employee.setId(UUID.randomUUID());
>  employee.setAge(20);
>  employee.setName("Test"+i);
>  employee.setAddress(address);
>  employees.add(employee);
>  });
> return employees;
> }
> *Note:*
>  *I have tried saving the data to the local file system as well as AWS S3, 
> but both have the same result: very slow.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Working on 1.11.0 RC7

2019-10-18 Thread Driesprong, Fokko
Perfect, thanks Gabor.

Cheers, Fokko

On Fri, Oct 18, 2019 at 14:24, Gabor Szadovszky wrote:

> Hi Fokko,
>
> There is no separate branch. Based on the discussion in yesterday's
> parquet sync, 1.11.0 is planned to be released from master.
>
> Cheers,
> Gabor
>
> On Fri, Oct 18, 2019 at 14:09, Driesprong, Fokko wrote:
>
> > Thanks for doing the release Gabor,
> >
> > Is there a branch for 1.11.0? Please let me know.
> >
> > Cheers, Fokko
> >
> > On Fri, Oct 18, 2019 at 09:55, Gabor Szadovszky wrote:
> >
> > > Dear All,
> > >
> > > In the next couple of weeks I'll be working on the next release
> > > candidate of 1.11.0. If you have any ongoing issues that you think would
> > > be nice to have in 1.11.0, please set "Fix Version/s" accordingly. (If
> > > it is not really targeted to 1.11.0, please remove the related tag.)
> > > If you think 1.11.0 cannot be released without a specific fix, please
> > > set the mentioned tag accordingly and also add the jira as a blocker to
> > > PARQUET-1434 .
> > >
> > > Thanks a lot,
> > > Gabor
> > >
> >
>


[jira] [Commented] (PARQUET-1496) [Java] Update Scala for JDK 11 compatibility

2019-10-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954596#comment-16954596
 ] 

ASF GitHub Bot commented on PARQUET-1496:
-

xhochy commented on pull request #605: PARQUET-1496: Update Scala to 2.12
URL: https://github.com/apache/parquet-mr/pull/605
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Java] Update Scala for JDK 11 compatibility
> 
>
> Key: PARQUET-1496
> URL: https://issues.apache.org/jira/browse/PARQUET-1496
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
>
> When trying to build the parquet-mr code on OSX Mojave with OpenJDK 10 and 9, 
> the build fails for me in {{parquet-scala}} with:
> {code:java}
> [INFO] --- maven-scala-plugin:2.15.2:compile (default) @ parquet-scala_2.10 
> ---
> [INFO] Checking for multiple versions of scala
> [INFO] includes = [**/*.java,**/*.scala,]
> [INFO] excludes = []
> [INFO] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/src/main/scala:-1: 
> info: compiling
> [INFO] Compiling 1 source files to 
> /Users/uwe/tmp/apache-parquet-1.11.0/parquet-scala/target/classes at 
> 1547922718010
> [ERROR] error: error while loading package, Missing dependency 'object 
> java.lang.Object in compiler mirror', required by 
> /Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/package.class)
> [ERROR] error: error while loading package, Missing dependency 'object 
> java.lang.Object in compiler mirror', required by 
> /Users/uwe/.m2/repository/org/scala-lang/scala-library/2.10.6/scala-library-2.10.6.jar(scala/runtime/package.class)
> [ERROR] error: scala.reflect.internal.MissingRequirementError: object 
> java.lang.Object in compiler mirror not found.
> [ERROR] at 
> scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
> [ERROR] at 
> scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getClassByName(Mirrors.scala:99)
> [INFO] at 
> scala.reflect.internal.Mirrors$RootsBase.getRequiredClass(Mirrors.scala:102)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass$lzycompute(Definitions.scala:264)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.ObjectClass(Definitions.scala:264)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass$lzycompute(Definitions.scala:263)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.AnyRefClass(Definitions.scala:263)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.specialPolyClass(Definitions.scala:1120)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass$lzycompute(Definitions.scala:407)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.RepeatedParamClass(Definitions.scala:407)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1154)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1152)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1196)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode(Definitions.scala:1196)
> [INFO] at 
> scala.reflect.internal.Definitions$DefinitionsClass.init(Definitions.scala:1261)
> [INFO] at scala.tools.nsc.Global$Run.(Global.scala:1290)
> [INFO] at scala.tools.nsc.Driver.doCompile(Driver.scala:32)
> [INFO] at scala.tools.nsc.Main$.doCompile(Main.scala:79)
> [INFO] at scala.tools.nsc.Driver.process(Driver.scala:54)
> [INFO] at scala.tools.nsc.Driver.main(Driver.scala:67)
> [INFO] at scala.tools.nsc.Main.main(Main.scala)
> [INFO] at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> [INFO] at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodA

[jira] [Commented] (PARQUET-1496) [Java] Update Scala for JDK 11 compatibility

2019-10-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954565#comment-16954565
 ] 

ASF GitHub Bot commented on PARQUET-1496:
-

Fokko commented on pull request #693: PARQUET-1496: Update Scala to 2.12
URL: https://github.com/apache/parquet-mr/pull/693
 
 
   Make sure you have checked _all_ steps below. Updated the tests a bit as 
well. There were some conflicts between both `test.thrift` files. I've 
consolidated them to avoid issues.
   
   Updating to Scala 2.12 is required to support Java 11.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: Working on 1.11.0 RC7

2019-10-18 Thread Gabor Szadovszky
Hi Fokko,

There is no separate branch. Based on the discussion in yesterday's
parquet sync, 1.11.0 is planned to be released from master.

Cheers,
Gabor

On Fri, Oct 18, 2019 at 14:09, Driesprong, Fokko wrote:

> Thanks for doing the release Gabor,
>
> Is there a branch for 1.11.0? Please let me know.
>
> Cheers, Fokko
>
> On Fri, Oct 18, 2019 at 09:55, Gabor Szadovszky wrote:
>
> > Dear All,
> >
> > In the next couple of weeks I'll be working on the next release candidate
> > of 1.11.0. If you have any ongoing issues that you think would be nice to
> > have in 1.11.0, please set "Fix Version/s" accordingly. (If it is not
> > really targeted to 1.11.0, please remove the related tag.)
> > If you think 1.11.0 cannot be released without a specific fix, please set
> > the mentioned tag accordingly and also add the jira as a blocker to
> > PARQUET-1434 .
> >
> > Thanks a lot,
> > Gabor
> >
>


Re: Working on 1.11.0 RC7

2019-10-18 Thread Driesprong, Fokko
Thanks for doing the release Gabor,

Is there a branch for 1.11.0? Please let me know.

Cheers, Fokko

On Fri, Oct 18, 2019 at 09:55, Gabor Szadovszky wrote:

> Dear All,
>
> In the next couple of weeks I'll be working on the next release candidate
> of 1.11.0. If you have any ongoing issues that you think would be nice to
> have in 1.11.0, please set "Fix Version/s" accordingly. (If it is not
> really targeted to 1.11.0, please remove the related tag.)
> If you think 1.11.0 cannot be released without a specific fix, please set
> the mentioned tag accordingly and also add the jira as a blocker to
> PARQUET-1434 .
>
> Thanks a lot,
> Gabor
>


Re: custom CompressionCodec support

2019-10-18 Thread Driesprong, Fokko
Hi Falak,

I was able to set the compression level in Spark using
spark.io.compression.zstd.level.
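
For illustration, a minimal sketch of setting that key (the level value of 6 
is arbitrary, and SparkConf from spark-core is assumed):

import org.apache.spark.SparkConf;

public class ZstdLevelExample {
    public static void main(String[] args) {
        // Level 6 is illustrative; zstd levels range roughly from 1 to 22.
        SparkConf conf = new SparkConf()
            .set("spark.io.compression.zstd.level", "6");
        System.out.println(conf.get("spark.io.compression.zstd.level"));
    }
}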

Cheers, Fokko

On Thu, Oct 17, 2019 at 20:53, Radev, Martin wrote:

> Hi Falak,
>
>
> I was one of the people who recently exposed this to Arrow but this is not
> part of the Parquet specification.
>
> In particular, any implementation for writing parquet files can decide
> whether to expose this or select a reasonable value internally.
>
>
> If you're using Arrow, you would have to read the documentation of the
> specified compressor. Arrow doesn't check whether the specified
> compression level is within the range of what's supported by the codec. For
> ZSTD, the range should be [1, 22].
>
> Let me know if you're using Arrow and I can check locally that there isn't
> by any chance a bug with propagating the value. At the moment there are
> only smoke tests that nothing crashes.
>
>
> Regards,
>
> Martin
> --
> *From:* Falak Kansal 
> *Sent:* Thursday, October 17, 2019 4:43:54 PM
> *To:* Driesprong, Fokko
> *Cc:* dev@parquet.apache.org
> *Subject:* Re: custom CompressionCodec support
>
> Hi Fokko,
>
> Thanks for replying, yes sure.
The problem we are facing is that with parquet zstd we are not able to
control the compression level; we tried setting different compression
levels, but it doesn't make any difference in the size. We have verified
that *ZStandardCompressor* receives the same compression level we set in
the configuration file. Are we missing something? How can we set a
different zstd compression level? Help would be appreciated.
>
> Thanks
> Falak
>
> On Thu, Oct 17, 2019 at 7:47 PM Driesprong, Fokko 
> wrote:
>
> > Hi Manik,
> >
> > The supported compression codecs that ship with Parquet are tested and
> > validated in the CI pipeline. Sometimes there are issues with compressors,
> > therefore they are not easily pluggable. Feel free to open up a PR to the
> > project if you believe there are compressors missing, and then we can
> > have a discussion.
> >
> > It is part of the Thrift definition:
> >
> https://github.com/apache/parquet-format/blob/37bdba0a18cff18da706a0d353c65e726c8edca6/src/main/thrift/parquet.thrift#L470-L478
> >
> > Hope this clarifies the design decision.
> >
> > Cheers, Fokko
> >
> > On Tue, Oct 15, 2019 at 11:52, Manik Singla wrote:
> >
> >> Hi
> >>
> >> The current Java code is not open to using a custom compressor.
> >> I believe read/write is mostly done by the same team/company. In that
> >> case, it would be beneficial to add support so that a user can plug in a
> >> new compressor easily instead of making local changes, which will be
> >> prone to issues across version upgrades.
> >>
> >> Do you guys think it would be worth adding?
> >>
> >> Regards
> >> Manik Singla
> >> +91-9996008893
> >> +91-9665639677
> >>
> >> "Life doesn't consist in holding good cards but playing those you hold
> >> well."
> >>
> >
>


Re: Updating parquet web site

2019-10-18 Thread Gabor Szadovszky
Hi Uwe,

parquet-site sounds good to me.

Cheers,
Gabor

On Fri, Oct 18, 2019 at 10:19 AM Uwe L. Korn  wrote:

> Hello Gabor,
>
> can we call this for clarity  https://github.com/apache/parquet-site ?
>
> Thanks
> Uwe
>
> On Fri, Oct 18, 2019, at 9:46 AM, Gabor Szadovszky wrote:
> > Dear All,
> >
> > There is some content on our web site that has been ready for an update
> > (for a while). To spin up the process it would be great if we could follow the
> > same git PR process we already have for our existing git repos. Jim has
> > already created PARQUET-1675
> >  for moving the
> > existing svn repo to git.
> >
> > If there are no objections I will create an infra ticket to move the svn
> > repo https://svn.apache.org/repos/asf/parquet to the new git repository
> > https://github.com/apache/parquet.
> >
> > Regards,
> > Gabor
> >
>


Re: Updating parquet web site

2019-10-18 Thread Uwe L. Korn
Hello Gabor,

can we call this for clarity  https://github.com/apache/parquet-site ?

Thanks
Uwe

On Fri, Oct 18, 2019, at 9:46 AM, Gabor Szadovszky wrote:
> Dear All,
> 
> There is some content on our web site that has been ready for an update
> (for a while). To spin up the process it would be great if we could follow the
> same git PR process we already have for our existing git repos. Jim has
> already created PARQUET-1675
>  for moving the
> existing svn repo to git.
> 
> If there are no objections I will create an infra ticket to move the svn
> repo https://svn.apache.org/repos/asf/parquet to the new git repository
> https://github.com/apache/parquet.
> 
> Regards,
> Gabor
>


Working on 1.11.0 RC7

2019-10-18 Thread Gabor Szadovszky
Dear All,

In the next couple of weeks I'll be working on the next release candidate
of 1.11.0. If you have any ongoing issues that you think would be nice to
have in 1.11.0, please set "Fix Version/s" accordingly. (If it is not
really targeted to 1.11.0, please remove the related tag.)
If you think 1.11.0 cannot be released without a specific fix, please set
the mentioned tag accordingly and also add the jira as a blocker to
PARQUET-1434 .

Thanks a lot,
Gabor


[jira] [Resolved] (PARQUET-1570) Publish 1.11.0 to maven central

2019-10-18 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1570.
---
Resolution: Duplicate

Publishing artifacts to the Maven repo is part of the release process; it will 
be done when 1.11.0 is released.

> Publish 1.11.0 to maven central
> ---
>
> Key: PARQUET-1570
> URL: https://issues.apache.org/jira/browse/PARQUET-1570
> Project: Parquet
>  Issue Type: Task
>Affects Versions: 1.11.0
>Reporter: Devin Smith
>Priority: Major
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Updating parquet web site

2019-10-18 Thread Gabor Szadovszky
Dear All,

There is some content on our web site that has been ready for an update
(for a while). To spin up the process it would be great if we could follow the
same git PR process we already have for our existing git repos. Jim has
already created PARQUET-1675
 for moving the
existing svn repo to git.

If there are no objections I will create an infra ticket to move the svn
repo https://svn.apache.org/repos/asf/parquet to the new git repository
https://github.com/apache/parquet.

Regards,
Gabor


[jira] [Assigned] (PARQUET-1675) Switch to git for website

2019-10-18 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1675:
-

Assignee: Gabor Szadovszky

> Switch to git for website
> -
>
> Key: PARQUET-1675
> URL: https://issues.apache.org/jira/browse/PARQUET-1675
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Jim Apple
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Using git for the website, rather than SVN, would allow website changes to be 
> proposed by non-committers as pull requests and reviewed in GitHub. For more, 
> see:
>  
> [https://blogs.apache.org/infra/entry/git_based_websites_available]
> [https://cwiki.apache.org/confluence/display/INFRA/.asf.yaml+features+for+git+repositories]
> [https://www.apache.org/dev/project-site.html]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)