[jira] [Created] (PARQUET-1620) Schema creation from another schema will not be possible - deprecated

2019-07-10 Thread Werner Daehn (JIRA)
Werner Daehn created PARQUET-1620:
-

 Summary: Schema creation from another schema will not be possible 
- deprecated
 Key: PARQUET-1620
 URL: https://issues.apache.org/jira/browse/PARQUET-1620
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.11.0
Reporter: Werner Daehn


Imagine I have an existing schema and want to create a projection schema from 
it. One option is the org.apache.parquet.schema.Types.*Builder API, but the more 
direct way would be to clone the schema itself, just without its children.

{{List<Type> l = new ArrayList<>();}}
{{for (String c : childmappings.keySet()) {}}
{{    Mapping m = childmappings.get(c);}}
{{    l.add(m.getProjectionSchema());}}
{{}}}
{{GroupType gt = new GroupType(schema.getRepetition(), schema.getName(), schema.getOriginalType(), l);}}

 

The last line, the new GroupType(..) constructor, is deprecated. We should use 
the version with the LogicalTypeAnnotation instead. Fine. But how do you get the 
LogicalTypeAnnotation from an existing schema?

I feel you should not deprecate these methods, and if you do, please provide an 
extra method to create a Type column from an existing Type column (the column 
alone, without children; otherwise the projection would include all child columns).
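
For illustration, the closest non-deprecated route I can come up with - a sketch 
only, assuming that Type.getLogicalTypeAnnotation() exists and that the Types 
builder accepts a LogicalTypeAnnotation via as(); I have not verified either:

import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Types;

// "l" is the list of projected child types built in the loop above
Types.GroupBuilder<GroupType> builder = Types.buildGroup(schema.getRepetition());
if (schema.getLogicalTypeAnnotation() != null) {
  // carry over the logical type of the original group, if there is one
  builder = builder.as(schema.getLogicalTypeAnnotation());
}
GroupType gt = builder
    .addFields(l.toArray(new Type[0]))
    .named(schema.getName());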

 

Do you agree?

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1183) AvroParquetWriter needs OutputFile based Builder

2018-01-05 Thread Werner Daehn (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313260#comment-16313260
 ] 

Werner Daehn commented on PARQUET-1183:
---

https://github.com/apache/parquet-mr/pull/446

> AvroParquetWriter needs OutputFile based Builder
> 
>
> Key: PARQUET-1183
> URL: https://issues.apache.org/jira/browse/PARQUET-1183
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.9.1
>Reporter: Werner Daehn
> Fix For: 1.10.0
>
>
> The ParquetWriter got a new Builder(OutputFile). 
> But it cannot be used by the AvroParquetWriter as there is no matching 
> Builder/Constructor.
> Changes are quite simple:
> public static <T> Builder<T> builder(OutputFile file) {
>   return new Builder<T>(file);
> }
> and in the static Builder class below
> private Builder(OutputFile file) {
>   super(file);
> }
> Note: I am not good enough with builds, maven and git to create a pull 
> request yet. Sorry. Will try to get better here.
> See: https://issues.apache.org/jira/browse/PARQUET-1142



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (PARQUET-129) AvroParquetWriter can't save object who can have link to object of same type

2018-01-05 Thread Werner Daehn (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313105#comment-16313105
 ] 

Werner Daehn edited comment on PARQUET-129 at 1/5/18 1:36 PM:
--

Okay, I believe I handled the schema conversion properly with all potential 
side effects. 
Would you or somebody else help with adding the actual code for writing the data? 
That is way over my head at the moment...
https://github.com/apache/parquet-mr/pull/445 


(Had to deviate from the exact logic I described above. The approach is the same, 
but instead of bringing all reused tables to the root level, I leave them in the 
substructures to be backward compatible. And the root/child relationship is 
implicit in the ID column.)


was (Author: wdaehn):
Okay, I believe I handled the schema conversion properly with all potential 
side effects. 
Would you or somebody else help with adding the actual code for writing the data? 
That is way over my head at the moment...
https://github.com/apache/parquet-mr/pull/445 


(Had to deviate from the exact logic I described above. The approach is the same, 
but instead of bringing all reused tables to the root level, I leave them in the 
substructures to be backward compatible.)

> AvroParquetWriter can't save object who can have link to object of same type
> 
>
> Key: PARQUET-129
> URL: https://issues.apache.org/jira/browse/PARQUET-129
> Project: Parquet
>  Issue Type: Bug
> Environment: parquet version 1.5.0
>Reporter: Dmitriy
>
> When i try to write instance of UserTestOne created from following schema 
> {"namespace": "com.example.avro",
>  "type": "record",
>  "name": "UserTestOne",
>  "fields": [{"name": "name", "type": "string"},   {"name": "friend",  "type": 
> ["null", "UserTestOne"], "default":null} ]
> }
> I get java.lang.StackOverflowError 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (PARQUET-129) AvroParquetWriter can't save object who can have link to object of same type

2018-01-05 Thread Werner Daehn (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313105#comment-16313105
 ] 

Werner Daehn edited comment on PARQUET-129 at 1/5/18 1:35 PM:
--

Okay, I believe I handled the schema conversion properly with all potential 
side effects. 
Would you or somebody else help with adding the actual code for writing the data? 
That is way over my head at the moment...
https://github.com/apache/parquet-mr/pull/445 


(Had to deviate from the exact logic I described above. The approach is the same, 
but instead of bringing all reused tables to the root level, I leave them in the 
substructures to be backward compatible.)


was (Author: wdaehn):
Okay, I believe I handled the schema conversion properly with all potential 
side effects. 
Would you or somebody else help with adding the actual code for writing the data? 
That is way over my head at the moment...
https://github.com/apache/parquet-mr/pull/445 

> AvroParquetWriter can't save object who can have link to object of same type
> 
>
> Key: PARQUET-129
> URL: https://issues.apache.org/jira/browse/PARQUET-129
> Project: Parquet
>  Issue Type: Bug
> Environment: parquet version 1.5.0
>Reporter: Dmitriy
>
> When i try to write instance of UserTestOne created from following schema 
> {"namespace": "com.example.avro",
>  "type": "record",
>  "name": "UserTestOne",
>  "fields": [{"name": "name", "type": "string"},   {"name": "friend",  "type": 
> ["null", "UserTestOne"], "default":null} ]
> }
> I get java.lang.StackOverflowError 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PARQUET-129) AvroParquetWriter can't save object who can have link to object of same type

2018-01-05 Thread Werner Daehn (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313105#comment-16313105
 ] 

Werner Daehn commented on PARQUET-129:
--

Okay, I believe I handled the schema conversion properly with all potential 
side effects. 
Would you or somebody else help with adding the actual code for writing the data? 
That is way over my head at the moment...
https://github.com/apache/parquet-mr/pull/445 

> AvroParquetWriter can't save object who can have link to object of same type
> 
>
> Key: PARQUET-129
> URL: https://issues.apache.org/jira/browse/PARQUET-129
> Project: Parquet
>  Issue Type: Bug
> Environment: parquet version 1.5.0
>Reporter: Dmitriy
>
> When i try to write instance of UserTestOne created from following schema 
> {"namespace": "com.example.avro",
>  "type": "record",
>  "name": "UserTestOne",
>  "fields": [{"name": "name", "type": "string"},   {"name": "friend",  "type": 
> ["null", "UserTestOne"], "default":null} ]
> }
> I get java.lang.StackOverflowError 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PARQUET-129) AvroParquetWriter can't save object who can have link to object of same type

2018-01-04 Thread Werner Daehn (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16311854#comment-16311854
 ] 

Werner Daehn commented on PARQUET-129:
--

LogicalType... how is that used to "break the circular references"? I did not 
get that idea. An Avro logical type is just a synonym for a datatype - simple 
(like timestamp->long) or complex. But how is that related to circular 
references? Isn't that more for the flattening approach?

I will have a look into it, okay. I am just so spanking new to these internals, 
hence a bit scared. 

> AvroParquetWriter can't save object who can have link to object of same type
> 
>
> Key: PARQUET-129
> URL: https://issues.apache.org/jira/browse/PARQUET-129
> Project: Parquet
>  Issue Type: Bug
> Environment: parquet version 1.5.0
>Reporter: Dmitriy
>
> When i try to write instance of UserTestOne created from following schema 
> {"namespace": "com.example.avro",
>  "type": "record",
>  "name": "UserTestOne",
>  "fields": [{"name": "name", "type": "string"},   {"name": "friend",  "type": 
> ["null", "UserTestOne"], "default":null} ]
> }
> I get java.lang.StackOverflowError 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (PARQUET-129) AvroParquetWriter can't save object who can have link to object of same type

2018-01-04 Thread Werner Daehn (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16311732#comment-16311732
 ] 

Werner Daehn edited comment on PARQUET-129 at 1/4/18 5:48 PM:
--

Gee, I am hitting the same old issues. Feel like stalking Ryan ;-)

Kidding aside, I would like to reopen that. If there is valid Avro data and I 
want to persist it in Parquet, what should I do? Failing at the Parquet write 
is too late; it would need to fail already when the Avro message is created. 
Imagine you have a Kafka server and people put all kinds of Avro messages into 
it, and the long-term persistence shall be Parquet.
The solution with the max-depth has the character of a workaround. Yes, it 
helps for 90% of the cases, but the same argument as before applies.

Given my limited understanding of Parquet, I would redefine the problem as "How 
would you store that in a relational table?". Parquet wants to store all 
primitive fields of schemas and their nested schemas in single columns. Hence 
the problem redefinition makes sense, I hope. The answer would then be "By 
reusing the columns and adding a parent_key".

Example with above schema:

Input data shall be
{"Fritz", [Fritz-friend1: {"Walter", [Walter-friend1: {"Joe", null}, 
Walter-friend2: {"Jil", null}]]}

or rendered as a tree:
Fritz
..Walter
....Joe
....Jil

Converting that to a Parquet structure:
||id||name||friend||root||
|1|Fritz|2|true|
|2|Walter|3|false|
|3|Joe|null|false|
|4|Walter|5|false|
|5|Jil|null|false|

In other words:
# Whenever you find a record definition in the Avro Schema the first time, 
create it.
# If that record definition is being reused by name (without actual definition 
of the fields) its data goes into the already created structure.
# If such a schema is used more than once, add the id and root columns. The id 
can hopefully be a Parquet-internal pointer exposed as a number.
# In the nested column store the pointer instead of the actual record.

The nice thing about that is, it would delay the decision of the max-depth to 
the Spark query time. Such a table is easy to read with Spark, supports the 
full flexibility available in tree structures including unbalanced trees, 
recursions in the data, .. everything. And you do not need to change Spark or 
Parquet itself, it is just the logic within the AvroWriter.
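
If it helps to picture it, the flattened Parquet schema implied by the table 
above would look roughly like this - my own sketch using the Types builder, and 
the column types are just assumptions:

import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.*;
import static org.apache.parquet.schema.OriginalType.UTF8;

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Types;

MessageType flattened = Types.buildMessage()
    .required(INT64).named("id")          // row id, referenced by "friend"
    .required(BOOLEAN).named("root")      // true only for top-level records
    .required(BINARY).as(UTF8).named("name")
    .optional(INT64).named("friend")      // id of the friend row instead of a nested record
    .named("UserTestOne");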

How does that sound?




was (Author: wdaehn):
Gee, I am hitting the same old issues. Feel like stalking Ryan ;-)

Kidding aside, I would like to reopen that. If there is valid Avro data and I 
want to persist it in Parquet, what should I do? Failing at the Parquet write 
is too late; it would need to fail already when the Avro message is created. 
Imagine you have a Kafka server and people put all kinds of Avro messages into 
it, and the long-term persistence shall be Parquet.
The solution with the max-depth has the character of a workaround. Yes, it 
helps for 90% of the cases, but the same argument as before applies.

Given my limited understanding of Parquet, I would redefine the problem as "How 
would you store that in a relational table?". Parquet wants to store all 
primitive fields of schemas and their nested schemas in single columns. Hence 
the problem redefinition makes sense, I hope. The answer would then be "By 
reusing the columns and adding a parent_key".

Example with above schema:

Input data shall be
{"Fritz", [Fritz-friend1: {"Walter", [Walter-friend1: {"Joe", null}, 
Walter-friend2: {"Jil", null}]]}

or rendered as a tree:
Fritz
..Walter
....Joe
....Jil

Converting that to a Parquet structure:
||id||name||friend||root||
|1|Fritz|2|true|
|2|Walter|3|false|
|3|Joe|null|false|
|4|Walter|5|false|
|5|Jil|null|false|

In other words:
# Whenever you find a record definition in the Avro Schema the first time, 
create it.
# If that record definition is being reused by name (without actual definition 
of the fields) its data goes into the already created structure.
# If such a schema is used more than once, add the id and root columns. The id 
can hopefully be a Parquet-internal pointer exposed as a number.

The nice thing about that is, it would delay the decision of the max-depth to 
the Spark query time. Such a table is easy to read with Spark, supports the 
full flexibility available in tree structures including unbalanced trees, 
recursions in the data, .. everything. And you do not need to change Spark or 
Parquet itself, it is just the logic within the AvroWriter.

How does that sound?



> AvroParquetWriter can't save object who can have link to object of same type
> 
>
> Key: PARQUET-129
> URL: https://issues.apache.org/jira/browse/PARQUET-129
> Project: Parquet
>  Issue Type: Bug
> Environment: parquet version 1.5.0
>Reporter: Dmitriy
>
> When i try to write instance of UserTestOne 

[jira] [Comment Edited] (PARQUET-129) AvroParquetWriter can't save object who can have link to object of same type

2018-01-04 Thread Werner Daehn (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16311732#comment-16311732
 ] 

Werner Daehn edited comment on PARQUET-129 at 1/4/18 5:47 PM:
--

Gee, I am hitting the same old issues. Feel like stalking Ryan ;-)

Kidding aside, I would like to reopen that. If there is valid Avro data and I 
want to persist it in Parquet, what should I do? Failing at the Parquet write 
is too late; it would need to fail already when the Avro message is created. 
Imagine you have a Kafka server and people put all kinds of Avro messages into 
it, and the long-term persistence shall be Parquet.
The solution with the max-depth has the character of a workaround. Yes, it 
helps for 90% of the cases, but the same argument as before applies.

Given my limited understanding of Parquet, I would redefine the problem as "How 
would you store that in a relational table?". Parquet wants to store all 
primitive fields of schemas and their nested schemas in single columns. Hence 
the problem redefinition makes sense, I hope. The answer would then be "By 
reusing the columns and adding a parent_key".

Example with above schema:

Input data shall be
{"Fritz", [Fritz-friend1: {"Walter", [Walter-friend1: {"Joe", null}, 
Walter-friend2: {"Jil", null}]]}

or rendered as a tree:
Fritz
..Walter
....Joe
....Jil

Converting that to a Parquet structure:
||id||name||friend||root||
|1|Fritz|2|true|
|2|Walter|3|false|
|3|Joe|null|false|
|4|Walter|5|false|
|5|Jil|null|false|

In other words:
# Whenever you find a record definition in the Avro Schema the first time, 
create it.
# If that record definition is being reused by name (without actual definition 
of the fields) its data goes into the already created structure.
# If such a schema is used more than once, add the id and root columns. The id 
can hopefully be a Parquet-internal pointer exposed as a number.

The nice thing about that is, it would delay the decision of the max-depth to 
the Spark query time. Such a table is easy to read with Spark, supports the 
full flexibility available in tree structures including unbalanced trees, 
recursions in the data, .. everything. And you do not need to change Spark or 
Parquet itself, it is just the logic within the AvroWriter.

How does that sound?




was (Author: wdaehn):
Gee, I am hitting the same old issues. Feel like stalking Ryan ;-)

Kidding aside, I would like to reopen that. If there is valid Avro data and I 
want to persist it in Parquet, what should I do? Failing at the Parquet write 
is too late; it would need to fail already when the Avro message is created. 
Imagine you have a Kafka server and people put all kinds of Avro messages into 
it, and the long-term persistence shall be Parquet.
The solution with the max-depth has the character of a workaround. Yes, it 
helps for 90% of the cases, but the same argument as before applies.

Given my limited understanding of Parquet, I would redefine the problem as "How 
would you store that in a relational table?". Parquet wants to store all 
primitive fields of schemas and their nested schemas in single columns. Hence 
the problem redefinition makes sense, I hope. The answer would then be "By 
reusing the columns and adding a parent_key".

Example with above schema:

Input data shall be
{"Fritz", [Fritz-friend1: {"Walter", [Walter-friend1: {"Joe", null}, 
Walter-friend2: {"Jil", null}]]}

or rendered as a tree:
{{Fritz
..Walter
....Joe
....Jil}}

Converting that to a Parquet structure:
||id||name||friend||root||
|1|Fritz|2|true|
|2|Walter|3|false|
|3|Joe|null|false|
|4|Walter|5|false|
|5|Jil|null|false|

In other words:
# Whenever you find a record definition in the Avro Schema the first time, 
create it.
# If that record definition is being reused by name (without actual definition 
of the fields) its data goes into the already created structure.
# If such a schema is used more than once, add the id and root columns. The id 
can hopefully be a Parquet-internal pointer exposed as a number.

The nice thing about that is, it would delay the decision of the max-depth to 
the Spark query time. Such a table is easy to read with Spark, supports the 
full flexibility available in tree structures including unbalanced trees, 
recursions in the data, .. everything. And you do not need to change Spark or 
Parquet itself, it is just the logic within the AvroWriter.

How does that sound?



> AvroParquetWriter can't save object who can have link to object of same type
> 
>
> Key: PARQUET-129
> URL: https://issues.apache.org/jira/browse/PARQUET-129
> Project: Parquet
>  Issue Type: Bug
> Environment: parquet version 1.5.0
>Reporter: Dmitriy
>
> When i try to write instance of UserTestOne created from following schema 
> {"namespace": "com.example.avro",
> 

[jira] [Comment Edited] (PARQUET-129) AvroParquetWriter can't save object who can have link to object of same type

2018-01-04 Thread Werner Daehn (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16311732#comment-16311732
 ] 

Werner Daehn edited comment on PARQUET-129 at 1/4/18 5:46 PM:
--

Gee, I am hitting the same old issues. Feel like stalking Ryan ;-)

Kidding aside, I would like to reopen that. If there is valid Avro data and I 
want to persist it in Parquet, what should I do? Failing at the Parquet write 
is too late; it would need to fail already when the Avro message is created. 
Imagine you have a Kafka server and people put all kinds of Avro messages into 
it, and the long-term persistence shall be Parquet.
The solution with the max-depth has the character of a workaround. Yes, it 
helps for 90% of the cases, but the same argument as before applies.

Given my limited understanding of Parquet, I would redefine the problem as "How 
would you store that in a relational table?". Parquet wants to store all 
primitive fields of schemas and their nested schemas in single columns. Hence 
the problem redefinition makes sense, I hope. The answer would then be "By 
reusing the columns and adding a parent_key".

Example with above schema:

Input data shall be
{"Fritz", [Fritz-friend1: {"Walter", [Walter-friend1: {"Joe", null}, 
Walter-friend2: {"Jil", null}]]}

or rendered as a tree:
{{
Fritz
..Walter
....Joe
....Jil
}}

Converting that to a Parquet structure:
||id||name||friend||root||
|1|Fritz|2|true|
|2|Walter|3|false|
|3|Joe|null|false|
|4|Walter|5|false|
|5|Jil|null|false|

In other words:
# Whenever you find a record definition in the Avro Schema the first time, 
create it.
# If that record definition is being reused by name (without actual definition 
of the fields) its data goes into the already created structure.
# If such a schema is used more than once, add the id and root columns. The id 
can hopefully be a Parquet-internal pointer exposed as a number.

The nice thing about that is, it would delay the decision of the max-depth to 
the Spark query time. Such a table is easy to read with Spark, supports the 
full flexibility available in tree structures including unbalanced trees, 
recursions in the data, .. everything. And you do not need to change Spark or 
Parquet itself, it is just the logic within the AvroWriter.

How does that sound?




was (Author: wdaehn):
Gee, I am hitting the same old issues. Feel like stalking Ryan ;-)

Kidding aside, I would like to reopen that. If there is valid Avro data and I 
want to persist it in Parquet, what should I do? Failing at the Parquet write 
is too late; it would need to fail already when the Avro message is created. 
Imagine you have a Kafka server and people put all kinds of Avro messages into 
it, and the long-term persistence shall be Parquet.
The solution with the max-depth has the character of a workaround. Yes, it 
helps for 90% of the cases, but the same argument as before applies.

Given my limited understanding of Parquet, I would redefine the problem as "How 
would you store that in a relational table?". Parquet wants to store all 
primitive fields of schemas and their nested schemas in single columns. Hence 
the problem redefinition makes sense, I hope. The answer would then be "By 
reusing the columns and adding a parent_key".

Example with above schema:

Input data shall be
{"Fritz", [Fritz-friend1: {"Walter", [Walter-friend1: {"Joe", null}, 
Walter-friend2: {"Jil", null}]]}

or rendered as a tree:
{{
Fritz
  Walter
    Joe
    Jil
}}

Converting that to a Parquet structure:
||id||name||friend||root||
|1|Fritz|2|true|
|2|Walter|3|false|
|3|Joe|null|false|
|4|Walter|5|false|
|5|Jil|null|false|

In other words:
# Whenever you find a record definition in the Avro Schema the first time, 
create it.
# If that record definition is being reused by name (without actual definition 
of the fields) its data goes into the already created structure.
# If such a schema is used more than once, add the id and root columns. The id 
can hopefully be a Parquet-internal pointer exposed as a number.

The nice thing about that is, it would delay the decision of the max-depth to 
the Spark query time. Such a table is easy to read with Spark, supports the 
full flexibility available in tree structures including unbalanced trees, 
recursions in the data, .. everything. And you do not need to change Spark or 
Parquet itself, it is just the logic within the AvroWriter.

How does that sound?



> AvroParquetWriter can't save object who can have link to object of same type
> 
>
> Key: PARQUET-129
> URL: https://issues.apache.org/jira/browse/PARQUET-129
> Project: Parquet
>  Issue Type: Bug
> Environment: parquet version 1.5.0
>Reporter: Dmitriy
>
> When i try to write instance of UserTestOne created from following schema 
> {"namespace": 

[jira] [Comment Edited] (PARQUET-129) AvroParquetWriter can't save object who can have link to object of same type

2018-01-04 Thread Werner Daehn (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16311732#comment-16311732
 ] 

Werner Daehn edited comment on PARQUET-129 at 1/4/18 5:47 PM:
--

Gee, I am hitting the same old issues. Feel like stalking Ryan ;-)

Kidding aside, I would like to reopen that. If there is valid Avro data and I 
want to persist it in Parquet, what should I do? Failing at the Parquet write 
is too late; it would need to fail already when the Avro message is created. 
Imagine you have a Kafka server and people put all kinds of Avro messages into 
it, and the long-term persistence shall be Parquet.
The solution with the max-depth has the character of a workaround. Yes, it 
helps for 90% of the cases, but the same argument as before applies.

Given my limited understanding of Parquet, I would redefine the problem as "How 
would you store that in a relational table?". Parquet wants to store all 
primitive fields of schemas and their nested schemas in single columns. Hence 
the problem redefinition makes sense, I hope. The answer would then be "By 
reusing the columns and adding a parent_key".

Example with above schema:

Input data shall be
{"Fritz", [Fritz-friend1: {"Walter", [Walter-friend1: {"Joe", null}, 
Walter-friend2: {"Jil", null}]]}

or rendered as a tree:
{{Fritz
..Walter
....Joe
....Jil}}

Converting that to a Parquet structure:
||id||name||friend||root||
|1|Fritz|2|true|
|2|Walter|3|false|
|3|Joe|null|false|
|4|Walter|5|false|
|5|Jil|null|false|

In other words:
# Whenever you find a record definition in the Avro Schema the first time, 
create it.
# If that record definition is being reused by name (without actual definition 
of the fields) its data goes into the already created structure.
# If such a schema is used more than once, add the id and root columns. The id 
can hopefully be a Parquet-internal pointer exposed as a number.

The nice thing about that is, it would delay the decision of the max-depth to 
the Spark query time. Such a table is easy to read with Spark, supports the 
full flexibility available in tree structures including unbalanced trees, 
recursions in the data, .. everything. And you do not need to change Spark or 
Parquet itself, it is just the logic within the AvroWriter.

How does that sound?




was (Author: wdaehn):
Gee, I am hitting the same old issues. Feel like stalking Ryan ;-)

Kidding aside, I would like to reopen that. If there is valid Avro data and I 
want to persist it in Parquet, what should I do? Failing at the Parquet write 
is too late; it would need to fail already when the Avro message is created. 
Imagine you have a Kafka server and people put all kinds of Avro messages into 
it, and the long-term persistence shall be Parquet.
The solution with the max-depth has the character of a workaround. Yes, it 
helps for 90% of the cases, but the same argument as before applies.

Given my limited understanding of Parquet, I would redefine the problem as "How 
would you store that in a relational table?". Parquet wants to store all 
primitive fields of schemas and their nested schemas in single columns. Hence 
the problem redefinition makes sense, I hope. The answer would then be "By 
reusing the columns and adding a parent_key".

Example with above schema:

Input data shall be
{"Fritz", [Fritz-friend1: {"Walter", [Walter-friend1: {"Joe", null}, 
Walter-friend2: {"Jil", null}]]}

or rendered as a tree:
{{
Fritz
..Walter
....Joe
....Jil
}}

Converting that to a Parquet structure:
||id||name||friend||root||
|1|Fritz|2|true|
|2|Walter|3|false|
|3|Joe|null|false|
|4|Walter|5|false|
|5|Jil|null|false|

In other words:
# Whenever you find a record definition in the Avro Schema the first time, 
create it.
# If that record definition is being reused by name (without actual definition 
of the fields) its data goes into the already created structure.
# If such a schema is used more than once, add the id and root columns. The id 
can hopefully be a Parquet-internal pointer exposed as a number.

The nice thing about that is, it would delay the decision of the max-depth to 
the Spark query time. Such a table is easy to read with Spark, supports the 
full flexibility available in tree structures including unbalanced trees, 
recursions in the data, .. everything. And you do not need to change Spark or 
Parquet itself, it is just the logic within the AvroWriter.

How does that sound?



> AvroParquetWriter can't save object who can have link to object of same type
> 
>
> Key: PARQUET-129
> URL: https://issues.apache.org/jira/browse/PARQUET-129
> Project: Parquet
>  Issue Type: Bug
> Environment: parquet version 1.5.0
>Reporter: Dmitriy
>
> When i try to write instance of UserTestOne created from following schema 
> {"namespace": 

[jira] [Comment Edited] (PARQUET-129) AvroParquetWriter can't save object who can have link to object of same type

2018-01-04 Thread Werner Daehn (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16311732#comment-16311732
 ] 

Werner Daehn edited comment on PARQUET-129 at 1/4/18 5:46 PM:
--

Gee, I am hitting the same old issues. Feel like stalking Ryan ;-)

Kidding aside, I would like to reopen that. If there is valid Avro data and I 
want to persist it in Parquet, what should I do? Failing at the Parquet write 
is too late; it would need to fail already when the Avro message is created. 
Imagine you have a Kafka server and people put all kinds of Avro messages into 
it, and the long-term persistence shall be Parquet.
The solution with the max-depth has the character of a workaround. Yes, it 
helps for 90% of the cases, but the same argument as before applies.

Given my limited understanding of Parquet, I would redefine the problem as "How 
would you store that in a relational table?". Parquet wants to store all 
primitive fields of schemas and their nested schemas in single columns. Hence 
the problem redefinition makes sense, I hope. The answer would then be "By 
reusing the columns and adding a parent_key".

Example with above schema:

Input data shall be
{"Fritz", [Fritz-friend1: {"Walter", [Walter-friend1: {"Joe", null}, 
Walter-friend2: {"Jil", null}]]}

or rendered as a tree:
{{
Fritz
  Walter
    Joe
    Jil
}}

Converting that to a Parquet structure:
||id||name||friend||root||
|1|Fritz|2|true|
|2|Walter|3|false|
|3|Joe|null|false|
|4|Walter|5|false|
|5|Jil|null|false|

In other words:
# Whenever you find a record definition in the Avro Schema the first time, 
create it.
# If that record definition is being reused by name (without actual definition 
of the fields) its data goes into the already created structure.
# If such a schema is used more than once, add the id and root columns. The id 
can hopefully be a Parquet-internal pointer exposed as a number.

The nice thing about that is, it would delay the decision of the max-depth to 
the Spark query time. Such a table is easy to read with Spark, supports the 
full flexibility available in tree structures including unbalanced trees, 
recursions in the data, .. everything. And you do not need to change Spark or 
Parquet itself, it is just the logic within the AvroWriter.

How does that sound?




was (Author: wdaehn):
Gee, I am hitting the same old issues. Feel like stalking Ryan ;-)

Kidding aside, I would like to reopen that. If there is valid Avro data and I 
want to persist it in Parquet, what should I do? Failing at the Parquet write 
is too late; it would need to fail already when the Avro message is created. 
Imagine you have a Kafka server and people put all kinds of Avro messages into 
it, and the long-term persistence shall be Parquet.
The solution with the max-depth has the character of a workaround. Yes, it 
helps for 90% of the cases, but the same argument as before applies.

Given my limited understanding of Parquet, I would redefine the problem as "How 
would you store that in a relational table?". Parquet wants to store all 
primitive fields of schemas and their nested schemas in single columns. Hence 
the problem redefinition makes sense, I hope. The answer would then be "By 
reusing the columns and adding a parent_key".

Example with above schema:

Input data shall be
{"Fritz", [Fritz-friend1: {"Walter", [Walter-friend1: {"Joe", null}, 
Walter-friend2: {"Jil", null}]]}

or rendered as a tree:
{{
Fritz
+--Walter
+--+--Joe
+--+--Jil
}}

Converting that to a Parquet structure:
||id||name||friend||root||
|1|Fritz|2|true|
|2|Walter|3|false|
|3|Joe|null|false|
|4|Walter|5|false|
|5|Jil|null|false|

In other words:
# Whenever you find a record definition in the Avro Schema the first time, 
create it.
# If that record definition is being reused by name (without actual definition 
of the fields) its data goes into the already created structure.
# If such a schema is used more than once, add the id and root columns. The id 
can hopefully be a Parquet-internal pointer exposed as a number.

The nice thing about that is, it would delay the decision of the max-depth to 
the Spark query time. Such a table is easy to read with Spark, supports the 
full flexibility available in tree structures including unbalanced trees, 
recursions in the data, .. everything. And you do not need to change Spark or 
Parquet itself, it is just the logic within the AvroWriter.

How does that sound?



> AvroParquetWriter can't save object who can have link to object of same type
> 
>
> Key: PARQUET-129
> URL: https://issues.apache.org/jira/browse/PARQUET-129
> Project: Parquet
>  Issue Type: Bug
> Environment: parquet version 1.5.0
>Reporter: Dmitriy
>
> When i try to write instance of UserTestOne created from following schema 
> {"namespace": 

[jira] [Commented] (PARQUET-129) AvroParquetWriter can't save object who can have link to object of same type

2018-01-04 Thread Werner Daehn (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16311732#comment-16311732
 ] 

Werner Daehn commented on PARQUET-129:
--

Gee, I am hitting the same old issues. Feel like stalking Ryan ;-)

Kidding aside, I would like to reopen that. If there is valid Avro data and I 
want to persist it in Parquet, what should I do? Failing at the Parquet write 
is too late; it would need to fail already when the Avro message is created. 
Imagine you have a Kafka server and people put all kinds of Avro messages into 
it, and the long-term persistence shall be Parquet.
The solution with the max-depth has the character of a workaround. Yes, it 
helps for 90% of the cases, but the same argument as before applies.

Given my limited understanding of Parquet, I would redefine the problem as "How 
would you store that in a relational table?". Parquet wants to store all 
primitive fields of schemas and their nested schemas in single columns. Hence 
the problem redefinition makes sense, I hope. The answer would then be "By 
reusing the columns and adding a parent_key".

Example with above schema:

Input data shall be
{"Fritz", [Fritz-friend1: {"Walter", [Walter-friend1: {"Joe", null}, 
Walter-friend2: {"Jil", null}]]}

or rendered as a tree:
{{
Fritz
+--Walter
+--+--Joe
+--+--Jil
}}

Converting that to a Parquet structure:
||id||name||friend||root||
|1|Fritz|2|true|
|2|Walter|3|false|
|3|Joe|null|false|
|4|Walter|5|false|
|5|Jil|null|false|

In other words:
# Whenever you find a record definition in the Avro Schema the first time, 
create it.
# If that record definition is being reused by name (without actual definition 
of the fields) its data goes into the already created structure.
# If such a schema is used more than once, add the id and root columns. The id 
can hopefully be a Parquet-internal pointer exposed as a number.

The nice thing about that is, it would delay the decision of the max-depth to 
the Spark query time. Such a table is easy to read with Spark, supports the 
full flexibility available in tree structures including unbalanced trees, 
recursions in the data, .. everything. And you do not need to change Spark or 
Parquet itself, it is just the logic within the AvroWriter.

How does that sound?



> AvroParquetWriter can't save object who can have link to object of same type
> 
>
> Key: PARQUET-129
> URL: https://issues.apache.org/jira/browse/PARQUET-129
> Project: Parquet
>  Issue Type: Bug
> Environment: parquet version 1.5.0
>Reporter: Dmitriy
>
> When i try to write instance of UserTestOne created from following schema 
> {"namespace": "com.example.avro",
>  "type": "record",
>  "name": "UserTestOne",
>  "fields": [{"name": "name", "type": "string"},   {"name": "friend",  "type": 
> ["null", "UserTestOne"], "default":null} ]
> }
> I get java.lang.StackOverflowError 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (PARQUET-1184) Make DelegatingPositionOutputStream a concrete class

2018-01-04 Thread Werner Daehn (JIRA)
Werner Daehn created PARQUET-1184:
-

 Summary: Make DelegatingPositionOutputStream a concrete class
 Key: PARQUET-1184
 URL: https://issues.apache.org/jira/browse/PARQUET-1184
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-avro
Affects Versions: 1.9.1
Reporter: Werner Daehn
 Fix For: 1.10.0


I fail to understand why this is an abstract class. In my example I want to write 
the Parquet file to a java.io.FileOutputStream, hence I have to extend 
DelegatingPositionOutputStream, keep track of the position myself, increase it in 
all write(..) methods and return its value in getPos().

Doable, of course, but is it useful? Previously yes, but now, with the OutputFile 
changes that decouple Parquet further from Hadoop, I believe not.
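
For illustration, this is roughly the wrapper I end up writing today - a sketch, 
and the class name is mine:

import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.parquet.io.DelegatingPositionOutputStream;

public class CountingPositionOutputStream extends DelegatingPositionOutputStream {
  private long pos = 0;

  public CountingPositionOutputStream(FileOutputStream out) {
    super(out);
  }

  @Override
  public long getPos() {
    return pos;
  }

  @Override
  public void write(int b) throws IOException {
    super.write(b);
    pos++;
  }

  @Override
  public void write(byte[] b) throws IOException {
    super.write(b);
    pos += b.length;
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    super.write(b, off, len);
    pos += len;
  }
}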

related to: https://issues.apache.org/jira/browse/PARQUET-1142



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (PARQUET-1183) AvroParquetWriter needs OutputFile based Builder

2018-01-04 Thread Werner Daehn (JIRA)
Werner Daehn created PARQUET-1183:
-

 Summary: AvroParquetWriter needs OutputFile based Builder
 Key: PARQUET-1183
 URL: https://issues.apache.org/jira/browse/PARQUET-1183
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-avro
Affects Versions: 1.9.1
Reporter: Werner Daehn
 Fix For: 1.10.0


The ParquetWriter got a new Builder(OutputFile). 
But it cannot be used by the AvroParquetWriter as there is no matching 
Builder/Constructor.

Changes are quite simple:

public static <T> Builder<T> builder(OutputFile file) {
  return new Builder<T>(file);
}

and in the static Builder class below

private Builder(OutputFile file) {
  super(file);
}
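
With that in place, the usage I am after would look roughly like this - a sketch 
of how I would expect to call the proposed builder; HadoopOutputFile from 
PARQUET-1142 is just one possible OutputFile implementation:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.util.HadoopOutputFile;
import org.apache.parquet.io.OutputFile;

// a small example record to write
Schema schema = SchemaBuilder.record("User").fields().requiredString("name").endRecord();
GenericRecord user = new GenericData.Record(schema);
user.put("name", "Fritz");

OutputFile out = HadoopOutputFile.fromPath(new Path("file:///tmp/users.parquet"), new Configuration());
try (ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(out)
    .withSchema(schema)
    .build()) {
  writer.write(user);
}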

Note: I am not good enough with builds, maven and git to create a pull request 
yet. Sorry. Will try to get better here.

See: https://issues.apache.org/jira/browse/PARQUET-1142



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (PARQUET-1142) Avoid leaking Hadoop API to downstream libraries

2018-01-02 Thread Werner Daehn (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308295#comment-16308295
 ] 

Werner Daehn edited comment on PARQUET-1142 at 1/2/18 4:24 PM:
---

Thanks Ryan, I believe this is one of the most important areas where Parquet 
can get better.

Do you happen to have some sample code on how to use that? My current approach 
is to extend ParquetWriter including the inner Builder class extending 
ParquetWriter.Builder.
Where I am stuck is the Builder.WriteSupport with its Hadoop Configuration 
parameter. It has to be implemented but you talk about ParquetReadOptions?
And my class needs a constructor but all visible super()'s are Hadoop based. 
Any hints would be greatly appreciated.

Background:

In my world there is no such thing as Hadoop but many different kinds of file 
storage options, HDFS being one of those. Therefore I would be in favor of 
removing all Hadoop related stuff from Parquet and rather have a 
java.io.OutputStream (=org.apache.parquet.io.OutputFile) which then can be used 
to store files in different locations, HDFS full client, HDFS web client, S3, 
local file system.

My ultimate goal for the moment is to have an AvroParquetWriter without Hadoop 
to save to a local file system in a streaming way. This would then be used to 
consume the Kafka data of the last 24 hours and write one large Parquet file.
Next step would be a matching reader that does pushdown simple filter criteria. 
Best would be one that allows even primary key lookups in a reasonable amount 
of time.
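
To make the background concrete, this is roughly the kind of Hadoop-free 
OutputFile I have in mind - only a sketch, the class is mine, and a real 
create() should refuse to overwrite an existing file:

import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;

public class LocalOutputFile implements OutputFile {
  private final String path;

  public LocalOutputFile(String path) {
    this.path = path;
  }

  @Override
  public PositionOutputStream create(long blockSizeHint) throws IOException {
    final FileOutputStream out = new FileOutputStream(path);
    return new PositionOutputStream() {
      private long pos = 0;

      @Override
      public long getPos() {
        return pos;
      }

      @Override
      public void write(int b) throws IOException {
        out.write(b);
        pos++;
      }

      @Override
      public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        pos += len;
      }

      @Override
      public void close() throws IOException {
        out.close();
      }
    };
  }

  @Override
  public PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException {
    return create(blockSizeHint);
  }

  @Override
  public boolean supportsBlockSize() {
    return false;
  }

  @Override
  public long defaultBlockSize() {
    return 0;
  }
}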


was (Author: wdaehn):
Thanks Ryan, I believe this is one of the most important areas where Parquet 
can get better.

Do you happen to have some sample code on how to use that? My current approach 
is to extend ParquetWriter including an inner Builder class extending 
ParquetWriter.Builder.
Where I am stuck is the Builder.WriteSupport with its Hadoop Configuration 
parameter. It has to be implemented but you talk about ParquetReadOptions?
And my class needs a constructor but all visible super()'s are Hadoop based. 
Any hints would be greatly appreciated.

Background:

In my world there is no such thing as Hadoop but many different kinds of file 
storage options, HDFS being one of those. Therefore I would be in favor of 
removing all Hadoop related stuff from Parquet and rather have a 
java.io.OutputStream (=org.apache.parquet.io.OutputFile) which then can be used 
to store files in different locations, HDFS full client, HDFS web client, S3, 
local file system.

My ultimate goal for the moment is to have an AvroParquetWriter without Hadoop 
to save to a local file system in a streaming way. This would then be used to 
consume the Kafka data of the last 24 hours and write one large Parquet file.
Next step would be a matching reader that does pushdown simple filter criteria. 
Best would be one that allows even primary key lookups in a reasonable amount 
of time.

> Avoid leaking Hadoop API to downstream libraries
> 
>
> Key: PARQUET-1142
> URL: https://issues.apache.org/jira/browse/PARQUET-1142
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 1.10.0
>
>
> Parquet currently leaks the Hadoop API by requiring callers to pass {{Path}} 
> and {{Configuration}} instances, and by using Hadoop codecs. {{InputFile}} 
> and {{SeekableInputStream}} add alternatives to Hadoop classes in some parts 
> of the read path, but this needs to be extended to the write path and to 
> avoid passing options through {{Configuration}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PARQUET-1142) Avoid leaking Hadoop API to downstream libraries

2018-01-02 Thread Werner Daehn (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308295#comment-16308295
 ] 

Werner Daehn commented on PARQUET-1142:
---

Thanks Ryan, I believe this is one of the most important areas where Parquet 
can get better.

Do you happen to have some sample code on how to use that? My current approach 
is to extend ParquetWriter including an inner Builder class extending 
ParquetWriter.Builder.
Where I am stuck is the Builder.WriteSupport with its Hadoop Configuration 
parameter. It has to be implemented but you talk about ParquetReadOptions?
And my class needs a constructor but all visible super()'s are Hadoop based. 
Any hints would be greatly appreciated.

Background:

In my world there is no such thing as Hadoop but many different kinds of file 
storage options, HDFS being one of those. Therefore I would be in favor of 
removing all Hadoop related stuff from Parquet and rather have a 
java.io.OutputStream (=org.apache.parquet.io.OutputFile) which then can be used 
to store files in different locations, HDFS full client, HDFS web client, S3, 
local file system.

My ultimate goal for the moment is to have an AvroParquetWriter without Hadoop 
to save to a local file system in a streaming way. This would then be used to 
consume the Kafka data of the last 24 hours and write one large Parquet file.
Next step would be a matching reader that does pushdown simple filter criteria. 
Best would be one that allows even primary key lookups in a reasonable amount 
of time.

> Avoid leaking Hadoop API to downstream libraries
> 
>
> Key: PARQUET-1142
> URL: https://issues.apache.org/jira/browse/PARQUET-1142
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 1.10.0
>
>
> Parquet currently leaks the Hadoop API by requiring callers to pass {{Path}} 
> and {{Configuration}} instances, and by using Hadoop codecs. {{InputFile}} 
> and {{SeekableInputStream}} add alternatives to Hadoop classes in some parts 
> of the read path, but this needs to be extended to the write path and to 
> avoid passing options through {{Configuration}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)