[jira] [Created] (PARQUET-1620) Schema creation from another schema will not be possible - deprecated
Werner Daehn created PARQUET-1620:
-
Summary: Schema creation from another schema will not be possible - deprecated
Key: PARQUET-1620
URL: https://issues.apache.org/jira/browse/PARQUET-1620
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Affects Versions: 1.11.0
Reporter: Werner Daehn

Imagine I have a current schema and want to create a projection schema from it. One option is the schema.Types.*Builder classes, but the more direct route would be to clone the schema itself without its children:

{{List<Type> l = new ArrayList<>();}}
{{for (String c : childmappings.keySet()) {}}
{{  Mapping m = childmappings.get(c);}}
{{  l.add(m.getProjectionSchema());}}
{{}}}
{{GroupType gt = new GroupType(schema.getRepetition(), schema.getName(), schema.getOriginalType(), l);}}

The last line, the new GroupType(..) constructor, is deprecated; we should use the version with the LogicalTypeAnnotation instead. Fine. But how do you get the LogicalTypeAnnotation from an existing schema? I feel you should not deprecate these methods, and if you do, provide an extra method to create a Type column from an existing Type column (the column alone, without its children; otherwise the projection would contain all child columns). Do you agree?

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
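For reference, a sketch of the same projection under the non-deprecated builder API, assuming parquet-mr 1.11.0 where {{Type.getLogicalTypeAnnotation()}} is available (it returns null when no annotation is set); the helper name {{project}} is invented, and whether this fully replaces the deprecated constructor is exactly what the ticket asks:

{code:java}
import java.util.List;

import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Types;

// Rebuild a group type with the same name, repetition, and logical type
// annotation, but a caller-supplied child list (e.g. a projection).
static GroupType project(GroupType schema, List<Type> projectedChildren) {
    Types.GroupBuilder<GroupType> builder = Types.buildGroup(schema.getRepetition());
    LogicalTypeAnnotation annotation = schema.getLogicalTypeAnnotation();
    if (annotation != null) {
        builder = builder.as(annotation);  // carry the logical type over from the source
    }
    return builder.addFields(projectedChildren.toArray(new Type[0]))
                  .named(schema.getName());
}
{code}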
[jira] [Commented] (PARQUET-1183) AvroParquetWriter needs OutputFile based Builder
[ https://issues.apache.org/jira/browse/PARQUET-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313260#comment-16313260 ] Werner Daehn commented on PARQUET-1183:
---
https://github.com/apache/parquet-mr/pull/446

> AvroParquetWriter needs OutputFile based Builder
>
> Key: PARQUET-1183
> URL: https://issues.apache.org/jira/browse/PARQUET-1183
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-avro
> Affects Versions: 1.9.1
> Reporter: Werner Daehn
> Fix For: 1.10.0
>
> The ParquetWriter got a new Builder(OutputFile), but it cannot be used by the AvroParquetWriter as there is no matching Builder/Constructor. Changes are quite simple:
>
> public static Builder builder(OutputFile file) {
>   return new Builder(file);
> }
>
> and in the static Builder class below:
>
> private Builder(OutputFile file) {
>   super(file);
> }
>
> Note: I am not good enough with builds, maven and git to create a pull request yet. Sorry. Will try to get better here.
> See: https://issues.apache.org/jira/browse/PARQUET-1142

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (PARQUET-129) AvroParquetWriter can't save object who can have link to object of same type
[ https://issues.apache.org/jira/browse/PARQUET-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313105#comment-16313105 ] Werner Daehn edited comment on PARQUET-129 at 1/5/18 1:36 PM:
--
Okay, I believe I handled the schema conversion properly, with all potential side effects. Would you or somebody else help add the actual code for writing the data? That is way over my head at the moment... https://github.com/apache/parquet-mr/pull/445

(I had to deviate from the exact logic I described above. The approach is the same, but instead of bringing all reused tables to the root level, I leave them in the sub-structures to stay backward compatible. And the root/child relationship is implicit in the ID column.)

was (Author: wdaehn):
Okay, I believe I handled the schema conversion properly, with all potential side effects. Would you or somebody else help add the actual code for writing the data? That is way over my head at the moment... https://github.com/apache/parquet-mr/pull/445

(I had to deviate from the exact logic I described above. The approach is the same, but instead of bringing all reused tables to the root level, I leave them in the sub-structures to stay backward compatible.)

> AvroParquetWriter can't save object who can have link to object of same type
>
> Key: PARQUET-129
> URL: https://issues.apache.org/jira/browse/PARQUET-129
> Project: Parquet
> Issue Type: Bug
> Environment: parquet version 1.5.0
> Reporter: Dmitriy
>
> When i try to write instance of UserTestOne created from following schema
> {"namespace": "com.example.avro",
> "type": "record",
> "name": "UserTestOne",
> "fields": [{"name": "name", "type": "string"}, {"name": "friend", "type": ["null", "UserTestOne"], "default":null} ]
> }
> I get java.lang.StackOverflowError

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PARQUET-129) AvroParquetWriter can't save object who can have link to object of same type
[ https://issues.apache.org/jira/browse/PARQUET-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313105#comment-16313105 ] Werner Daehn commented on PARQUET-129:
--
Okay, I believe I handled the schema conversion properly, with all potential side effects. Would you or somebody else help add the actual code for writing the data? That is way over my head at the moment... https://github.com/apache/parquet-mr/pull/445

> AvroParquetWriter can't save object who can have link to object of same type
>
> Key: PARQUET-129
> URL: https://issues.apache.org/jira/browse/PARQUET-129
> Project: Parquet
> Issue Type: Bug
> Environment: parquet version 1.5.0
> Reporter: Dmitriy
>
> When i try to write instance of UserTestOne created from following schema
> {"namespace": "com.example.avro",
> "type": "record",
> "name": "UserTestOne",
> "fields": [{"name": "name", "type": "string"}, {"name": "friend", "type": ["null", "UserTestOne"], "default":null} ]
> }
> I get java.lang.StackOverflowError

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PARQUET-129) AvroParquetWriter can't save object who can have link to object of same type
[ https://issues.apache.org/jira/browse/PARQUET-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16311854#comment-16311854 ] Werner Daehn commented on PARQUET-129:
--
LogicalType... how is that used to "break the circular references"? I did not get that idea. An Avro logical type is just a synonym for a datatype, simple (like timestamp -> long) or complex. But how is that related to circular references? Isn't that more for the flattening approach?

I will have a look into it, okay. I am just so spanking new to these internals, hence scared a bit.

> AvroParquetWriter can't save object who can have link to object of same type
>
> Key: PARQUET-129
> URL: https://issues.apache.org/jira/browse/PARQUET-129
> Project: Parquet
> Issue Type: Bug
> Environment: parquet version 1.5.0
> Reporter: Dmitriy
>
> When i try to write instance of UserTestOne created from following schema
> {"namespace": "com.example.avro",
> "type": "record",
> "name": "UserTestOne",
> "fields": [{"name": "name", "type": "string"}, {"name": "friend", "type": ["null", "UserTestOne"], "default":null} ]
> }
> I get java.lang.StackOverflowError

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (PARQUET-129) AvroParquetWriter can't save object who can have link to object of same type
[ https://issues.apache.org/jira/browse/PARQUET-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16311732#comment-16311732 ] Werner Daehn edited comment on PARQUET-129 at 1/4/18 5:48 PM:
--
Gee, I am hitting the same old issues. Feels like I am stalking Ryan ;-)

Kidding aside, I would like to reopen this. If there is valid Avro data and I want to persist it in Parquet, what should I do? Failing at the Parquet write is too late; it would need to fail already when the Avro message is created. Imagine you have a Kafka server and people put all kinds of Avro messages into it, with Parquet as the long-term persistence. The solution with the max-depth has the character of a workaround. Yes, it helps for 90% of the cases, but the same argument as before applies.

Given my limited understanding of Parquet, I would redefine the problem as "How would you store that in a relational table?". Parquet wants to store all primitive fields of schemas and their nested schemas in single columns, so I hope the problem redefinition makes sense. The answer would then be "By reusing the columns and adding a parent key".

Example with the above schema: the input data shall be {"Fritz", [Fritz-friend1: {"Walter", [Walter-friend1: {"Joe", null}, Walter-friend2: {"Jil", null}]}]} or, rendered as a tree:

Fritz
+--Walter
+--+--Joe
+--+--Jil

Converting that to a Parquet structure:

||id||name||friend||root||
|1|Fritz|2|true|
|2|Walter|3|false|
|3|Joe|null|false|
|4|Walter|5|false|
|5|Jil|null|false|

In other words:
# Whenever you find a record definition in the Avro schema for the first time, create it.
# If that record definition is reused by name (without an actual definition of the fields), its data goes into the already created structure.
# If such a schema is used more than once, add the id and root columns. The id can hopefully be a Parquet-internal pointer exposed as a number.
# In the nested column, store the pointer instead of the actual record.

The nice thing about this is that it delays the max-depth decision to Spark query time. Such a table is easy to read with Spark and supports the full flexibility available in tree structures, including unbalanced trees, recursions in the data, ... everything. And you do not need to change Spark or Parquet itself; it is just logic within the AvroWriter. How does that sound?

was (Author: wdaehn):
Gee, I am hitting the same old issues. Feels like I am stalking Ryan ;-)

Kidding aside, I would like to reopen this. If there is valid Avro data and I want to persist it in Parquet, what should I do? Failing at the Parquet write is too late; it would need to fail already when the Avro message is created. Imagine you have a Kafka server and people put all kinds of Avro messages into it, with Parquet as the long-term persistence. The solution with the max-depth has the character of a workaround. Yes, it helps for 90% of the cases, but the same argument as before applies.

Given my limited understanding of Parquet, I would redefine the problem as "How would you store that in a relational table?". Parquet wants to store all primitive fields of schemas and their nested schemas in single columns, so I hope the problem redefinition makes sense. The answer would then be "By reusing the columns and adding a parent key".

Example with the above schema: the input data shall be {"Fritz", [Fritz-friend1: {"Walter", [Walter-friend1: {"Joe", null}, Walter-friend2: {"Jil", null}]}]} or, rendered as a tree:

Fritz
+--Walter
+--+--Joe
+--+--Jil

Converting that to a Parquet structure:

||id||name||friend||root||
|1|Fritz|2|true|
|2|Walter|3|false|
|3|Joe|null|false|
|4|Walter|5|false|
|5|Jil|null|false|

In other words:
# Whenever you find a record definition in the Avro schema for the first time, create it.
# If that record definition is reused by name (without an actual definition of the fields), its data goes into the already created structure.
# If such a schema is used more than once, add the id and root columns. The id can hopefully be a Parquet-internal pointer exposed as a number.

The nice thing about this is that it delays the max-depth decision to Spark query time. Such a table is easy to read with Spark and supports the full flexibility available in tree structures, including unbalanced trees, recursions in the data, ... everything. And you do not need to change Spark or Parquet itself; it is just logic within the AvroWriter. How does that sound?

> AvroParquetWriter can't save object who can have link to object of same type
>
> Key: PARQUET-129
> URL: https://issues.apache.org/jira/browse/PARQUET-129
> Project: Parquet
> Issue Type: Bug
> Environment: parquet version 1.5.0
> Reporter: Dmitriy
>
> When i try to write instance of UserTestOne
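To make the flattening proposed in the comment above concrete, here is a rough sketch over a toy node type; the User class, the emit method, and all other names are invented for the example, and real code would walk Avro records instead:

{code:java}
import java.util.ArrayList;
import java.util.List;

public class FlattenSketch {

    // Toy stand-in for an Avro record with a recursive "friend" field.
    static class User {
        final String name;
        final List<User> friends = new ArrayList<>();
        User(String name) { this.name = name; }
    }

    static int nextId = 1;
    static final List<String> rows = new ArrayList<>();

    // Depth-first: every record occurrence gets an id, and the "friend"
    // column stores the child's id rather than the nested record itself.
    // A node with several friends is emitted once per edge, which is what
    // yields the two Walter rows in the table above.
    static void emit(User u, boolean root) {
        if (u.friends.isEmpty()) {
            rows.add("|" + nextId++ + "|" + u.name + "|null|" + root + "|");
            return;
        }
        for (User f : u.friends) {
            int myId = nextId++;
            int childId = nextId;  // the child is emitted next, so it takes the next id
            rows.add("|" + myId + "|" + u.name + "|" + childId + "|" + root + "|");
            emit(f, false);
        }
    }

    public static void main(String[] args) {
        User fritz = new User("Fritz");
        User walter = new User("Walter");
        walter.friends.add(new User("Joe"));
        walter.friends.add(new User("Jil"));
        fritz.friends.add(walter);
        emit(fritz, true);
        rows.forEach(System.out::println);
    }
}
{code}

Running it prints exactly the five rows of the table above.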
[jira] [Commented] (PARQUET-129) AvroParquetWriter can't save object who can have link to object of same type
[ https://issues.apache.org/jira/browse/PARQUET-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16311732#comment-16311732 ] Werner Daehn commented on PARQUET-129:
--
Gee, I am hitting the same old issues. Feels like I am stalking Ryan ;-)

Kidding aside, I would like to reopen this. If there is valid Avro data and I want to persist it in Parquet, what should I do? Failing at the Parquet write is too late; it would need to fail already when the Avro message is created. Imagine you have a Kafka server and people put all kinds of Avro messages into it, with Parquet as the long-term persistence. The solution with the max-depth has the character of a workaround. Yes, it helps for 90% of the cases, but the same argument as before applies.

Given my limited understanding of Parquet, I would redefine the problem as "How would you store that in a relational table?". Parquet wants to store all primitive fields of schemas and their nested schemas in single columns, so I hope the problem redefinition makes sense. The answer would then be "By reusing the columns and adding a parent key".

Example with the above schema: the input data shall be {"Fritz", [Fritz-friend1: {"Walter", [Walter-friend1: {"Joe", null}, Walter-friend2: {"Jil", null}]}]} or, rendered as a tree:

Fritz
+--Walter
+--+--Joe
+--+--Jil

Converting that to a Parquet structure:

||id||name||friend||root||
|1|Fritz|2|true|
|2|Walter|3|false|
|3|Joe|null|false|
|4|Walter|5|false|
|5|Jil|null|false|

In other words:
# Whenever you find a record definition in the Avro schema for the first time, create it.
# If that record definition is reused by name (without an actual definition of the fields), its data goes into the already created structure.
# If such a schema is used more than once, add the id and root columns. The id can hopefully be a Parquet-internal pointer exposed as a number.

The nice thing about this is that it delays the max-depth decision to Spark query time. Such a table is easy to read with Spark and supports the full flexibility available in tree structures, including unbalanced trees, recursions in the data, ... everything. And you do not need to change Spark or Parquet itself; it is just logic within the AvroWriter. How does that sound?

> AvroParquetWriter can't save object who can have link to object of same type
>
> Key: PARQUET-129
> URL: https://issues.apache.org/jira/browse/PARQUET-129
> Project: Parquet
> Issue Type: Bug
> Environment: parquet version 1.5.0
> Reporter: Dmitriy
>
> When i try to write instance of UserTestOne created from following schema
> {"namespace": "com.example.avro",
> "type": "record",
> "name": "UserTestOne",
> "fields": [{"name": "name", "type": "string"}, {"name": "friend", "type": ["null", "UserTestOne"], "default":null} ]
> }
> I get java.lang.StackOverflowError

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
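For reference, the failure in the quoted report can be reproduced without any I/O, since the overflow already happens while converting the recursive Avro schema to a Parquet schema; this is a sketch, with the schema JSON copied from the report and the class name invented:

{code:java}
import org.apache.avro.Schema;
import org.apache.parquet.avro.AvroSchemaConverter;

public class RecursiveSchemaRepro {
    public static void main(String[] args) {
        // Schema copied from the issue report: UserTestOne references itself.
        String json = "{\"namespace\": \"com.example.avro\","
            + " \"type\": \"record\", \"name\": \"UserTestOne\","
            + " \"fields\": [{\"name\": \"name\", \"type\": \"string\"},"
            + " {\"name\": \"friend\", \"type\": [\"null\", \"UserTestOne\"],"
            + " \"default\": null}]}";
        Schema avro = new Schema.Parser().parse(json);
        // The converter recurses into the "friend" branch for every level of
        // nesting; a self-referencing record has no base case, so this call
        // never returns normally.
        new AvroSchemaConverter().convert(avro);  // throws java.lang.StackOverflowError
    }
}
{code}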
[jira] [Created] (PARQUET-1184) Make DelegatingPositionOutputStream a concrete class
Werner Daehn created PARQUET-1184:
-
Summary: Make DelegatingPositionOutputStream a concrete class
Key: PARQUET-1184
URL: https://issues.apache.org/jira/browse/PARQUET-1184
Project: Parquet
Issue Type: Improvement
Components: parquet-avro
Affects Versions: 1.9.1
Reporter: Werner Daehn
Fix For: 1.10.0

I fail to understand why this is an abstract class. In my example I want to write the Parquet file to a java.io.FileOutputStream, hence I have to extend DelegatingPositionOutputStream, store the position myself, increase it in all write(..) methods, and return its value in getPos(). Doable of course, but useful? Previously yes, but now, with the OutputFile changes that decouple Parquet further from Hadoop, I believe not.

Related to: https://issues.apache.org/jira/browse/PARQUET-1142

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
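The boilerplate being complained about looks roughly like this; a sketch, with the class name invented:

{code:java}
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.parquet.io.DelegatingPositionOutputStream;

// What every caller currently has to write: wrap the target stream and
// count bytes by hand, because getPos() is abstract.
public class LocalPositionOutputStream extends DelegatingPositionOutputStream {
    private long pos = 0;

    public LocalPositionOutputStream(FileOutputStream out) {
        super(out);
    }

    @Override
    public long getPos() {
        return pos;
    }

    @Override
    public void write(int b) throws IOException {
        super.write(b);
        pos++;
    }

    @Override
    public void write(byte[] b) throws IOException {
        super.write(b);
        pos += b.length;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        super.write(b, off, len);
        pos += len;
    }
}
{code}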
[jira] [Created] (PARQUET-1183) AvroParquetWriter needs OutputFile based Builder
Werner Daehn created PARQUET-1183:
-
Summary: AvroParquetWriter needs OutputFile based Builder
Key: PARQUET-1183
URL: https://issues.apache.org/jira/browse/PARQUET-1183
Project: Parquet
Issue Type: Improvement
Components: parquet-avro
Affects Versions: 1.9.1
Reporter: Werner Daehn
Fix For: 1.10.0

The ParquetWriter got a new Builder(OutputFile), but it cannot be used by the AvroParquetWriter as there is no matching Builder/Constructor. Changes are quite simple:

public static Builder builder(OutputFile file) {
  return new Builder(file);
}

and in the static Builder class below:

private Builder(OutputFile file) {
  super(file);
}

Note: I am not good enough with builds, maven and git to create a pull request yet. Sorry. Will try to get better here.

See: https://issues.apache.org/jira/browse/PARQUET-1142

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
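In context, the two additions would look roughly like this; a sketch against the AvroParquetWriter class layout of that era, where the elisions stand for the existing Path-based constructor and builder methods:

{code:java}
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.io.OutputFile;

public class AvroParquetWriter<T> extends ParquetWriter<T> {

    // New factory method, mirroring the existing builder(Path).
    public static <T> Builder<T> builder(OutputFile file) {
        return new Builder<T>(file);
    }

    public static class Builder<T> extends ParquetWriter.Builder<T, Builder<T>> {

        // New constructor delegating to ParquetWriter.Builder(OutputFile).
        private Builder(OutputFile file) {
            super(file);
        }

        // ... the existing Path-based constructor, withSchema()/withDataModel(),
        // and the self()/getWriteSupport() overrides stay unchanged ...
    }
}
{code}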
[jira] [Comment Edited] (PARQUET-1142) Avoid leaking Hadoop API to downstream libraries
[ https://issues.apache.org/jira/browse/PARQUET-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308295#comment-16308295 ] Werner Daehn edited comment on PARQUET-1142 at 1/2/18 4:24 PM:
---
Thanks Ryan, I believe this is one of the most important areas where Parquet can get better. Do you happen to have some sample code on how to use that?

My current approach is to extend ParquetWriter, including an inner Builder class extending ParquetWriter.Builder. Where I am stuck is the Builder's WriteSupport with its Hadoop Configuration parameter. It has to be implemented, but you talk about ParquetReadOptions? And my class needs a constructor, but all visible super()'s are Hadoop based. Any hints would be greatly appreciated.

Background: In my world there is no such thing as Hadoop, but many different kinds of file storage options, HDFS being one of them. Therefore I would be in favor of removing all Hadoop-related stuff from Parquet and rather having a java.io.OutputStream (= org.apache.parquet.io.OutputFile) which can then be used to store files in different locations: HDFS full client, HDFS web client, S3, local file system.

My ultimate goal for the moment is to have an AvroParquetWriter without Hadoop that saves to a local file system in a streaming way. This would then be used to consume the Kafka data of the last 24 hours and write one large Parquet file. The next step would be a matching reader that pushes down simple filter criteria. Best would be one that even allows primary-key lookups in a reasonable amount of time.

was (Author: wdaehn):
Thanks Ryan, I believe this is one of the most important areas where Parquet can get better. Do you happen to have some sample code on how to use that? My current approach is to extend ParquetWriter including the an inner Builder class extending ParquetWriter.Builder. Where I am stuck is the Builder.WriteSupport with its Hadoop Configuration parameter. It has to be implemented but you talk about ParquetReadOptions? And my class needs a constructor but all visible super()'s are Hadoop based. Any hints would be greatly appreciated. Background: In my world there is no such thing as Hadoop but many different kinds of file storage options, HDFS being one of those. Therefore I would be in favor of removing all Hadoop related stuff from Parquet and rather have a java.io.OutputStream (=org.apache.parquet.io.OutputFile) which then can be used to store files in different locations, HDFS full client, HDFS web client, S3, local file system. My ultimate goal for the moment is to have a AvroParquetWriter without Hadoop to save to a local file system in a streaming way. This would then be used to consume the Kafka data of the last 24 hours and write one large Parquet file. Next step would be a matching reader that does pushdown simple filter criteria. Best would be one that allows even primary key lookups in a reasonable amount of time.

> Avoid leaking Hadoop API to downstream libraries
>
> Key: PARQUET-1142
> URL: https://issues.apache.org/jira/browse/PARQUET-1142
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.9.0
> Reporter: Ryan Blue
> Assignee: Ryan Blue
> Fix For: 1.10.0
>
> Parquet currently leaks the Hadoop API by requiring callers to pass {{Path}} and {{Configuration}} instances, and by using Hadoop codecs. {{InputFile}} and {{SeekableInputStream}} add alternatives to Hadoop classes in some parts of the read path, but this needs to be extended to the write path and to avoid passing options through {{Configuration}}.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PARQUET-1142) Avoid leaking Hadoop API to downstream libraries
[ https://issues.apache.org/jira/browse/PARQUET-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308295#comment-16308295 ] Werner Daehn commented on PARQUET-1142:
---
Thanks Ryan, I believe this is one of the most important areas where Parquet can get better. Do you happen to have some sample code on how to use that?

My current approach is to extend ParquetWriter, including an inner Builder class extending ParquetWriter.Builder. Where I am stuck is the Builder's WriteSupport with its Hadoop Configuration parameter. It has to be implemented, but you talk about ParquetReadOptions? And my class needs a constructor, but all visible super()'s are Hadoop based. Any hints would be greatly appreciated.

Background: In my world there is no such thing as Hadoop, but many different kinds of file storage options, HDFS being one of them. Therefore I would be in favor of removing all Hadoop-related stuff from Parquet and rather having a java.io.OutputStream (= org.apache.parquet.io.OutputFile) which can then be used to store files in different locations: HDFS full client, HDFS web client, S3, local file system.

My ultimate goal for the moment is to have an AvroParquetWriter without Hadoop that saves to a local file system in a streaming way. This would then be used to consume the Kafka data of the last 24 hours and write one large Parquet file. The next step would be a matching reader that pushes down simple filter criteria. Best would be one that even allows primary-key lookups in a reasonable amount of time.

> Avoid leaking Hadoop API to downstream libraries
>
> Key: PARQUET-1142
> URL: https://issues.apache.org/jira/browse/PARQUET-1142
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.9.0
> Reporter: Ryan Blue
> Assignee: Ryan Blue
> Fix For: 1.10.0
>
> Parquet currently leaks the Hadoop API by requiring callers to pass {{Path}} and {{Configuration}} instances, and by using Hadoop codecs. {{InputFile}} and {{SeekableInputStream}} add alternatives to Hadoop classes in some parts of the read path, but this needs to be extended to the write path and to avoid passing options through {{Configuration}}.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
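A Hadoop-free local-file setup along the lines described above could look like this; a sketch against the org.apache.parquet.io.OutputFile interface added by this ticket, with the class name invented and the create-vs-exists contract simplified:

{code:java}
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;

// A minimal local-filesystem OutputFile; error handling is omitted.
public class LocalOutputFile implements OutputFile {
    private final String path;

    public LocalOutputFile(String path) {
        this.path = path;
    }

    @Override
    public PositionOutputStream create(long blockSizeHint) throws IOException {
        return wrap(new FileOutputStream(path));
    }

    @Override
    public PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException {
        return wrap(new FileOutputStream(path, false));
    }

    @Override
    public boolean supportsBlockSize() {
        return false;  // a local file has no HDFS-style block size
    }

    @Override
    public long defaultBlockSize() {
        return 0;
    }

    // Track the write position ourselves, as PositionOutputStream requires.
    private static PositionOutputStream wrap(FileOutputStream out) {
        return new PositionOutputStream() {
            private long pos = 0;

            @Override
            public long getPos() {
                return pos;
            }

            @Override
            public void write(int b) throws IOException {
                out.write(b);
                pos++;
            }

            @Override
            public void write(byte[] b, int off, int len) throws IOException {
                out.write(b, off, len);
                pos += len;
            }

            @Override
            public void close() throws IOException {
                out.close();
            }
        };
    }
}
{code}

Combined with an OutputFile-based AvroParquetWriter builder (PARQUET-1183), this would allow writing a Parquet file to the local file system without going through a Hadoop Path or Configuration.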