[ 
https://issues.apache.org/jira/browse/PARQUET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933630#comment-14933630
 ] 

Cheng Lian edited comment on PARQUET-379 at 9/28/15 5:34 PM:
-------------------------------------------------------------

While trying to fix this issue, I ran into a problem with the {{strict}} 
argument of {{PrimitiveType.union}} and, correspondingly, 
{{MessageType.union}}.  It seems that throughout the whole parquet-mr code base 
(including tests), we always call these methods with {{strict}} set to {{true}}, 
which means the primitive types of the two schemas must match.

Maybe I missed something here, but I don't see a sound use case for non-strict 
schema merging.  In particular, the field types of {{t1.union(t2, false)}} are 
completely determined by {{t1}}, rather than being the "wider" of the two types:
{noformat}
message t1 { required int32 f; }
message t2 { required int64 f; }

t1.union(t2, false) =>
  message t3 { required int32 f; }
{noformat}
Basically, we can't use such a schema to read actual Parquet files, even if we 
add some sort of automatic "type widening" logic inside Parquet readers, since 
the merged schema above narrows {{f}} to {{int32}} and so loses precision.
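
To make the difference concrete, here is a minimal, self-contained Java sketch. It does not use parquet-mr: {{Prim}}, {{unionLeftWins}}, and {{unionWidest}} are hypothetical names that only model the behavior described above (left type wins when {{strict}} is false) next to a "widest type wins" alternative. Strict mode is assumed to reject mismatches by throwing.

```java
public class UnionSketch {
    // Hypothetical stand-in for Parquet primitive type names, ordered narrow to wide.
    enum Prim { INT32, INT64 }

    // Models the behavior described above: with strict == false the left type
    // is kept on a mismatch. Throwing in strict mode is an assumption of this sketch.
    static Prim unionLeftWins(Prim left, Prim right, boolean strict) {
        if (left != right && strict) {
            throw new IllegalArgumentException("cannot merge " + left + " with " + right);
        }
        return left;
    }

    // Hypothetical alternative: keep the wider of the two types.
    static Prim unionWidest(Prim left, Prim right) {
        return left.ordinal() >= right.ordinal() ? left : right;
    }

    public static void main(String[] args) {
        System.out.println(unionLeftWins(Prim.INT32, Prim.INT64, false)); // INT32
        System.out.println(unionWidest(Prim.INT32, Prim.INT64));          // INT64
    }
}
```

In this model only the second variant yields a type that can hold every value from both inputs, which is the property a reader with type widening would need.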

So my questions are:
# Is there a practical scenario where non-strict schema merging makes sense?
# If not, should we deprecate it? (We can't remove it since 
{{MessageType.union(MessageType, boolean)}} is part of the public API.)




> PrimitiveType.union erases original type
> ----------------------------------------
>
>                 Key: PARQUET-379
>                 URL: https://issues.apache.org/jira/browse/PARQUET-379
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0
>            Reporter: Cheng Lian
>
> The following ScalaTest test case
> {code}
>   test("merge primitive types") {
>     val expected =
>       Types.buildMessage()
>         .addField(
>           Types
>             .required(INT32)
>             .as(DECIMAL)
>             .precision(7)
>             .scale(2)
>             .named("f"))
>         .named("root")
>     assert(expected.union(expected) === expected)
>   }
> {code}
> produces the following assertion error
> {noformat}
> message root {
>   required int32 f;
> }
>  did not equal message root {
>   required int32 f (DECIMAL(9,0));
> }
> {noformat}
> This is because {{PrimitiveType.union}} doesn't handle the original type 
> properly. An open question is whether two primitive types with the same 
> primitive type name but different original types can be unioned.



