[ 
https://issues.apache.org/jira/browse/PARQUET-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166579#comment-17166579
 ] 

Aaron Blake Niskode-Dossett commented on PARQUET-1684:
------------------------------------------------------

[~gszadovszky] Thank you for following up, I really appreciate it.  I am also 
not a protobuf expert, but I'll do my best here.

 

Is there a workaround?  I don't believe so.  Here's an example that I hope 
is useful.

I defined this protobuf:

 
{code:java}
message Person {
  int32 foo = 1;
  oneof optional_bar {
    int32 bar_int = 200;
    string bar_string = 201;
  }
}
{code}
 

 

And I wrote some simple code to populate three instances of it (below) and 
write them to parquet.

 
{code:java}
for (int i = 0; i < 3; i += 1) {
  // On the first iteration foo is 0, the proto3 default for int32.
  com.etsy.grpcparquet.Person message = Person.newBuilder()
      .setFoo(i)
      .setBarString("hello world")
      .build();
  message.writeDelimitedTo(out);
}
{code}
 

The parquet looks like this:

 
{code:java}
$ parquet-tools show example.parquet
+-------+-----------+--------------+
| foo   | bar_int   | bar_string   |
|-------+-----------+--------------|
| nan   | nan       | hello world  |
| 1     | nan       | hello world  |
| 2     | nan       | hello world  |
+-------+-----------+--------------+
 
{code}
 

In the first row, the fact that foo was set to zero has been lost: it is 
stored as null.  The `bar_int` column shows what a genuinely null column 
looks like.  The results are similar in a system like BigQuery:

!image-2020-07-28-12-24-05-087.png!
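
To make the distinction between the two columns concrete, here is a minimal sketch in plain Java, with no protobuf or parquet dependency.  It is only a model of the behaviour shown in the table, not the parquet-protobuf code: the class and method names are hypothetical, and it assumes only that a proto3 scalar such as `foo` does not record whether it was set, while a oneof member such as `bar_string` does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DefaultValueDemo {

    // Models what a proto3 Person reports as "set" (hypothetical helper).
    // foo has no presence tracking, so a value equal to its default (0)
    // is not reported at all. bar_string is a oneof member, so it is
    // reported whenever it has been chosen.
    static Map<String, Object> reportedFields(int foo, String barString) {
        Map<String, Object> fields = new LinkedHashMap<>();
        if (foo != 0) {
            fields.put("foo", foo);
        }
        if (barString != null) {
            fields.put("bar_string", barString);
        }
        return fields;
    }

    // A writer that emits only reported fields: every other column is null.
    static Map<String, Object> rowFromReportedOnly(Map<String, Object> reported) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (String column : new String[] {"foo", "bar_int", "bar_string"}) {
            row.put(column, reported.get(column));
        }
        return row;
    }

    // A writer that falls back to the type default for fields without
    // presence (foo) and still leaves unselected oneof members null.
    static Map<String, Object> rowWithDefaults(Map<String, Object> reported) {
        Map<String, Object> row = rowFromReportedOnly(reported);
        if (row.get("foo") == null) {
            row.put("foo", 0);
        }
        return row;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            Map<String, Object> reported = reportedFields(i, "hello world");
            System.out.println(rowFromReportedOnly(reported)
                + " vs " + rowWithDefaults(reported));
        }
    }
}
```

In this model the first writer reproduces the table above (foo is null in the first row), while the second keeps foo as 0 and still leaves `bar_int` null, which is the outcome I would expect from a fix.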

 

Would this cause a potential regression?  If someone were relying on default 
values being encoded as nulls it would, but that seems hard to imagine, to be 
honest.

> [parquet-protobuf] default protobuf field values are stored as nulls
> --------------------------------------------------------------------
>
>                 Key: PARQUET-1684
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1684
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.10.0, 1.11.0
>            Reporter: George Haddad
>            Assignee: Priyank Bagrecha
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
>         Attachments: image-2020-07-28-12-24-05-087.png
>
>
> When the source is a protobuf3 message, and the target file is Parquet, all 
> the default values are stored in the output parquet as {{null}} instead of 
> the actual type's default value.
>  For example, if the field is of type `int32`, `double` or `enum` and it 
> hasn't been set, the parquet value is {{null}} instead of `0`. When the 
> field's type is a `string` that hasn't been set, the parquet value is 
> {{null}} instead of an empty string.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
