[
https://issues.apache.org/jira/browse/PARQUET-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166579#comment-17166579
]
Aaron Blake Niskode-Dossett commented on PARQUET-1684:
------------------------------------------------------
[~gszadovszky] Thank you for following up, I really appreciate it. I am also
not a protobuf expert, but I'll do my best here.
Is there a workaround? I don't believe so. Here's an example that I hope
is useful.
I defined this protobuf:
{code:java}
message Person {
  int32 foo = 1;
  oneof optional_bar {
    int32 bar_int = 200;
    string bar_string = 201;
  }
}
{code}
And I wrote some simple code (below) to populate three instances of it and
write them to Parquet.
{code:java}
for (int i = 0; i < 3; i += 1) {
    com.etsy.grpcparquet.Person message = Person.newBuilder()
        .setFoo(i)
        .setBarString("hello world")
        .build();
    message.writeDelimitedTo(out);
}
{code}
The parquet looks like this:
{code:java}
$ parquet-tools show example.parquet
+-------+-----------+--------------+
|   foo |   bar_int | bar_string   |
|-------+-----------+--------------|
|   nan |       nan | hello world  |
|     1 |       nan | hello world  |
|     2 |       nan | hello world  |
+-------+-----------+--------------+
{code}
In the first row, the fact that foo was set to zero has been lost: it is stored
as null (shown as nan). The `bar_int` column shows what a genuinely null column
looks like, so the two cases cannot be told apart.
Similar results in a system like BigQuery:
!image-2020-07-28-12-24-05-087.png!
Would this cause a potential regression? If someone were relying on default
values being encoded as nulls, it would, but that seems very unlikely, to be
honest.
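To make the mechanism concrete, here is a minimal, library-free sketch of how I think the zero gets lost. It is not the parquet-protobuf code: the class {{DefaultValueSketch}} and all of its methods are hypothetical, and it assumes the writer only visits fields the message reports as set. In proto3, a plain scalar field set to its default value is not distinguishable from an unset one, while members of a oneof do track presence.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the Person message above; no protobuf or parquet classes are used.
public class DefaultValueSketch {
    // Field names in declaration order (stand-in for the message schema).
    static final List<String> SCHEMA = List.of("foo", "bar_int", "bar_string");

    // Models proto3 presence: a plain scalar equal to its default (0) is not
    // reported as set; oneof members are reported whenever they were assigned.
    static Map<String, Object> reportedFields(int foo, String barString) {
        Map<String, Object> set = new LinkedHashMap<>();
        if (foo != 0) {
            set.put("foo", foo);
        }
        if (barString != null) {
            set.put("bar_string", barString);
        }
        return set;
    }

    // Assumed buggy behavior: any field that is not reported becomes null.
    static Map<String, Object> writeReportedOnly(Map<String, Object> reported) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (String name : SCHEMA) {
            row.put(name, reported.get(name));
        }
        return row;
    }

    // One possible fix: walk the schema and fill the type default for plain
    // scalars, leaving unset oneof members as null.
    static Map<String, Object> writeWithDefaults(Map<String, Object> reported) {
        Map<String, Object> row = writeReportedOnly(reported);
        if (row.get("foo") == null) {
            row.put("foo", 0);
        }
        return row;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i += 1) {
            Map<String, Object> reported = reportedFields(i, "hello world");
            System.out.println(writeReportedOnly(reported) + "  vs  " + writeWithDefaults(reported));
        }
    }
}
```

With the second writer, only the first row changes: foo comes back as 0, and {{bar_int}} stays null because it really was never set.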
> [parquet-protobuf] default protobuf field values are stored as nulls
> --------------------------------------------------------------------
>
> Key: PARQUET-1684
> URL: https://issues.apache.org/jira/browse/PARQUET-1684
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.10.0, 1.11.0
> Reporter: George Haddad
> Assignee: Priyank Bagrecha
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.12.0
>
> Attachments: image-2020-07-28-12-24-05-087.png
>
>
> When the source is a protobuf3 message, and the target file is Parquet, all
> the default values are stored in the output parquet as `null` instead of
> the actual type's default value.
> For example, if the field is of type `int32`, `double` or `enum` and it
> hasn't been set, the parquet value is `null` instead of `0`. When the
> field's type is a `string` that hasn't been set, the parquet value is
> `null` instead of an empty string.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)