Hi Qinghui:

Thanks for the detailed extra explanations.  Yes, we found (in our use case)
that de-nesting these wrapper messages has great benefits in terms of
user-friendliness. For example, in the current Parquet writer, a protobuf
field UUID defined as StringValue would be written as a UUID group plus
UUID.value as binary (string). When querying from Hive/Presto, it is much
easier for data engineers/scientists to refer to such a field directly,
say "UUID=xxxxx" rather than "UUID.value=xxxxx".  A similar rationale also
applies to a few other protobuf well-known types (WKTs) such as Timestamp
and Duration.
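
To make the mapping concrete, here is a minimal sketch (the message and
field names are purely illustrative, and the Parquet schemas below are
approximate renderings of what the converter produces):

```
// Protobuf definition using a wrapper type for a nullable string field
import "google/protobuf/wrappers.proto";

message Event {
  google.protobuf.StringValue uuid = 1;
}
```

Today parquet-protobuf converts this faithfully, so the wrapper becomes an
intermediate group in the Parquet schema:

```
message Event {
  optional group uuid {
    optional binary value (UTF8);
  }
}
```

With de-nesting enabled, the field would instead be written directly as:

```
message Event {
  optional binary uuid (UTF8);
}
```

so queries can filter on "uuid" rather than "uuid.value".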

Good observation on maintaining compatibility during the reading process.
Although not a must in our use case, I can see there is value in an
end-to-end solution that allows de-nested fields to be read back
consistently according to their protobuf definitions.  For now, I would
assume a similar configuration could be applied on the reader side,
allowing de-nested fields to be mapped back to their original types. The
reader is given the original protobuf definition and hence should be able
to detect the discrepancy between the de-nested Parquet data and the
nested protobuf schema.
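
For illustration only, such a reader-side switch might look like the
following (the property name is hypothetical, not an existing parquet-mr
option):

```
# Hypothetical reader configuration: re-wrap de-nested columns so records
# deserialize against the original (nested) protobuf definition.
parquet.proto.rewrapDenestedFields=true
```

The reader would then compare each requested proto field against the file
schema: if the proto field is a wrapper message but the corresponding
Parquet column is a primitive, it knows the column was de-nested at write
time and can reconstruct the wrapper around the value.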

Other thoughts and comments are welcome as well.

-
Ying


On Mon, Jun 17, 2019 at 5:10 AM XU Qinghui <[email protected]> wrote:

> Hello, Ying
>
> From my own experience, the proposal seems interesting. To give some more
> context about this "protobuf wrapper" for people who are not familiar
> with it: protobuf3 drops support for "null" semantics for primitives,
> both in its wire format and in its API. For people who wish to have
> nullable fields, it provides "wrapper" types that nest the primitive
> fields inside a struct. The current parquet-protobuf implementation
> converts the protobuf schema to a Parquet schema faithfully, so all the
> wrappers become intermediate structs in the Parquet field path.
> De-nesting those wrappers should make the Parquet file (schema) easier to
> use.
> In the meantime, it seems to me the proposal is focused mostly on
> writing. Maybe it is worth thinking about how to make reading
> backward/forward compatible.
>
> cc @lukasnalezenec @zivanfi @rdblue
>
> Best regards,
>
>
> On Fri, Jun 14, 2019 at 02:42, ying <[email protected]> wrote:
>
> > Dear Parquet community:
> >
> > We are working on a data pipeline which takes in protobuf data and
> > writes it out as Parquet. Currently we take advantage of the Parquet
> > proto writer support
> > <https://github.com/apache/parquet-mr/tree/master/parquet-protobuf>.
> >
> > While the existing Parquet protobuf writer preserves all the message
> > structure of a protobuf definition, in our case users often prefer
> > de-nesting the protobuf wrapper classes and filling the same field with
> > just its "value" data.  We have implemented some basic functionality to
> > achieve this on top of the existing Parquet-proto writer. For details,
> > please refer to PARQUET-1595
> > <https://issues.apache.org/jira/browse/PARQUET-1595>.
> >
> > We would like to solicit comments, and would be happy to contribute if
> > the community thinks this is a sound idea to pursue.  Any comments or
> > pointers to related prior discussions are welcome.
> >
> > Thanks!
> >
> > -
> > Ying
> >
>
