Hi Qinghui: Thanks for the detailed extra explanations. Yes, we found (in our use case) that de-nesting these wrapper messages brings great benefits in terms of user-friendliness. For example, with the current Parquet writer, a protobuf field UUID defined as StringValue is written as a UUID GROUP plus UUID.value as binary (string). When querying from Hive/Presto, it is much easier for data engineers/scientists to refer to such a field directly, say "UUID=xxxxx", rather than "UUID.value=xxxxx". A similar rationale applies to a few other protobuf well-known types (WKTs) such as Timestamp and Duration.
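To make this concrete, here is a rough sketch of the two schema shapes (field names are illustrative, and the exact Parquet schema text may differ slightly from what parquet-protobuf emits):

```
// Protobuf definition (illustrative):
//   import "google/protobuf/wrappers.proto";
//   message Event {
//     google.protobuf.StringValue uuid = 1;
//   }

// Parquet schema from the current writer (wrapper kept as a group):
message Event {
  optional group uuid {
    optional binary value (UTF8);
  }
}

// Parquet schema after de-nesting the wrapper:
message Event {
  optional binary uuid (UTF8);
}
```

With the de-nested form, a query can filter on uuid directly instead of uuid.value.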
Good observation on maintaining compatibility during the reading process. Although not a must in our use case, I can see the value of an end-to-end solution that allows de-nested fields to be read back consistently according to their protobuf definitions. For now, I would assume a similar configuration could be applied on the reader side, allowing de-nested fields to be mapped back to their original types. The reader is given the original protobuf definition and hence should be able to detect the discrepancy between the de-nested Parquet data and the nested protobuf schema. Other thoughts and comments are welcome as well.

- Ying

On Mon, Jun 17, 2019 at 5:10 AM XU Qinghui <[email protected]> wrote:

> Hello, Ying
>
> From my own experience, the proposal seems interesting. To give some more
> context about this "protobuf wrapper" for people who are not familiar with
> it: protobuf3 drops support for "null" semantics for primitives, both in
> its wire format and in its API. For people who wish to have nullable
> fields, it provides the "wrapper" types, which nest the primitive fields in
> a struct. The current parquet-protobuf implementation converts the protobuf
> schema to a Parquet schema faithfully, so all the wrappers become an
> intermediate struct in the Parquet field path. De-nesting those wrappers
> should make the Parquet file (schema) easier to use.
> In the meantime, it seems to me the proposal is more focused on writing.
> Maybe it is worth thinking about how to make reading backward/forward
> compatible.
>
> cc @lukasnalezenec @zivanfi @rdblue
>
> Best regards,
>
>
> On Fri, Jun 14, 2019 at 02:42, ying <[email protected]> wrote:
>
> > Dear Parquet community:
> >
> > We are working on a data pipeline which takes in protobuf data and
> > writes it out in Parquet. Currently we take advantage of the Parquet
> > proto writer support
> > <https://github.com/apache/parquet-mr/tree/master/parquet-protobuf>.
> >
> > While the existing Parquet protobuf writer preserves all the message
> > structure of a protobuf definition, in our case users often prefer
> > de-nesting the protobuf wrapper classes and filling the same field with
> > simply its "value" data. We have implemented some basic functionality to
> > achieve this on top of the existing Parquet-proto writer. For details,
> > please refer to PARQUET-1595
> > <https://issues.apache.org/jira/browse/PARQUET-1595>.
> >
> > We would like to solicit comments, and would be happy to contribute if
> > the community thinks this is a sound idea to pursue. Any comments or
> > pointers to related prior discussions are welcome.
> >
> > Thanks!
> >
> > -
> > Ying
> >
>
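P.S. The reader-side mapping discussed above could look roughly like this. This is a purely hypothetical sketch (not part of parquet-mr, written in Python only for brevity): given the original protobuf definition, the reader recognizes a wrapper-typed field whose Parquet column has been de-nested and re-wraps the value.

```python
# Hypothetical sketch: re-nesting a de-nested Parquet value into its
# protobuf wrapper shape. All names here are illustrative assumptions.

# Wrapper well-known types and the name of their single payload field.
WRAPPER_VALUE_FIELD = {
    "google.protobuf.StringValue": "value",
    "google.protobuf.Int64Value": "value",
    "google.protobuf.BoolValue": "value",
}

def rewrap(column_value, proto_type):
    """Map a de-nested Parquet column value back to the wrapper shape.

    A missing column value maps to a null wrapper, which is exactly the
    proto3 "unset" semantics the wrapper types exist to express.
    """
    if proto_type not in WRAPPER_VALUE_FIELD:
        return column_value          # not a wrapper type: pass through
    if column_value is None:
        return None                  # null column -> unset wrapper
    return {WRAPPER_VALUE_FIELD[proto_type]: column_value}

# Example: a de-nested string column read back as a StringValue-shaped dict.
print(rewrap("xxxxx", "google.protobuf.StringValue"))
# -> {'value': 'xxxxx'}
```

The same lookup table could drive the discrepancy detection: a field whose proto type is a wrapper but whose Parquet type is a primitive is exactly the de-nested case.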
