[
https://issues.apache.org/jira/browse/BEAM-12955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435996#comment-17435996
]
Ryan Clough commented on BEAM-12955:
------------------------------------
They are indeed protobufs. From the TensorFlow docs link you pasted:
{noformat}
The tf.train.Example message (or protobuf) is a flexible message type that
represents a {"string": value} mapping.{noformat}
TFRecord is just a file format that consists of a series of TF Example
protos serialized to disk. When you read with
[beam.io.tfrecordio.ReadFromTFRecord|https://beam.apache.org/releases/pydoc/2.16.0/apache_beam.io.tfrecordio.html],
the output is a PCollection of byte-serialized TF Examples, unless you give it a
proto coder to materialize them into actual proto objects.
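A minimal sketch of what that decoding step looks like, assuming apache-beam and tensorflow are installed (the {{decode_example}} helper name is mine, not a Beam API):
{code:python}
# Sketch: what a proto coder does to the raw bytes that ReadFromTFRecord
# emits by default. Assumes apache-beam and tensorflow are installed.
from apache_beam.coders.coders import ProtoCoder
from tensorflow.core.example import example_pb2

# ProtoCoder wraps the generated message class (here tf.train.Example).
example_coder = ProtoCoder(example_pb2.Example)

def decode_example(raw_bytes):
    # Parses serialized bytes back into a tf.train.Example proto object.
    return example_coder.decode(raw_bytes)
{code}
Passing the same coder to the read itself (via ReadFromTFRecord's {{coder}} parameter) applies this decoding during the read, so downstream transforms see proto objects instead of bytes.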
> Add support for inferring Beam Schemas from Python protobuf types
> -----------------------------------------------------------------
>
> Key: BEAM-12955
> URL: https://issues.apache.org/jira/browse/BEAM-12955
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-core
> Reporter: Brian Hulette
> Assignee: Svetak Vihaan Sundhar
> Priority: P2
> Labels: stale-assigned
>
> Just as we can infer a Beam Schema from a NamedTuple type
> ([code|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/schemas.py]),
> we should have support for inferring a schema from a [protobuf-generated
> Python
> type|https://developers.google.com/protocol-buffers/docs/pythontutorial].
> This should integrate well with the rest of the schema infrastructure. For
> example it should be possible to use schema-aware transforms like
> [SqlTransform|https://beam.apache.org/releases/pydoc/2.32.0/apache_beam.transforms.sql.html#apache_beam.transforms.sql.SqlTransform],
>
> [Select|https://beam.apache.org/releases/pydoc/2.32.0/apache_beam.transforms.core.html#apache_beam.transforms.core.Select],
> or
> [beam.dataframe.convert.to_dataframe|https://beam.apache.org/releases/pydoc/2.32.0/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe]
> on a PCollection that is annotated with a protobuf type. For example (using
> the addressbook_pb2 example from the
> [tutorial|https://developers.google.com/protocol-buffers/docs/pythontutorial#reading-a-message]):
> {code:python}
> import addressbook_pb2
> import apache_beam as beam
> from apache_beam.dataframe.convert import to_dataframe
> pc = (input_pc |
> beam.Map(create_person).with_output_types(addressbook_pb2.Person))
> df = to_dataframe(pc) # deferred dataframe with fields id, name, email, ...
> # OR
> pc | beam.transforms.SqlTransform(
>     "SELECT name FROM PCOLLECTION WHERE email = '[email protected]'")
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)