[
https://issues.apache.org/jira/browse/BEAM-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440618#comment-17440618
]
Ryan Clough edited comment on BEAM-13150 at 11/17/21, 6:17 PM:
---------------------------------------------------------------
Thanks [~bhulette] for rightfully moving this into its own issue - I'll
continue the discussion from
[BEAM-12955|https://issues.apache.org/jira/browse/BEAM-12955] to add some
context/color.
I agree that, on further reading/understanding of 12955, tf.train.Example
is a distinct use case from protos in general. It is true that TF Examples are
just really flexible protos, that depend on a [separate schema
proto|https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/schema.proto]
to define the structure for any given set of TF Examples. In the context of
TFX, most components that process TF Examples also require the schema as input,
probably for this reason.
I think it would be a pretty big benefit if we could find a way to make this
work at least somewhat seamlessly with the Beam dataframe API. I suspect this
could be done with either a dedicated reader, a flag on the existing reader, or
some kind of standardized map that uses the existing TFRecord reader and then
maps its output to a schema-aware PCollection, which can then be converted to a
dataframe. Alternatively, maybe there's a way to make TFX-BSL's pyarrow
RecordBatches qualify as a schema-aware PCollection and get the conversion
utilities "for free".
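To make the "standardized map" idea concrete, here's a minimal sketch of the per-record mapping step. Everything here is hypothetical: the field names, types, and the `features_to_row` helper are made up for illustration, and it assumes the parsed tf.train.Example has already been reduced to a plain Python dict of values:

```python
import typing

# Hypothetical row type matching this dataset's TF schema; the field
# names and types are invented for the example.
class ExampleRow(typing.NamedTuple):
    user_id: str
    score: float

def features_to_row(features):
    """Map one parsed feature dict (e.g. the output of
    tf.io.parse_single_example, already reduced to plain Python values)
    to a typed row so Beam can infer a schema for the PCollection."""
    return ExampleRow(
        user_id=str(features["user_id"]),
        score=float(features["score"]),
    )
```

In a pipeline this would sit between the existing reader and the dataframe conversion, roughly `ReadFromTFRecord(...) | beam.Map(parse) | beam.Map(features_to_row).with_output_types(ExampleRow)` followed by `to_dataframe(...)` — which assumes the NamedTuple-based schema inference picks up the mapped output, which is exactly the part I haven't managed to verify yet.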
My personal interest/use case is that at the company I currently work for, we
have largely standardized on TFX, which uses TFRecords as the data format and
Beam for many of the data processing operations (as provided by TFX), but the
Beam aspect is almost entirely abstracted away from the end user. Thus we are
currently in a state where users must choose one of three options:
# Users must make use of an existing TFX component to process their data
(transformation, model evaluation/metrics generation, etc), where many of the
API abstractions are either too limiting, or too difficult to adapt for more
complex use cases (and debugging beam code you didn't write is very difficult
when all you know is the abstraction)
# Users must write single-threaded, i.e. script-like, processing code, and
thus are limited on compute/memory
# Users must learn and express their data processing in another framework
(including beam as an example option)
We're currently stuck between options 1 and 2, as most of our users don't have
the time to learn Beam on top of their existing priorities. My hope was to use the
dataframe API to bridge the gap for #3 - users would only need to define
dataframe API operations, and my team could hopefully abstract away most of the
beam aspects, which would allow for much more scalable and flexible data
processing.
I took a hack week to explore this approach and ran into issues trying to map
my input dataset (tf records) to a schema-aware pcoll, which is how I ended up
here :) I attempted an approach similar to [this SO
post|https://stackoverflow.com/questions/68537184/how-to-unpack-dictionary-values-inside-a-beam-map-with-python-apache-beam],
and ran into the same issue, but unfortunately the original poster didn't
leave much context on how they solved their issue. In retrospect, I think the
true solution will involve some more advanced input processing with the TF
schema as input, as described above. If I have time to delve into this, I'll
share anything I'm able to get working, which may help push us towards a
working solution for this issue.
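On the "more advanced input processing with the TF schema as input" idea: if the schema proto is available, the row type wouldn't even have to be hand-written; it could be generated from the schema. A rough pure-Python sketch of that, using a simplified `{name: type-string}` stand-in for the real schema.proto (the actual Feature messages would need tensorflow_metadata, so the names here are illustrative only):

```python
import typing

# Simplified stand-in for the schema proto's feature types; the real
# schema.proto uses a FeatureType enum with BYTES/INT/FLOAT values.
_TYPE_MAP = {"BYTES": str, "INT": int, "FLOAT": float}

def row_type_from_schema(feature_types):
    """Build a NamedTuple row type from a {feature_name: type_string}
    mapping, so a PCollection of these rows is schema-aware."""
    fields = [(name, _TYPE_MAP[t]) for name, t in feature_types.items()]
    return typing.NamedTuple("SchemaRow", fields)

# Example: a two-feature schema produces a two-field row type that could
# be passed to .with_output_types(...) on the mapping step.
Row = row_type_from_schema({"user_id": "BYTES", "score": "FLOAT"})
```

This is only a sketch of the shape of the solution; the real work would be walking the schema proto's Feature messages (including valency/shape) rather than a flat dict.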
> Integrate TFRecord/tf.train.Example with Beam Schemas and the DataFrame API
> ---------------------------------------------------------------------------
>
> Key: BEAM-13150
> URL: https://issues.apache.org/jira/browse/BEAM-13150
> Project: Beam
> Issue Type: Improvement
> Components: dsl-dataframe, sdk-py-core
> Reporter: Brian Hulette
> Assignee: Brian Hulette
> Priority: P2
>
> See discussion in BEAM-12995