[
https://issues.apache.org/jira/browse/BEAM-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440618#comment-17440618
]
Ryan Clough edited comment on BEAM-13150 at 11/17/21, 6:17 PM:
---------------------------------------------------------------
Thanks [~bhulette] for rightfully moving this into its own issue - I'll
continue the discussion from
[BEAM-12955|https://issues.apache.org/jira/browse/BEAM-12955] to add some
context/color.
I agree that, on further reading/understanding of 12955, tf.train.Example
is a distinct use case from protos in general. It is true that TF Examples are
just really flexible protos, that depend on a [separate schema
proto|https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/schema.proto]
to define the structure for any given set of TF Examples. In the context of
TFX, most components that process TF Examples also require the schema as input,
probably for this reason.
I think it would be a pretty big benefit if we could find a way to make this
work at least somewhat seamlessly with the Beam dataframe API. I suspect this
could be done with either a dedicated reader, a flag on the existing reader, or
some kind of standardized map that uses the existing TFRecord reader and then
maps its output to a schema-aware PCollection, which can then be converted to a
dataframe. Alternatively, maybe there's a way to make TFX-BSL's pyarrow
RecordBatches qualify as a schema-aware PCollection and get the conversion
utilities "for free".
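To make the "standardized map" idea concrete, here's a minimal sketch of the per-record mapping step. Everything here is hypothetical: the field names, types, and the `features_to_row` helper are made up for illustration, and it assumes the parsed tf.train.Example has already been reduced to a plain Python dict of values:

```python
import typing

# Hypothetical row type matching this dataset's TF schema; the field
# names and types are invented for the example.
class ExampleRow(typing.NamedTuple):
    user_id: str
    score: float

def features_to_row(features):
    """Map one parsed feature dict (e.g. the output of
    tf.io.parse_single_example, already reduced to plain Python values)
    to a typed row so Beam can infer a schema for the PCollection."""
    return ExampleRow(
        user_id=str(features["user_id"]),
        score=float(features["score"]),
    )
```

In a pipeline this would sit between the existing reader and the dataframe conversion, roughly `ReadFromTFRecord(...) | beam.Map(parse) | beam.Map(features_to_row).with_output_types(ExampleRow)` followed by `to_dataframe(...)` — which assumes the NamedTuple-based schema inference picks up the mapped output, which is exactly the part I haven't managed to verify yet.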
My personal interest/use case is that at the company I currently work for, we
have largely standardized on TFX, which uses TFRecords as the data format and
Beam for many of the data processing operations (as provided by TFX), but the
Beam aspect is almost entirely abstracted away from the end user. Thus we are
currently in a state where users must choose one of three options:
# Users must make use of an existing TFX component to process their data
(transformation, model evaluation/metrics generation, etc), where many of the
API abstractions are either too limiting, or too difficult to adapt for more
complex use cases (and debugging beam code you didn't write is very difficult
when all you know is the abstraction)
# Users must write single-threaded, i.e. script-like, processing code, and
thus are limited on compute/memory
# Users must learn and express their data processing in another framework
(including beam as an example option)
We're currently stuck between options 1 and 2, as most of our users don't have
the time to learn Beam on top of their existing priorities. My hope was to use the
dataframe API to bridge the gap for #3 - users would only need to define
dataframe API operations, and my team could hopefully abstract away most of the
beam aspects, which would allow for much more scalable and flexible data
processing.
I took a hack week to explore this approach and ran into issues trying to map
my input dataset (tf records) to a schema-aware pcoll, which is how I ended up
here :) I attempted an approach similar to [this SO
post|https://stackoverflow.com/questions/68537184/how-to-unpack-dictionary-values-inside-a-beam-map-with-python-apache-beam],
and ran into the same issue, but unfortunately the original poster didn't
leave much context on how they solved their issue. In retrospect, I think the
true solution will involve some more advanced input processing with the TF
schema as input, as described above. If I have time to delve into this, I'll
share anything I'm able to get working, which may help push us towards a
working solution for this issue.
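On the "more advanced input processing with the TF schema as input" idea: if the schema proto is available, the row type wouldn't even have to be hand-written; it could be generated from the schema. A rough pure-Python sketch of that, using a simplified `{name: type-string}` stand-in for the real schema.proto (the actual Feature messages would need tensorflow_metadata, so the names here are illustrative only):

```python
import typing

# Simplified stand-in for the schema proto's feature types; the real
# schema.proto uses a FeatureType enum with BYTES/INT/FLOAT values.
_TYPE_MAP = {"BYTES": str, "INT": int, "FLOAT": float}

def row_type_from_schema(feature_types):
    """Build a NamedTuple row type from a {feature_name: type_string}
    mapping, so a PCollection of these rows is schema-aware."""
    fields = [(name, _TYPE_MAP[t]) for name, t in feature_types.items()]
    return typing.NamedTuple("SchemaRow", fields)

# Example: a two-feature schema produces a two-field row type that could
# be passed to .with_output_types(...) on the mapping step.
Row = row_type_from_schema({"user_id": "BYTES", "score": "FLOAT"})
```

This is only a sketch of the shape of the solution; the real work would be walking the schema proto's Feature messages (including valency/shape) rather than a flat dict.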
> Integrate TFRecord/tf.train.Example with Beam Schemas and the DataFrame API
> ---------------------------------------------------------------------------
>
> Key: BEAM-13150
> URL: https://issues.apache.org/jira/browse/BEAM-13150
> Project: Beam
> Issue Type: Improvement
> Components: dsl-dataframe, sdk-py-core
> Reporter: Brian Hulette
> Assignee: Brian Hulette
> Priority: P2
>
> See discussion in BEAM-12995