Thanks for looking into this! I think I like option (2) for the base transform, since it lets us normalize across languages and get this added with the least effort, and it doesn't stop us from adding (1) or (3) in the future (though that may eventually require some more complex forking depending on which coders each version of the IO supports).
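For concreteness, here's a rough sketch of what the (2) shape could look like on the Java side. The schema and the `record` field name are just placeholders, not a settled API:

```java
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.Row;

class TfRecordRows {
  // Hypothetical single-field schema for option (2): the raw TFRecord
  // payload is carried as one byte-array field ("record" is a placeholder).
  static final Schema TF_RECORD_SCHEMA =
      Schema.builder().addByteArrayField("record").build();

  // Each element read by the normalized TFRecordIO would be wrapped as-is.
  static Row wrap(byte[] recordBytes) {
    return Row.withSchema(TF_RECORD_SCHEMA).addValue(recordBytes).build();
  }
}
```

Keeping it to a single bytes field also means no coder decision leaks into the base transform; interpretation of the bytes stays entirely with the user.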
If we do (2), we could also eventually consider adding something like a `DecodeFromBytes` transform which maps the byte string to a known type (today this is doable with a plain map; a rough sketch follows below the quoted thread).

Thanks,
Danny

On Mon, Feb 24, 2025 at 1:20 PM Derrick Williams via dev <dev@beam.apache.org> wrote:

> I am currently working on normalizing TFRecordIO
> <https://github.com/apache/beam/issues/28692> and ran into a cyclic
> dependency issue when converting the byte array to Beam Rows via the
> AvroUtils library. We have come up with a few other options and would
> like to figure out which is best, or whether there is a better one.
>
> Some quick background: the Java version doesn't allow passing in a coder,
> while the Python one does.
>
> Current options:
>
> 1. Give the user an option to pass in a coder.
> Benefits/Cons: More effort for Java users and would require changes to
> TFRecordIO, but would normalize coder usage across both SDKs.
>
> 2. Just return the byte string as a Row with a single field of
> byte-string type.
> Benefits/Cons: Should be simpler to implement.
>
> 3. Pass in a preset list of coders.
> Benefits/Cons: No effort for users, but no flexibility in choosing a
> coder.
>
> 4. Skip Java normalization and only do it in Python.
> Benefits/Cons: The RunInference library has to be used from Python, so
> most use cases will be in Python anyway.
>
> Slightly leaning toward option 2, but any opinions or other ideas here?
>
> Thanks,
> Derrick
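For the `DecodeFromBytes` idea, here's roughly what the map-based workaround looks like today, decoding the single-byte-field rows from the sketch above. The target type (UTF-8 strings) and field name are illustrative only; a real `DecodeFromBytes` would take the target type and the decode function as parameters:

```java
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TypeDescriptors;

class DecodeFromBytesSketch {
  // Illustrative only: turn option (2)'s single-byte-field rows back into
  // a concrete type with a plain MapElements, which is what users would
  // write by hand today.
  static PCollection<String> decode(PCollection<Row> rows) {
    return rows.apply(
        "DecodeFromBytes",
        MapElements.into(TypeDescriptors.strings())
            .via(row ->
                new String(row.getBytes("record"), StandardCharsets.UTF_8)));
  }
}
```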