I am currently working on normalizing TFRecordIO <https://github.com/apache/beam/issues/28692> and ran into a cyclic dependency issue when converting the byte array to Beam Rows via the AvroUtils library. We have come up with a few other options and would like to figure out which is best, or maybe there is a better one.
Some quick background: the Java version doesn't allow passing in a coder, while the Python one does.

Current options:
1. Give the user an option to pass in a coder. Benefits/Cons: more effort for Java users and would require changes in TFRecordIO, but would normalize coder usage across both SDKs.
2. Just return the bytes as a Row with a single byte-array field (see the sketch after this list). Benefits/Cons: should be the simplest to implement.
3. Pass in a preset of coders. Benefits/Cons: no effort for users, but no flexibility in choosing a coder.
4. Don't do the Java normalization and only do it in Python. Benefits/Cons: the RunInference library is Python-only, so most use cases will be in Python anyway.

Slightly leaning toward Option 2, but any opinions or other ideas here?

Thanks,
Derrick
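To make Option 2 concrete, here is a minimal Java sketch of what the single-field Row could look like. This is purely illustrative; the schema, the "record" field name, and the helper class are assumptions on my part, not anything that exists in TFRecordIO today.

    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.values.Row;

    // Hypothetical sketch for Option 2: wrap the raw TFRecord bytes in a
    // single-field Beam Row so the output has a fixed, trivial schema.
    public class TfRecordRowSketch {
      // One BYTES field holding the raw record payload (field name is illustrative).
      static final Schema TFRECORD_ROW_SCHEMA =
          Schema.builder().addByteArrayField("record").build();

      // Convert the bytes read by TFRecordIO into a Row with that schema.
      static Row toRow(byte[] recordBytes) {
        return Row.withSchema(TFRECORD_ROW_SCHEMA)
            .withFieldValue("record", recordBytes)
            .build();
      }
    }

The upside is that users who want a richer schema can still apply their own parsing transform downstream, so no coder plumbing is needed in TFRecordIO itself.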