Johan Sternby created BEAM-12434:
------------------------------------
Summary: implement num_shard side_input to WriteToTFRecord
Key: BEAM-12434
URL: https://issues.apache.org/jira/browse/BEAM-12434
Project: Beam
Issue Type: Improvement
Components: io-py-tfrecord
Affects Versions: 2.29.0, 3.0.0, 2.30.0, 2.31.0, 2.32.0
Reporter: Johan Sternby
As concisely explained in
https://stackoverflow.com/questions/49156159/can-i-pass-side-inputs-to-apache-beam-ptransforms
the following pipeline:
EXAMPLES_PER_SHARD = 5.0
num_tfexamples = (
    tfexample_strs
    | "count tf examples" >> beam.combiners.Count.Globally())
num_shards = (
    num_tfexamples
    | "compute number of shards" >> beam.Map(
        lambda num_examples: int(math.ceil(num_examples / EXAMPLES_PER_SHARD))))
_ = (
    tfexample_strs
    | "output to tfrecords" >> beam.io.WriteToTFRecord(
        OUTPUT_DIR, num_shards=beam.pvalue.AsSingleton(num_shards)))
fails with:
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/iobase.py", line 1011, in start_bundle
    self.counter = random.randint(0, self.count - 1)
TypeError: unsupported operand type(s) for -: 'AsSingleton' and 'int' [while running 'output VALIDATION to tfrecords/Write/WriteImpl/ParDo(_RoundRobinKeyFn)']
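The WriteToTFRecord transform in the Apache Beam Python SDK does not currently
support passing a side input as num_shards: the AsSingleton wrapper is stored
as-is in _RoundRobinKeyFn and then used in integer arithmetic in start_bundle.
It can easily be solved by implementing _RoundRobinKeyFn a bit differently, so
that the shard count is passed to the ParDo as a side input and received in
process() instead of being fixed in the class constructor.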
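A minimal sketch of the proposed change, not the actual Beam code. To keep it runnable without Beam installed, RoundRobinKeyFn here is a hypothetical plain-Python stand-in for a DoFn. In Beam it would presumably subclass beam.DoFn and be applied as beam.ParDo(RoundRobinKeyFn(), count=beam.pvalue.AsSingleton(num_shards)), so that the runner passes the resolved integer to process(). The parameter name count is an assumption.

```python
import random


class RoundRobinKeyFn:
    """Hypothetical stand-in for a DoFn that keys elements round-robin.

    The shard count is NOT stored in __init__. It arrives as an argument
    to process(), which is where a Beam runner would deliver a resolved
    side-input value.
    """

    def start_bundle(self):
        # The count is not known yet, so defer choosing the start offset.
        self.counter = None

    def process(self, element, count):
        if self.counter is None:
            # First element of the bundle: pick a random starting shard.
            self.counter = random.randrange(count)
        # Advance to the next shard, wrapping around at count.
        self.counter = (self.counter + 1) % count
        yield self.counter, element


# Simulate one bundle of six elements with a shard count of 3.
fn = RoundRobinKeyFn()
fn.start_bundle()
keyed = [kv for e in "abcdef" for kv in fn.process(e, count=3)]
keys = [k for k, _ in keyed]
```

Because the start offset is chosen lazily inside process(), start_bundle() never touches the shard count, which is the operation that raises the TypeError in the traceback above.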
--
This message was sent by Atlassian Jira
(v8.3.4#803005)