damccorm opened a new issue, #21180:
URL: https://github.com/apache/beam/issues/21180
In some cases when we have some big individual element (e.g. after a
Combine) and given a combination of side-inputs afterwards Dataflow might
decide to materialize PCollection into temporary storage (on GCS using avrofile
with simple "bytes" schema) – this process fails if our element is more than
2GB in size.
Job ID: 2021-09-15_02_14_12-14362936082336076824
Stacktrace:
```
Error message from worker: Traceback (most recent call last): File
"/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line
651, in do_work work_executor.execute() File
"/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 181,
in execute op.finish() File "dataflow_worker/native_operations.py", line 93, in
dataflow_worker.native_operations.NativeWriteOperation.finish File
"dataflow_worker/native_operations.py", line 94, in
dataflow_worker.native_operations.NativeWriteOperation.finish File
"dataflow_worker/native_operations.py", line 95, in
dataflow_worker.native_operations.NativeWriteOperation.finish File
"/usr/local/lib/python3.7/site-packages/dataflow_worker/nativeavroio.py", line
308, in __exit__ self._data_file_writer.flush() File "fastavro/_write.pyx",
line 664, in fastavro._write.Writer.flush File "fastavro/_write.pyx", line 639,
in fastavro._write.Writer.dump File "fastavro/_write.pyx", line 451, in
fastavro._write.snappy_write_bloc
k File "fastavro/_write.pyx", line 458, in fastavro._write.snappy_write_block
File "/usr/local/lib/python3.7/site-packages/apache_beam/io/filesystemio.py",
line 200, in write self._uploader.put(b) File
"/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 720,
in put self._conn.send_bytes(data.tobytes()) File
"/usr/local/lib/python3.7/multiprocessing/connection.py", line 200, in
send_bytes self._send_bytes(m[offset:offset **** size]) File
"/usr/local/lib/python3.7/multiprocessing/connection.py", line 393, in
_send_bytes header = struct.pack("!i", n) struct.error: 'i' format requires
-2147483648 <= number <= 2147483647
```
This can be solved via a Reshuffle which forces it to be materialized on a
shuffling service instead, which doesn't have this limitation.
Imported from Jira
[BEAM-12900](https://issues.apache.org/jira/browse/BEAM-12900). Original Jira
may contain additional context.
Reported by: sadovnychyi.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]