[
https://issues.apache.org/jira/browse/BEAM-7014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123760#comment-17123760
]
Dmytro Sadovnychyi commented on BEAM-7014:
------------------------------------------
We have a slightly related issue where the pipeline hangs indefinitely at:
{code:java}
Operation ongoing for over 303.35 seconds in state finish-msecs in step Convert html->docx gs:--lawinsider-data-runs-20180819-v9.parsed@128#0/html->docx/CachedMap(_HTMLToType)/FlushStorage(gs://lawinsider-temp/cdn.lawinsider.com/docx/__index__)/Batch to flush/ParDo(_GlobalWindowsBatchingDoFn)-out0/Write . Current Traceback:
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/start.py", line 144, in <module>
    main()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/start.py", line 140, in main
    batchworker.BatchWorker(properties, sdk_pipeline_options).run()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 843, in run
    deferred_exception_details=deferred_exception_details)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 647, in do_work
    work_executor.execute()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 178, in execute
    op.finish()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/nativeavroio.py", line 309, in __exit__
    self._data_file_writer.fo.close()
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/filesystemio.py", line 219, in close
    self._uploader.finish()
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 612, in finish
    self._upload_thread.join()
  File "/usr/local/lib/python3.7/threading.py", line 1044, in join
    self._wait_for_tstate_lock()
  File "/usr/local/lib/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
{code}
This happens only on Dataflow. We can reproduce it with our specific data and
code, but so far we haven't been able to reduce it to a small reproducible
example. It happens when Dataflow tries to write something like this:
{code:java}
gs://PROJECTID-temp/dataflow/REDACTED-20200602-1955.1591098931.320068/dax-tmp-2020-06-02_04_55_56-10640088201905907301-S02-0-a5e8dcb50cb1ab4e/tmp-a5e8dcb50cb1aa4f-shard--try-7f97ef4b6c397250-endshard.avro
{code}
(that file isn't there after the hang)
There's currently a
[TODO|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/gcsio.py#L610-L611]
near that piece of code that, once addressed, would prevent the pipeline from
hanging. In this case the pipeline would fail instead, which is still better.
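To illustrate the "fail instead of hang" behaviour, here is a minimal sketch of a bounded join. It is not the Beam implementation: the helper name `join_or_fail`, the timeout value, and the choice of `RuntimeError` are all assumptions made for illustration. The only thing taken from the traceback above is that the hang sits in an unbounded `Thread.join()`.

```python
import threading


def join_or_fail(thread, timeout_seconds):
    """Wait for `thread` up to `timeout_seconds`, then raise if it is still running.

    Hypothetical helper: it stands in for the unbounded
    `self._upload_thread.join()` call seen in the traceback.
    """
    thread.join(timeout_seconds)
    # join() returns None whether or not the timeout expired, so the
    # thread's liveness has to be checked explicitly afterwards.
    if thread.is_alive():
        raise RuntimeError(
            "upload thread still running after %.1f seconds" % timeout_seconds)
```

With a helper like this, a stuck upload surfaces as an exception in the caller rather than blocking the worker indefinitely. The stuck thread itself keeps running, so this turns a hang into a failure without cleaning up the underlying upload.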
The issue went away once we added `multiprocessing.set_start_method('spawn',
force=True)` (we spawn some child processes to isolate the execution environment
from memory errors or hangs in third-party code).
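For anyone wanting to try the same workaround, here is a rough sketch of how the start method can be forced and a piece of work isolated in a child process. Only the `set_start_method('spawn', force=True)` call comes from our code; the names `configure_start_method`, `_double` and `run_in_child`, and the 30 second timeout, are hypothetical and chosen for illustration. One possible explanation for why it helps, not confirmed here, is that a forked child inherits a copy of the parent's memory, including locks held by other threads at fork time, while a spawned child starts from a fresh interpreter.

```python
import multiprocessing


def configure_start_method():
    """Force the 'spawn' start method and return the active method name."""
    multiprocessing.set_start_method("spawn", force=True)
    return multiprocessing.get_start_method()


def _double(conn, value):
    # Hypothetical stand-in for third-party code that is run in isolation.
    conn.send(value * 2)
    conn.close()


def run_in_child(value, timeout_seconds=30.0):
    """Run _double in a child process; raise if it does not answer in time."""
    parent_conn, child_conn = multiprocessing.Pipe(duplex=False)
    proc = multiprocessing.Process(target=_double, args=(child_conn, value))
    proc.start()
    child_conn.close()  # the parent keeps only the receiving end
    try:
        if not parent_conn.poll(timeout_seconds):
            raise TimeoutError("child did not respond in time")
        return parent_conn.recv()
    finally:
        # Make sure a hung child never outlives the call.
        if proc.is_alive():
            proc.terminate()
        proc.join()
```

With 'spawn', the child re-imports the main module, so `configure_start_method()` and any call to `run_in_child(...)` should sit under an `if __name__ == "__main__":` guard or inside a function, not at module top level.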
> Flake in gcsio.py / filesystemio.py - NotImplementedError: offset: 0, whence: 0
> -------------------------------------------------------------------------------
>
> Key: BEAM-7014
> URL: https://issues.apache.org/jira/browse/BEAM-7014
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Valentyn Tymofieiev
> Assignee: Chamikara Madhusanka Jayalath
> Priority: P2
> Labels: stale-assigned
>
> The flake was observed in Precommit Direct Runner IT (wordcount).
> Full log output: https://pastebin.com/raw/DP5J7Uch.
> {noformat}
> Traceback (most recent call last):
> 08:42:57 File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/io/gcp/gcsio.py", line 583, in _start_upload
> 08:42:57 self._client.objects.Insert(self._insert_request, upload=self._upload)
> 08:42:57 File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/io/gcp/internal/clients/storage/storage_v1_client.py", line 1154, in Insert
> 08:42:57 upload=upload, upload_config=upload_config)
> 08:42:57 File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 715, in _RunMethod
> 08:42:57 http_request, client=self.client)
> 08:42:57 File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/transfer.py", line 885, in InitializeUpload
> 08:42:57 return self.StreamInChunks()
> 08:42:57 File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/transfer.py", line 997, in StreamInChunks
> 08:42:57 additional_headers=additional_headers)
> 08:42:57 File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/transfer.py", line 948, in __StreamMedia
> 08:42:57 self.RefreshResumableUploadState()
> 08:42:57 File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/transfer.py", line 850, in RefreshResumableUploadState
> 08:42:57 self.stream.seek(self.progress)
> 08:42:57 File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/io/filesystemio.py", line 269, in seek
> 08:42:57 offset, whence, self.position, self.last_position))
> 08:42:57 NotImplementedError: offset: 0, whence: 0, position: 48944, last: 0
> {noformat}
> [~chamikara] Might have context to triage this.