[ 
https://issues.apache.org/jira/browse/BEAM-7014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123760#comment-17123760
 ] 

Dmytro Sadovnychyi commented on BEAM-7014:
------------------------------------------

We have a slightly related issue where the pipeline hangs indefinitely at:


{code:java}
Operation ongoing for over 303.35 seconds in state finish-msecs in step Convert html->docx gs:--lawinsider-data-runs-20180819-v9.parsed@128#0/html->docx/CachedMap(_HTMLToType)/FlushStorage(gs://lawinsider-temp/cdn.lawinsider.com/docx/__index__)/Batch to flush/ParDo(_GlobalWindowsBatchingDoFn)-out0/Write . Current Traceback:
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/start.py", line 144, in <module>
    main()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/start.py", line 140, in main
    batchworker.BatchWorker(properties, sdk_pipeline_options).run()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 843, in run
    deferred_exception_details=deferred_exception_details)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 647, in do_work
    work_executor.execute()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 178, in execute
    op.finish()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/nativeavroio.py", line 309, in __exit__
    self._data_file_writer.fo.close()
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/filesystemio.py", line 219, in close
    self._uploader.finish()
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 612, in finish
    self._upload_thread.join()
  File "/usr/local/lib/python3.7/threading.py", line 1044, in join
    self._wait_for_tstate_lock()
  File "/usr/local/lib/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
{code}

This happens only on Dataflow. We can reproduce it with our specific data and 
code, but so far we haven't been able to reduce it to a small reproducible 
example. It happens when Dataflow tries to write a file like this:

{code:java}
gs://PROJECTID-temp/dataflow/REDACTED-20200602-1955.1591098931.320068/dax-tmp-2020-06-02_04_55_56-10640088201905907301-S02-0-a5e8dcb50cb1ab4e/tmp-a5e8dcb50cb1aa4f-shard--try-7f97ef4b6c397250-endshard.avro
{code}

(that file isn't there after the hang)

There's currently a 
[TODO|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/gcsio.py#L610-L611]
 near that piece of code. Implementing it would prevent the pipeline from 
hanging, although in this case the pipeline would fail instead (which is still 
better).
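
To illustrate the kind of change I mean, here is a minimal sketch. It assumes the fix amounts to joining the upload thread with a timeout and raising if the thread is still alive. The helper name `join_or_raise`, the thread names, and the timeout values are hypothetical; this is not Beam's actual code.

```python
import threading


def join_or_raise(thread, timeout_seconds):
    """Wait for `thread` up to `timeout_seconds`; raise if it is still running."""
    thread.join(timeout=timeout_seconds)
    if thread.is_alive():
        raise TimeoutError(
            'thread %r did not finish within %.1f seconds'
            % (thread.name, timeout_seconds))


# A worker that finishes quickly joins normally.
fast = threading.Thread(target=lambda: None, name='fast-upload')
fast.start()
join_or_raise(fast, timeout_seconds=5.0)

# A worker that never finishes on its own becomes an error instead of a hang.
release = threading.Event()
stuck = threading.Thread(target=release.wait, name='stuck-upload', daemon=True)
stuck.start()
try:
    join_or_raise(stuck, timeout_seconds=0.1)
    timed_out = False
except TimeoutError:
    timed_out = True
finally:
    release.set()  # let the stuck thread exit so the sketch cleans up
```

With a bounded join the caller gets an exception it can report or retry, rather than blocking forever in `_wait_for_tstate_lock`.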

The issue went away once we added `multiprocessing.set_start_method('spawn', 
force=True)` (we spawn some child processes to isolate the execution 
environment from memory errors or hangs in third-party code).
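
For anyone who wants to try the same workaround, here is a rough sketch of the pattern. Only the `set_start_method` call comes from our code; the helper `run_isolated`, the timeouts, and the use of `time.sleep` as a stand-in for third-party code are assumptions made for illustration.

```python
import multiprocessing
import time

# 'spawn' starts each child in a fresh interpreter instead of fork()ing a copy
# of the parent. Why this resolves the hang described above is not established.
multiprocessing.set_start_method('spawn', force=True)


def run_isolated(target, args, timeout_seconds):
    """Run target(*args) in a child process.

    Returns the child's exit code, or None if it had to be terminated
    because it was still running after `timeout_seconds`.
    """
    child = multiprocessing.Process(target=target, args=args)
    child.start()
    child.join(timeout_seconds)
    if child.is_alive():
        child.terminate()
        child.join()
        return None
    return child.exitcode


if __name__ == '__main__':
    # time.sleep stands in for third-party code that may crash or hang.
    print(run_isolated(time.sleep, (0.1,), timeout_seconds=5.0))
    print(run_isolated(time.sleep, (60,), timeout_seconds=0.5))
```

Note that with `spawn` the target and its arguments must be picklable, and any code that starts processes should sit under the `if __name__ == '__main__':` guard, because the child re-imports the main module.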

> Flake in gcsio.py / filesystemio.py - NotImplementedError: offset: 0, whence: 0
> -------------------------------------------------------------------------------
>
>                 Key: BEAM-7014
>                 URL: https://issues.apache.org/jira/browse/BEAM-7014
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>            Reporter: Valentyn Tymofieiev
>            Assignee: Chamikara Madhusanka Jayalath
>            Priority: P2
>              Labels: stale-assigned
>
> The flake was observed in Precommit Direct Runner IT (wordcount).
> Full log output: https://pastebin.com/raw/DP5J7Uch.
> {noformat}
> Traceback (most recent call last):
> 08:42:57   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/io/gcp/gcsio.py", line 583, in _start_upload
> 08:42:57     self._client.objects.Insert(self._insert_request, upload=self._upload)
> 08:42:57   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/io/gcp/internal/clients/storage/storage_v1_client.py", line 1154, in Insert
> 08:42:57     upload=upload, upload_config=upload_config)
> 08:42:57   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 715, in _RunMethod
> 08:42:57     http_request, client=self.client)
> 08:42:57   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/transfer.py", line 885, in InitializeUpload
> 08:42:57     return self.StreamInChunks()
> 08:42:57   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/transfer.py", line 997, in StreamInChunks
> 08:42:57     additional_headers=additional_headers)
> 08:42:57   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/transfer.py", line 948, in __StreamMedia
> 08:42:57     self.RefreshResumableUploadState()
> 08:42:57   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/apitools/base/py/transfer.py", line 850, in RefreshResumableUploadState
> 08:42:57     self.stream.seek(self.progress)
> 08:42:57   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify_PR/src/sdks/python/apache_beam/io/filesystemio.py", line 269, in seek
> 08:42:57     offset, whence, self.position, self.last_position))
> 08:42:57 NotImplementedError: offset: 0, whence: 0, position: 48944, last: 0
> {noformat}
> [~chamikara] Might have context to triage this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
