[
https://issues.apache.org/jira/browse/BEAM-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149553#comment-17149553
]
Yichi Zhang commented on BEAM-10002:
------------------------------------
Thanks [~corvin] for tackling this. Your proposal sounds good to me.
Alternatively, I think you could add an extra option,
maximum_split_bundle_size, to ReadFromMongo and ensure that
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py#L222
does not set a size larger than that maximum when querying the split vector
from MongoDB.
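
To make the suggestion concrete, here is a minimal sketch of the clamping step.
The helper name clamp_bundle_size_mb is hypothetical and not part of
mongodbio.py, and maximum_split_bundle_size is only the option proposed above,
not an existing ReadFromMongo parameter. The sketch also assumes the split size
is handed to MongoDB in megabytes.

```python
from typing import Optional

_BYTES_PER_MB = 1024 * 1024


def clamp_bundle_size_mb(
    desired_bundle_size_bytes: int,
    maximum_split_bundle_size_mb: Optional[int] = None,
) -> int:
    """Return the chunk size (in MB) to request when splitting.

    Hypothetical helper: caps the runner's desired bundle size at the
    user-supplied maximum, and never returns less than 1 MB.
    """
    # Convert the desired size from bytes to whole megabytes (at least 1).
    desired_mb = max(desired_bundle_size_bytes // _BYTES_PER_MB, 1)
    if maximum_split_bundle_size_mb is None:
        # No cap configured: keep the desired size unchanged.
        return desired_mb
    # Apply the cap, but keep the result at 1 MB or more.
    return max(min(desired_mb, maximum_split_bundle_size_mb), 1)
```

With this in place, the value passed to the split query would be the clamped
size rather than the raw desired bundle size.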
Your approach has the advantage of avoiding the problem without user action,
but in the worst case a single slow worker could keep retrying the cursor
without its load being split any further.
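
For reference, a retry-based reader could look roughly like the sketch below.
This is only my guess at the general shape, not necessarily what you have in
mind. The names iterate_with_resume and open_cursor are hypothetical, and it
assumes documents are read in ascending _id order so that a fresh cursor can
resume after the last document seen.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional, Type


def iterate_with_resume(
    open_cursor: Callable[[Optional[Any]], Iterable[Dict[str, Any]]],
    retryable_error: Type[Exception],
    max_retries: int = 3,
) -> Iterator[Dict[str, Any]]:
    """Yield documents, reopening the cursor when it is lost.

    open_cursor(last_id) is a hypothetical callable that returns documents
    with _id greater than last_id (or all documents when last_id is None),
    sorted by ascending _id.
    """
    last_id = None
    retries = 0
    while True:
        try:
            for doc in open_cursor(last_id):
                # Remember progress so a new cursor can resume from here.
                last_id = doc["_id"]
                retries = 0  # progress was made, so reset the retry budget
                yield doc
            return  # cursor exhausted normally
        except retryable_error:
            retries += 1
            if retries > max_retries:
                raise
```

With pymongo, retryable_error would be pymongo.errors.CursorNotFound and
open_cursor would wrap a find() on the bundle's _id range sorted by _id. Note
that the whole remaining range stays on the same worker, which is the
worst-case concern mentioned above.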
The alternative approach requires user tuning (if the issue occurs) but has the
advantage of ensuring parallelism and timely processing of each bundle in
advance. (I believe that if the max value is small enough, MongoDB will split
the collection at each document, so each cursor will access only one document,
provided each document is large enough.)
Since I haven't seen the issue before myself, I think you're probably in a
better position to decide which works better for the issue.
> Mongo cursor timeout leads to CursorNotFound error
> --------------------------------------------------
>
> Key: BEAM-10002
> URL: https://issues.apache.org/jira/browse/BEAM-10002
> Project: Beam
> Issue Type: Bug
> Components: io-py-mongodb
> Affects Versions: 2.20.0
> Reporter: Corvin Deboeser
> Assignee: Corvin Deboeser
> Priority: P2
>
> If some work items take a long time to process and the cursor of a bundle
> is not queried for too long, MongoDB will time out the cursor, which
> results in
> {code:java}
> pymongo.errors.CursorNotFound: cursor id ... not found
> {code}
>