If you absolutely cannot tolerate concurrency, an external locking mechanism
is required. While a distributed system often waits for a work item to fail
before retrying it, this is not always the case (e.g. backup workers may be
scheduled, and whichever finishes first is deemed the successful attempt).
Worse, the master may conclude a task has failed (e.g. due to loss of
contact) and re-assign the item to another worker when in fact the original
worker has not fully died (e.g. it merely lost network connectivity, or
entered a bad-but-not-fatal state in which a user-code thread continues on;
we call these zombie workers, and though they're uncommon they're nigh
impossible to rule out).
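To make the zombie problem concrete: plain mutual exclusion is not enough,
because a zombie may still hold (or believe it holds) the lock. A common
remedy is a time-limited lease plus a fencing token that the protected
resource itself checks, so writes from a stale lease are rejected. Here is a
rough in-process sketch of the idea (all names are illustrative; a real
deployment would back the lease with something like ZooKeeper or a database
row, not an in-memory object):

```python
import threading
import time


class LeaseLock:
    """Toy lease service: grants time-limited leases with monotonically
    increasing fencing tokens. A zombie whose lease has expired still
    holds an old token, so the resource below can reject its writes."""

    def __init__(self, ttl):
        self._ttl = ttl
        self._lock = threading.Lock()
        self._token = 0
        self._expires = 0.0

    def acquire(self):
        """Return a fencing token, or None if a live lease is held."""
        with self._lock:
            now = time.monotonic()
            if now < self._expires:
                return None  # someone else holds an unexpired lease
            self._token += 1
            self._expires = now + self._ttl
            return self._token


class FencedResource:
    """Resource that refuses writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_seen = 0
        self.writes = []

    def write(self, token, value):
        if token < self.highest_seen:
            return False  # stale token: a newer lease exists, reject
        self.highest_seen = token
        self.writes.append(value)
        return True


if __name__ == '__main__':
    lock = LeaseLock(ttl=0.05)
    store = FencedResource()

    t1 = lock.acquire()   # worker 1 takes the lease...
    time.sleep(0.06)      # ...then stalls past its TTL (a "zombie")
    t2 = lock.acquire()   # worker 2 acquires after expiry
    assert store.write(t2, 'w2')       # live worker succeeds
    assert not store.write(t1, 'w1')   # zombie's stale token is rejected
```

The key point is that the fencing check lives on the resource side, which is
what lets it defend against a worker that never noticed it lost the lease.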

On Mon, Jun 12, 2023 at 11:36 AM Bruno Volpato via dev <dev@beam.apache.org>
wrote:

> Hi Stephan,
>
> I am not sure if this is the best way to achieve this, but I've seen
> parallelism limited by using state / KV and limiting the number of keys.
> In your case, you could use the same key for both non-concurrency-safe
> operations; when using state, the Beam model guarantees that they aren't
> concurrently executed.
>
> This blog post may be helpful:
> https://beam.apache.org/blog/stateful-processing/
>
>
> On Mon, Jun 12, 2023 at 2:21 PM Stephan Hoyer via dev <dev@beam.apache.org>
> wrote:
>
>> Can the Beam data model (specifically the Python SDK) support executing
>> functions that are idempotent but not concurrency-safe?
>>
>> I am thinking of a task like setting up a database (or in my case, a Zarr
>> <https://zarr.dev/> store in Xarray-Beam
>> <https://github.com/google/xarray-beam>) where it is not safe to run
>> setup concurrently, but if the whole operation fails it is safe to retry.
>>
>> I recognize that a better model would be to use entirely atomic
>> operations, but sometimes this can be challenging to guarantee for tools
>> that were not designed with parallel computing in mind.
>>
>> Cheers,
>> Stephan
>>
>
