On Fri, Jan 5, 2024 at 7:38 AM Joey Tran <joey.t...@schrodinger.com> wrote:

> I've been working with a few data types that are in practice
> unpicklable and I've run into a couple issues stemming from the `Any` type
> hint, which when used, will result in the PickleCoder getting used even if
> there's a coder in the coder registry that matches the data element.
>

This is likely because we don't know the data type at the time we choose
the coder.


> This was pretty unexpected to me and can result in pretty cryptic
> downstream issues. In the best case, you get an error at pickling time [1],
> and in the worst case, the pickling "succeeds" (since many objects can be
> (de)pickled without obvious error) but then results in downstream issues
> (e.g. some data doesn't survive depickling).
>

It shouldn't be the case that an object depickles successfully but
incorrectly; sounds like a bug in some custom pickling code.

> One example case of the latter is if you flatten a few pcollections
> including a pcollection generated by `beam.Create([])` (the inferred output
> type of an empty Create becomes `Any`)
>

Can you add a type hint to the Create?


> Would it make sense to introduce a new fallback coder that takes
> precedence over the `PickleCoder` that encodes both the data type (by just
> pickling it) and the data encoded using the registry-found coder?
>

This is essentially re-implementing pickle :)
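
For concreteness, here is a rough pure-Python sketch of what I understand the proposal to be. It uses no Beam APIs; `REGISTRY`, `encode`, and `decode` are made-up names standing in for the coder registry and the proposed fallback coder, so treat it as an illustration of the idea and its per-element overhead rather than an implementation.

```python
import pickle
import struct

# Hypothetical stand-in for a coder registry: type -> (encode_fn, decode_fn).
REGISTRY = {
    complex: (
        lambda z: struct.pack(">dd", z.real, z.imag),
        lambda b: complex(*struct.unpack(">dd", b)),
    ),
}


def encode(value):
    """Tag each element with its pickled type, then use the registered encoder."""
    entry = REGISTRY.get(type(value))
    if entry is None:
        # No registered coder for this type: fall back to plain pickling.
        return b"\x00" + pickle.dumps(value)
    type_blob = pickle.dumps(type(value))  # stored with every element
    payload = entry[0](value)
    return b"\x01" + struct.pack(">I", len(type_blob)) + type_blob + payload


def decode(data):
    """Inverse of encode: read the tag, recover the type, pick the decoder."""
    tag, rest = data[:1], data[1:]
    if tag == b"\x00":
        return pickle.loads(rest)
    (type_len,) = struct.unpack(">I", rest[:4])
    value_type = pickle.loads(rest[4:4 + type_len])
    return REGISTRY[value_type][1](rest[4 + type_len:])
```

Note that the registered encoder only needs 16 bytes for a `complex`, but every encoded element also carries the pickled type and a length prefix, which is the space cost mentioned in the proposal.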


> This would have some space ramifications for storing the data type for
> every element. Of course this coder would only kick in _if_ the data type
> was found in the registry, otherwise we'd proceed to the PickleCoder like
> we do currently.
>

I do not think we'd want to introduce this as the default--that'd likely
make common cases much more expensive. IIRC you can manually override the
fallback coder with one of your own choosing. Alternatively, you could look
at using copyreg for your problematic types.
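
As a rough sketch of the copyreg route: the `Molecule` class and `_reduce_molecule` helper below are made up for illustration, and the only library call involved is the standard `copyreg.pickle`. The idea is to register a reduction function so that the standard pickle module serializes only the state you choose.

```python
import copyreg
import pickle
import threading


class Molecule:
    """Hypothetical type that is unpicklable by default."""

    def __init__(self, smiles):
        self.smiles = smiles
        self._lock = threading.Lock()  # lock objects cannot be pickled


def _reduce_molecule(mol):
    # Tell pickle to rebuild the object by calling Molecule(mol.smiles),
    # so the lock is recreated rather than serialized.
    return Molecule, (mol.smiles,)


# Register the reduction function in copyreg's global dispatch table.
copyreg.pickle(Molecule, _reduce_molecule)

restored = pickle.loads(pickle.dumps(Molecule("CCO")))
```

This only helps where serialization actually goes through the standard pickle machinery, and the registration has to run in every process that pickles the type; I have not checked it against every pickler Beam can be configured with.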


> [1] https://github.com/apache/beam/issues/29908 (Issue arises from
> ReshuffleFromKey using `Any` as a pcollection type)
>
