On Fri, Jan 5, 2024 at 9:42 AM Joey Tran <joey.t...@schrodinger.com> wrote:
>
>
> I think my original message made it sound like what I thought was confusing 
> was how `Any` works. The scenario that I actually think is confusing is *if a 
> user registers a coder for a data type, this preference will get ignored in 
> non-obvious situations and can (and in my scenario, has) result in 
> non-obvious downstream issues.*


I agree this can be confusing. Essentially, Coders are attached to
PCollections (which are assumed to be of homogeneous type) at compile
time.

>
> On Fri, Jan 5, 2024 at 12:05 PM Robert Bradshaw via dev <dev@beam.apache.org> 
> wrote:
>>
>> On Fri, Jan 5, 2024 at 7:38 AM Joey Tran <joey.t...@schrodinger.com> wrote:
>>>
>>> I've been working with a few data types that are in practice unpicklable 
>>> and I've run into a couple issues stemming from the `Any` type hint, which 
>>> when used, will result in the PickleCoder getting used even if there's a 
>>> coder in the coder registry that matches the data element.
>>
>>
>> This is likely because we don't know the data type at the time we choose the 
>> coder.
>>
>>>
>>> This was pretty unexpected to me and can result in pretty cryptic 
>>> downstream issues. In the best case, you get an error at pickling time [1], 
>>> and in the worse case, the pickling "succeeds" (since many objects can get 
>>> (de)pickeld without obvious error) but then results in downstream issues 
>>> (e.g. some data doesn't survive depickling).
>>
>>
>> It shouldn't be the case that an object depickles successfully but 
>> incorrectly; sounds like a bug in some custom pickling code.
>
> You don't need custom pickling code for this to happen. For a contrived 
> example, you could imagine some class that caches some state specific to a 
> local system and saves it to a private local variable. If you pickle one of 
> these and then unpickle it on a different system, it would've been unpickled 
> successfully but would be in a bad state.
>
> Rather than mucking around with custom pickling, someone might want to just 
> implement a coder for their special class instead.


It's the same work in both cases, though I can see a coder being
preferable if one does not own the class. (Though copyreg should work
just as well.) In that case, I'd consider explicitly making the class
throw an exception on pickling (though I agree it's hard to see how
one could know to do this by default).

>>>
>>> One example case of the latter is if you flatten a few pcollections 
>>> including a pcollection generated by `beam.Create([])` (the inferred output 
>>> type an empty create becomes Any)
>>
>>
>> Can you add a type hint to the Create?
>
> Yeah this fixes the issue, it's just not obvious (or at least to me) that (1) 
> beam.Create([]) will have an output type of Any (often times the parameter to 
> beam.Create will be some local variable which makes it less obvious) and that


We could update Beam to let the type hint be the empty union, which
would correspond to a coder that can't encode/decode anything, but
when unioned with others (e.g. in a Flatten) does not "taint" the
rest. This doesn't solve unioning two other disjoint types resolving
to the Any coder though.

>
> (2) in this particular case, _how_ downstream pcollections get decoded will 
> be slightly different. In the worse case, the issue won't even result in an 
> error at decoding time (as mentioned before), so then you have to backtrack 
> from some possibly unrelated sounding traceback.
>
>>>
>>> Would it make sense to introduce a new fallback coder that takes precedence 
>>> over the `PickleCoder` that encodes both the data type (by just pickling 
>>> it) and the data encoded using the registry-found coder?
>>
>>
>> This is essentially re-implementing pickle :)
>
> Pickle doesn't use coders from the coder registry which I think is the key 
> distinction here

Pickle is just a dynamic dispatch of type -> encoder.

>>>
>>> This would have some space ramifications for storing the data type for 
>>> every element. Of course this coder would only kick in _if_ the data type 
>>> was found in the registry, otherwise we'd proceed to the picklecoder like 
>>> we do currently
>>
>>
>> I do not think we'd want to introduce this as the default--that'd likely 
>> make common cases much more expensive. IIRC you can manually override the 
>> fallback coder with one of your own choosing. Alternatively, you could look 
>> at using copyreg for your problematic types.
>>
>
> Ah you can indeed override the fallback coder. Okay I'll just do that for our 
> use of Beam.
>
> For sake of discussion though, I think it'd be a small-ish cost incurred once 
> per data type in the collection. The first time we run into a data type, we 
> see if it's the registry and if it is, use that coder, otherwise cache the 
> result (`types_not_in_registry: set`). All data types not in the registry 
> could then just be fast tracked to the picklecoder as before.

The risk is that if I register a coder for unpicklable type T and I
have a PCollection[T], this will be efficient, but if for any reason
Beam is not able to deduce I have a PCollection[T] it won't fall back
to being inefficient. I know of cases where types are explicitly made
unpickable to ensure we are using the proper coders.

> Oh actually, overriding the fallback coder doesn't actually do anything 
> because the issue is not with the fallback coders in the registry but the 
> fastprimitivescoder's fallback coder

Hmm... if the fallback coder is by default
FastPrimitives(PickleCoder), could you instead register the fallback
coder to be FastPrimitives(fallback=DynamicCoderLookup(fallback=PickleCoder))?

>>>
>>> [1] https://github.com/apache/beam/issues/29908 (Issue arises from 
>>> ReshuffleFromKey using `Any` as a pcollection type

Reply via email to