Eliaaazzz commented on PR #37352: URL: https://github.com/apache/beam/pull/37352#issuecomment-3771596092
@GlobalStar117 Thanks for the detailed analysis! You are absolutely correct about Java type erasure: new MyDoFn<String>() and new MyDoFn<Integer>() do share the same runtime Class object.

However, in Apache Beam, a DoFn's behavior isn't defined solely by its Class. We rely heavily on TypeDescriptor to handle serialization (Coders) and schema verification.

Why this fix is necessary: Even if the raw class is the same, users can override getInputTypeDescriptor() (or use other mechanisms that capture types) to provide different type information for the same raw DoFn class.

The Evidence: My regression test (testCacheKeyCollisionProof) explicitly creates two instances of the same DoFn class but forces them to return different TypeDescriptors. Without this fix, the factory returns the same cached Invoker for both.

The Consequence: If the first Invoker is generated and cached with logic specific to String, and is then reused in an Integer context (because the cache key ignored the TypeDescriptor), it can cause runtime issues, such as incorrect validation or a ClassCastException in downstream logic that relies on the Invoker's signature.

So, while type erasure applies to the user's class, the generated Invoker needs to be aware of the specific TypeDescriptor context to function correctly within the Beam pipeline. This PR ensures the cache key respects that distinction.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
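
To make the collision described in the comment concrete, here is a minimal, self-contained sketch. It does not use Beam's actual classes: MyFn, Invoker, CacheKey, and both lookup methods are hypothetical stand-ins, and a plain String stands in for a TypeDescriptor. It only illustrates the general idea of keying a cache by raw class alone versus by class plus type information, and should not be read as the PR's implementation.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

public class InvokerCacheSketch {

  /** Hypothetical stand-in for a DoFn that reports type information per instance. */
  static class MyFn {
    private final String inputType; // stands in for a TypeDescriptor (assumption)

    MyFn(String inputType) {
      this.inputType = inputType;
    }

    String getInputType() {
      return inputType;
    }
  }

  /** Hypothetical stand-in for a generated invoker specialized to one input type. */
  static class Invoker {
    final String builtFor;

    Invoker(String builtFor) {
      this.builtFor = builtFor;
    }
  }

  /** Composite key: raw class plus the type information reported by the instance. */
  static final class CacheKey {
    final Class<?> fnClass;
    final String inputType;

    CacheKey(Class<?> fnClass, String inputType) {
      this.fnClass = fnClass;
      this.inputType = inputType;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof CacheKey)) {
        return false;
      }
      CacheKey other = (CacheKey) o;
      return fnClass.equals(other.fnClass) && inputType.equals(other.inputType);
    }

    @Override
    public int hashCode() {
      return Objects.hash(fnClass, inputType);
    }
  }

  static final Map<Class<?>, Invoker> BY_CLASS = new ConcurrentHashMap<>();
  static final Map<CacheKey, Invoker> BY_CLASS_AND_TYPE = new ConcurrentHashMap<>();

  /** Keyed by raw class only: the first instance seen decides what every later one gets. */
  static Invoker lookupByClass(MyFn fn) {
    return BY_CLASS.computeIfAbsent(fn.getClass(), c -> new Invoker(fn.getInputType()));
  }

  /** Keyed by class and type information: instances with different types get different invokers. */
  static Invoker lookupByClassAndType(MyFn fn) {
    CacheKey key = new CacheKey(fn.getClass(), fn.getInputType());
    return BY_CLASS_AND_TYPE.computeIfAbsent(key, k -> new Invoker(k.inputType));
  }

  public static void main(String[] args) {
    MyFn stringFn = new MyFn("String");
    MyFn integerFn = new MyFn("Integer");

    // Same raw class, so the class-only cache hands the String invoker to the Integer instance.
    lookupByClass(stringFn);
    System.out.println("class-only key:      " + lookupByClass(integerFn).builtFor);

    // The composite key keeps the two contexts apart.
    lookupByClassAndType(stringFn);
    System.out.println("class-plus-type key: " + lookupByClassAndType(integerFn).builtFor);
  }
}
```

In this sketch, the class-only lookup reuses whatever was cached first, which mirrors the stale-Invoker scenario above, while the composite key yields one entry per distinct type. Whether the real factory's key is structured this way is an assumption here, not something the comment states.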
