Thanks for all the feedback. Two possible ideas that occur to me: - Create TypeSafeSerializableCoder or somesuch which extends SerializableCoder and enforces the check as in the PR. Update the documentation to suggest using the new coder if you don't need to support collections as the raw type or subclasses. - Add a new "of" method to SerializableCoder which takes a boolean parameter to control whether we perform this check or not (defaults to true?).
Colm. On Tue, Mar 24, 2020 at 2:11 AM Luke Cwik <[email protected]> wrote: > I have seen people ingest data using SerializableCoder. > > On Mon, Mar 23, 2020 at 2:51 PM Kenneth Knowles <[email protected]> wrote: > >> I won't bring other people's words from private@, but can share mine. I >> don't believe it exposes anything new. >> >> > If it is SerializableCoder - attacker controls the other end of e.g. >> Kafka or Pubsub that is decoding w/ ObjectInputStream - [then we could have >> an allowlist or try to automatically construct an allowlist] and otherwise >> there is no vulnerability for internal coders. >> >> I have never seen or heard of a user doing dynamic deserialization >> dispatch on ingestion, but that doesn't mean it doesn't happen. If it is >> important to someone then they would need a more secure solution than >> SerializableCoder. >> >> Side note: it would be great to provide an efficient and usable solution >> for the problem of wanting to dynamically dispatch serde in the middle of a >> pipeline. It is actually independent from being able to provide coders for >> a wide variety of types, which we can do a bunch of different (mostly >> better) ways. (has a better solution been built since the last time I >> thought about this?) >> >> Kenn >> >> On Mon, Mar 23, 2020 at 1:36 PM Ismaël Mejía <[email protected]> wrote: >> >>> The link to the previous covnersation (discussion happened in private@ >>> and I suppose we can bring some relevant bits here if needed) >>> >>> https://lists.apache.org/thread.html/2e1c00999e992e15b08938866bfe7bd3c3d3b3d4d7aa2f8f6eb4600d%40%3Cprivate.beam.apache.org%3E >>> >>> I remember Robert had some points there, but I am not sure we >>> found/agreed on a solution that was relevant and did not break current >>> users and their user experience (like the case of blacklists). >>> >>> On Mon, Mar 23, 2020 at 9:22 PM Luke Cwik <[email protected]> wrote: >>> > >>> > Being able to have something that can encode any object (or at least a >>> large class of objects) is extremely powerful so requiring >>> SerializableCoder<T> to only encode T.class would hurt our users. >>> > >>> > I believe someone looked at this kind of problem before and we came to >>> an agreement of usng an explicit approve/deny list on the class names which >>> would address the security concern. I don't remember the thread though and >>> couldn't find the thread after a few minutes of searching. >>> > >>> > >>> > >>> > On Mon, Mar 23, 2020 at 1:07 PM Kenneth Knowles <[email protected]> >>> wrote: >>> >> >>> >> So you think the spec for SerializableCoder<T> (currently doesn't >>> really have one) should be that it dynamically dispatches what it >>> deserializes? I had imagined we would treat it more as a statically >>> determined coder, so because it is invariant in T we would not allow up or >>> down casts (they are unsafe). But we probably don't actually have the >>> static information to do that anyhow so you are probably right. >>> >> >>> >> I wonder about the threat model here. Is this the event that the >>> runner (managed service or bespoke cluster) is compromised and is >>> attempting RCE on the Java SDK harness or runner-specific Java-based worker? >>> >> >>> >> Kenn >>> >> >>> >> On Mon, Mar 23, 2020 at 8:09 AM Luke Cwik <[email protected]> wrote: >>> >>> >>> >>> I don't think this is going to work since SerializableCoder<T> >>> should be able to decode T and all objects that implement/extend T. I'm >>> pretty sure SerializableCoder<Set/List/...> is common enough while the >>> concrete type is HashSet/ArrayList/... >>> >>> I'm pretty sure there is some way you could come up with some way >>> for making this optin though. >>> >>> >>> >>> On Mon, Mar 23, 2020 at 12:19 AM Colm O hEigeartaigh < >>> [email protected]> wrote: >>> >>>> >>> >>>> Thanks Kenn. I submitted a PR here: >>> https://github.com/apache/beam/pull/11191 >>> >>>> >>> >>>> Colm. >>> >>>> >>> >>>> On Thu, Mar 19, 2020 at 8:13 PM Kenneth Knowles <[email protected]> >>> wrote: >>> >>>>> >>> >>>>> I think this is fine. The same coder is used for encode and >>> decode, so the Class object should be the same as well. Inheritance is not >>> part of the Beam model (thank goodness) so this is a language-specific >>> concern. As far as the model is concerned, the full URN and the payload of >>> the coder is its identity and coders with different identities have no >>> inheritance or compatibility relationship. Pipeline snapshot/update is an >>> edge case, but changing coder is not supported by any runner I know of, and >>> probably won't be until we have some rather large new ideas. >>> >>>>> >>> >>>>> Kenn >>> >>>>> >>> >>>>> On Thu, Mar 19, 2020 at 4:50 AM Colm O hEigeartaigh < >>> [email protected]> wrote: >>> >>>>>> >>> >>>>>> Hi, >>> >>>>>> >>> >>>>>> I have a question on SerializableCoder. I'm looking at hardening >>> the Java Object deserialization that is taking place. We have a "Class<T> >>> type" that is used to decode the input stream: >>> >>>>>> >>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream); >>> >>>>>> return type.cast(ois.readObject()); >>> >>>>>> >>> >>>>>> What I would like to do would be something like: >>> >>>>>> >>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream) { >>> >>>>>> @Override >>> >>>>>> protected Class<?> resolveClass(ObjectStreamClass desc) >>> throws IOException, ClassNotFoundException { >>> >>>>>> if (!desc.getName().equals(type.getName())) { >>> >>>>>> throw new InvalidClassException("Unauthorized >>> deserialization attempt", desc.getName()); >>> >>>>>> } >>> >>>>>> return super.resolveClass(desc); >>> >>>>>> } >>> >>>>>> }; >>> >>>>>> return type.cast(ois.readObject()); >>> >>>>>> >>> >>>>>> This would prevent a possible security hole where an attacker >>> could try to force the recipient of the input stream to deserialize to a >>> gadget class or the like for a RCE. >>> >>>>>> >>> >>>>>> The question is - does the deserialized type have to correspond >>> exactly to the supplied Class? Or is it supported that it's a base type / >>> abstract class? If the latter then my idea won't really work. But if the >>> type corresponds exactly then it should work OK. >>> >>>>>> >>> >>>>>> Thanks, >>> >>>>>> >>> >>>>>> Colm. >>> >>
