Luke, can a similar approach be used in classic runners? What would we need to change to achieve it?
On Thu, Mar 26, 2020 at 4:06 PM Luke Cwik <[email protected]> wrote: > > From the private@ thread, I suggested: > "With the JvmInitializer[1] being supported by Dataflow and the portable Java > container, users would be able to write code which sets the system property > jdk.serialFilter or by configuring > ObjectInputFilter.Config.setSerialFilter(filter)[2]" > > This could become a documentation change to SerializableCoder. > > 1: > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/harness/JvmInitializer.java > 2: > https://docs.oracle.com/javase/10/core/serialization-filtering1.htm#JSCOR-GUID-952E2328-AB66-4412-8B6B-3BCCB3195C25 > > On Thu, Mar 26, 2020 at 4:51 AM Colm O hEigeartaigh <[email protected]> > wrote: >> >> Thanks for all the feedback. Two possible ideas that occur to me: >> >> - Create TypeSafeSerializableCoder or somesuch which extends >> SerializableCoder and enforces the check as in the PR. Update the >> documentation to suggest using the new coder if you don't need to support >> collections as the raw type or subclasses. >> - Add a new "of" method to SerializableCoder which takes a boolean >> parameter to control whether we perform this check or not (defaults to >> true?). >> >> Colm. >> >> On Tue, Mar 24, 2020 at 2:11 AM Luke Cwik <[email protected]> wrote: >>> >>> I have seen people ingest data using SerializableCoder. >>> >>> On Mon, Mar 23, 2020 at 2:51 PM Kenneth Knowles <[email protected]> wrote: >>>> >>>> I won't bring other people's words from private@, but can share mine. I >>>> don't believe it exposes anything new. >>>> >>>> > If it is SerializableCoder - attacker controls the other end of e.g. >>>> > Kafka or Pubsub that is decoding w/ ObjectInputStream - [then we could >>>> > have an allowlist or try to automatically construct an allowlist] and >>>> > otherwise there is no vulnerability for internal coders. >>>> >>>> I have never seen or heard of a user doing dynamic deserialization >>>> dispatch on ingestion, but that doesn't mean it doesn't happen. If it is >>>> important to someone then they would need a more secure solution than >>>> SerializableCoder. >>>> >>>> Side note: it would be great to provide an efficient and usable solution >>>> for the problem of wanting to dynamically dispatch serde in the middle of >>>> a pipeline. It is actually independent from being able to provide coders >>>> for a wide variety of types, which we can do a bunch of different (mostly >>>> better) ways. (has a better solution been built since the last time I >>>> thought about this?) >>>> >>>> Kenn >>>> >>>> On Mon, Mar 23, 2020 at 1:36 PM Ismaël Mejía <[email protected]> wrote: >>>>> >>>>> The link to the previous covnersation (discussion happened in private@ >>>>> and I suppose we can bring some relevant bits here if needed) >>>>> https://lists.apache.org/thread.html/2e1c00999e992e15b08938866bfe7bd3c3d3b3d4d7aa2f8f6eb4600d%40%3Cprivate.beam.apache.org%3E >>>>> >>>>> I remember Robert had some points there, but I am not sure we >>>>> found/agreed on a solution that was relevant and did not break current >>>>> users and their user experience (like the case of blacklists). >>>>> >>>>> On Mon, Mar 23, 2020 at 9:22 PM Luke Cwik <[email protected]> wrote: >>>>> > >>>>> > Being able to have something that can encode any object (or at least a >>>>> > large class of objects) is extremely powerful so requiring >>>>> > SerializableCoder<T> to only encode T.class would hurt our users. >>>>> > >>>>> > I believe someone looked at this kind of problem before and we came to >>>>> > an agreement of usng an explicit approve/deny list on the class names >>>>> > which would address the security concern. I don't remember the thread >>>>> > though and couldn't find the thread after a few minutes of searching. >>>>> > >>>>> > >>>>> > >>>>> > On Mon, Mar 23, 2020 at 1:07 PM Kenneth Knowles <[email protected]> wrote: >>>>> >> >>>>> >> So you think the spec for SerializableCoder<T> (currently doesn't >>>>> >> really have one) should be that it dynamically dispatches what it >>>>> >> deserializes? I had imagined we would treat it more as a statically >>>>> >> determined coder, so because it is invariant in T we would not allow >>>>> >> up or down casts (they are unsafe). But we probably don't actually >>>>> >> have the static information to do that anyhow so you are probably >>>>> >> right. >>>>> >> >>>>> >> I wonder about the threat model here. Is this the event that the >>>>> >> runner (managed service or bespoke cluster) is compromised and is >>>>> >> attempting RCE on the Java SDK harness or runner-specific Java-based >>>>> >> worker? >>>>> >> >>>>> >> Kenn >>>>> >> >>>>> >> On Mon, Mar 23, 2020 at 8:09 AM Luke Cwik <[email protected]> wrote: >>>>> >>> >>>>> >>> I don't think this is going to work since SerializableCoder<T> should >>>>> >>> be able to decode T and all objects that implement/extend T. I'm >>>>> >>> pretty sure SerializableCoder<Set/List/...> is common enough while >>>>> >>> the concrete type is HashSet/ArrayList/... >>>>> >>> I'm pretty sure there is some way you could come up with some way for >>>>> >>> making this optin though. >>>>> >>> >>>>> >>> On Mon, Mar 23, 2020 at 12:19 AM Colm O hEigeartaigh >>>>> >>> <[email protected]> wrote: >>>>> >>>> >>>>> >>>> Thanks Kenn. I submitted a PR here: >>>>> >>>> https://github.com/apache/beam/pull/11191 >>>>> >>>> >>>>> >>>> Colm. >>>>> >>>> >>>>> >>>> On Thu, Mar 19, 2020 at 8:13 PM Kenneth Knowles <[email protected]> >>>>> >>>> wrote: >>>>> >>>>> >>>>> >>>>> I think this is fine. The same coder is used for encode and decode, >>>>> >>>>> so the Class object should be the same as well. Inheritance is not >>>>> >>>>> part of the Beam model (thank goodness) so this is a >>>>> >>>>> language-specific concern. As far as the model is concerned, the >>>>> >>>>> full URN and the payload of the coder is its identity and coders >>>>> >>>>> with different identities have no inheritance or compatibility >>>>> >>>>> relationship. Pipeline snapshot/update is an edge case, but >>>>> >>>>> changing coder is not supported by any runner I know of, and >>>>> >>>>> probably won't be until we have some rather large new ideas. >>>>> >>>>> >>>>> >>>>> Kenn >>>>> >>>>> >>>>> >>>>> On Thu, Mar 19, 2020 at 4:50 AM Colm O hEigeartaigh >>>>> >>>>> <[email protected]> wrote: >>>>> >>>>>> >>>>> >>>>>> Hi, >>>>> >>>>>> >>>>> >>>>>> I have a question on SerializableCoder. I'm looking at hardening >>>>> >>>>>> the Java Object deserialization that is taking place. We have a >>>>> >>>>>> "Class<T> type" that is used to decode the input stream: >>>>> >>>>>> >>>>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream); >>>>> >>>>>> return type.cast(ois.readObject()); >>>>> >>>>>> >>>>> >>>>>> What I would like to do would be something like: >>>>> >>>>>> >>>>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream) { >>>>> >>>>>> @Override >>>>> >>>>>> protected Class<?> resolveClass(ObjectStreamClass desc) throws >>>>> >>>>>> IOException, ClassNotFoundException { >>>>> >>>>>> if (!desc.getName().equals(type.getName())) { >>>>> >>>>>> throw new InvalidClassException("Unauthorized >>>>> >>>>>> deserialization attempt", desc.getName()); >>>>> >>>>>> } >>>>> >>>>>> return super.resolveClass(desc); >>>>> >>>>>> } >>>>> >>>>>> }; >>>>> >>>>>> return type.cast(ois.readObject()); >>>>> >>>>>> >>>>> >>>>>> This would prevent a possible security hole where an attacker >>>>> >>>>>> could try to force the recipient of the input stream to >>>>> >>>>>> deserialize to a gadget class or the like for a RCE. >>>>> >>>>>> >>>>> >>>>>> The question is - does the deserialized type have to correspond >>>>> >>>>>> exactly to the supplied Class? Or is it supported that it's a base >>>>> >>>>>> type / abstract class? If the latter then my idea won't really >>>>> >>>>>> work. But if the type corresponds exactly then it should work OK. >>>>> >>>>>> >>>>> >>>>>> Thanks, >>>>> >>>>>> >>>>> >>>>>> Colm.
