I like the idea to automatically figure out which types are used by a program and to register them at Kryo. Thus, +1 for this idea.
On Tue, Jan 20, 2015 at 11:34 AM, Robert Metzger <rmetz...@apache.org> wrote: > @Stephan: Yes, you are summarizing it correctly. > I'll assign FLINK-1417 to myself and implement it as discussed here (once I > have resolved the other issues assigned to me) > > There is one additional point we forgot in the discussion so far: We are > initializing Kryo with twitter/chill's "ScalaKryoInstantiator()". I just > checked and its registering 51 default serializers and 81 registered types. > > I just talked to Till (who implemented the KryoSerializer originally) and > he also suggests to pool kryo instances using thread-local variables. I'll > look into that as well once FLINK-1417 has been resolved. I think that > helps to mitigate the heavy initialization of Kryo (which is inevitable) > > > @Arvid: Our POJO/Types support does explicitly not require our users to > implement any interfaces, so that option is not feasible. > The bytecode analysis is not part of the main flink distribution because it > is using a library with an apache incompatible license. > > > > On Tue, Jan 20, 2015 at 9:58 AM, Arvid Heise <arvid.he...@gmail.com> > wrote: > > > An alternative way that may not be applicable in your case: > > > > For Sopremo, all types implemented a common interface. When a package is > > loaded, the Sopremo package manager scans the jar and looks for classes > > implementing the interfaces (quite fast, because not the entire class > must > > be loaded). All types implementing the interface are automatically added > to > > Kryo with their respective annotated serializers. > > > > If you still have bytecode analysis of jobs, you can also statically > > determine all types that are used, check for default serializers, and > > maintain only a minimal set of serializers used for this specific job. > You > > could already warn for unregistered types before submitting jobs. > > > > On Tue, Jan 20, 2015 at 6:38 AM, Stephan Ewen <se...@apache.org> wrote: > > > > > Yes, I agree that the Avro serializer should be available by default. > > That > > > is one case of a typical type that should work out of the box, given > that > > > we support Avro file formats. > > > > > > Let me summarize how I understood that suggestion: > > > > > > - We make Avro available by default by registering a default > serializer > > > for the SpecificBase > > > > > > - We create a library of serializers. We do not register them by > > default. > > > > > > - Via FLINK-1417, we analyze the types. For any (nested) type that we > > > encounter for which we have a serializer in the library, we register > that > > > serializer as the default serializer. Also, for every (nested) type we > > > encounter, we register a tag at Kryo. > > > > > > I like that, it should give a nice and smooth user experience. > > > > > > Greetings, > > > Stephan > > > > > > > > > > > > > > > On Mon, Jan 19, 2015 at 12:32 PM, Robert Metzger <rmetz...@apache.org> > > > wrote: > > > > > > > Hi, > > > > > > > > thank you for putting our discussion to the mailing list. This is > > indeed > > > > where such discussions belong. For the others, we started discussing > > > here: > > > > https://github.com/apache/flink/pull/304 > > > > > > > > I think there is one additional approach, which is probably close to > > (1): > > > > We only register those serializers by default which we don't see in > the > > > > pre-flight phase (I think right now thats only GenericData.Array from > > > > Avro). > > > > We would come across all the other classes (Jodatime, Protobuf, Avro, > > > > Thrift, ...) when traversing the class hierarchy, as proposed in > > > > FLINK-1417. With this approach, users get the best out-of-the box > > > > experience and the number of registered classes / serializers is kept > > at > > > a > > > > minimum. > > > > We can still offer means to register additional serializers (I think > > > thats > > > > already merged to master). > > > > > > > > My main concern with this particular issue is a good out of the box > > user > > > > experience. If there is an issue with type serialization, users will > > > notice > > > > it very early. (In my experience people often have their existing > > > datatypes > > > > they use with other systems, and they want to continue using them) > > > > Therefore, I want to put some effort into making it as good as > > possible. > > > I > > > > would actually sacrifice performance over stability/usability here. > Our > > > > system is flexible enough to replace it later with a more efficient > > > > serialization if that becomes an issue. But maybe my suggestion above > > is > > > > already sufficient. > > > > > > > > We could also think about introducing a configuration variable which > > > allows > > > > users to disable the default serializers. > > > > > > > > > > > > Regarding the second question: Is there a downside registering all > > types > > > > for tagging? We reduce the overall I/O which is good for performance. > > > > > > > > Best, > > > > Robert > > > > > > > > > > > > > > > > On Mon, Jan 19, 2015 at 8:24 PM, Stephan Ewen <se...@apache.org> > > wrote: > > > > > > > > > Hi all! > > > > > > > > > > We have various pending pull requests that add support for certain > > > types > > > > by > > > > > adding extra kryo serializers. > > > > > > > > > > I think we need to decide how we want to handle the support for > extra > > > > > types, because more are certainly to come. > > > > > > > > > > As I understand it, we have three broad options: > > > > > > > > > > (1) > > > > > Add as many serializers to Kryo by default as possible. > > > > > Pro: > > > > > - Many types work out of the box > > > > > Contra: > > > > > - We may eventually overload the kryo registry with serializers > > > > > that are not needed for most cases and suffer in performance > > > > > - It is hard to guess which types work out of the box > > > (intransparent) > > > > > > > > > > > > > > > (2) > > > > > We create a collection of serializers and a registration util. > > > > > -------- > > > > > val env = ExecutionEnvironemnt.getExecutionEnviroment() > > > > > > > > > > Serializers.registerProtoBufSerializers(env); > > > > > Serializers.registerJavaUtilSerializers(env); > > > > > --------- > > > > > Pro: > > > > > - Easy for users > > > > > - We can grow the set of supported types very large without > > > overloading > > > > > Kryo > > > > > - It is transparent what gets registered > > > > > > > > > > Contra: > > > > > - Not quite as convenient as if things just run > > > > > > > > > > > > > > > (3) > > > > > We do nothing and let the user create and register whatever is > > needed. > > > > > > > > > > We could have a library and utility for serializers for certain > > > > libraries. > > > > > Users could use this to conveniently add serializers for the > > libraries > > > > they > > > > > use. > > > > > Pro: > > > > > - Simple for us ;-) > > > > > Contra: > > > > > - More repeated work for users > > > > > > > > > > > > > > > ======================== > > > > > > > > > > For approach (1) and (2), there is an orthogonal question of > whether > > we > > > > > want to simply register default serializers (that enable that types > > > work) > > > > > or also register types for tags, to speed up the serialization of > > those > > > > > types. > > > > > > > > > > > > > > > Greetings, > > > > > Stephan > > > > > > > > > > > > > > >