Thanks Gábor, now I also see it on the internal GSoC interface. I have
indicated that I wish to mentor your project, I think you can hit finalize
on your project there.

On Mon, Mar 14, 2016 at 11:16 AM, Gábor Horváth <xazax....@gmail.com> wrote:

> Hi,
>
> I have updated this draft to include preliminary benchmarks, mention the
> interaction of annotations with savepoints, and add a timeline and some
> notes about Scala case classes.
>
> Regards,
> Gábor
>
> On 9 March 2016 at 16:12, Gábor Horváth <xazax....@gmail.com> wrote:
>
> > Hi!
> >
> > As far as I can see, the formatting was not correct in my previous
> > mail. A better-formatted version is available here:
> > https://docs.google.com/document/d/1VC8lCeErx9kI5lCMPiUn625PO0rxR-iKlVqtt3hkVnk
> > Sorry for that.
> >
> > Regards,
> > Gábor
> >
> > On 9 March 2016 at 15:51, Gábor Horváth <xazax....@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I did not want to send this proposal out before I had some initial
> >> benchmarks, but this issue was mentioned on the mailing list (
> >> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html
> >> ), and I wanted to make this information available so that it can be
> >> incorporated into that discussion. I have written this draft with the
> >> help of Gábor Gévay and Márton Balassi and I am open to every
> >> suggestion.
> >>
> >>
> >> The proposal draft:
> >> Code Generation in Serializers and Comparators of Apache Flink
> >>
> >> I am in the last semester of my MSc studies and I am a former GSoC
> >> student in the LLVM project. I plan to improve the serialization code
> >> in Flink during this summer. The current implementation of the
> >> serializers can be a performance bottleneck in some scenarios. These
> >> performance problems were also reported on the mailing list recently
> >> [1]. I plan to implement code generation in the serializers to improve
> >> the performance (as Stephan Ewen also suggested).
> >>
> >> TODO: I plan to include some preliminary benchmarks in this section.
> >>
> >> Performance problems with the current serializers
> >>
> >> 1. PojoSerializer uses reflection for accessing the fields, which is
> >>    slow (e.g. [2]).
> >>    - This is also a serious problem for the comparators.
> >> 2. When deserializing fields of primitive types (e.g. int), the
> >>    reusing overload of the corresponding field serializers cannot
> >>    really do any reuse, because boxed primitive types are immutable
> >>    in Java. This results in lots of object creations. [3][7]
> >> 3. The loop that calls the field serializers makes virtual function
> >>    calls that cannot be speculatively devirtualized by the JVM or
> >>    predicted by the CPU, because different serializer subclasses are
> >>    invoked for the different fields. (And the loop cannot be
> >>    unrolled, because the number of iterations is not a compile-time
> >>    constant.) See also the discussion on the mailing list [1].
> >> 4. A POJO field can have the value null, so the serializer inserts
> >>    1-byte null tags, which wastes space. (Also, the type extractor
> >>    logic does not distinguish between primitive types and their
> >>    boxed versions, so even an int field has a null tag.)
> >> 5. Subclass tags also add a byte at the beginning of every POJO.
> >> 6. getLength() does not know the size in most cases [4]. Knowing the
> >>    size of a type when serialized has numerous performance benefits
> >>    throughout Flink:
> >>    1. Sorters can sort in place when the type is small [5].
> >>    2. Chaining hash tables do not need resizes, because they know
> >>       how many buckets to allocate upfront [6].
> >>    3. Different hash table architectures could be used, e.g. open
> >>       addressing with linear probing instead of some chaining scheme.
> >>    4. It is possible to deserialize, modify, and then serialize a
> >>       record back to its original place, because the modified version
> >>       can never fail to fit in the space allocated for the old
> >>       version (see CompactingHashTable and ReduceHashTable for
> >>       concrete instances of this problem).
> >>
> >> Note that 2. and 3. are problems not just with the PojoSerializer,
> >> but also with the TupleSerializer.
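> >> To illustrate the cost of the per-field null tags (4.) and the
> >> per-record subclass tag (5.), here is a toy sketch that writes a
> >> two-int record in an assumed tagged layout (an illustration, not
> >> Flink's actual wire format) and in the bare layout that would be
> >> possible when neither tag is needed:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class TagOverhead {
    // Assumed tagged layout: 1 subclass tag, then (1 null tag + 4
    // bytes) per int field.
    public static byte[] writeTagged(int a, int b) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeByte(0);  // subclass tag
        out.writeByte(0);  // null tag for field a
        out.writeInt(a);
        out.writeByte(0);  // null tag for field b
        out.writeInt(b);
        return bytes.toByteArray();
    }

    // Layout when no tags are needed: just the raw field bytes.
    public static byte[] writeBare(int a, int b) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(a);
        out.writeInt(b);
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeTagged(1, 2).length); // prints 11
        System.out.println(writeBare(1, 2).length);   // prints 8
    }
}
```

> >> Note that in the bare layout the serialized size is also a known
> >> constant (8 bytes here), which is exactly what getLength() (problem
> >> 6.) would need.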
> >> Solution approaches
> >>
> >> 1. Run-time code generation for every POJO
> >>    - 1. and 3. would be automatically solved if the serializers for
> >>      POJOs were generated on the fly (by, for example, Javassist).
> >>    - 2. also needs code generation, and also some extra effort in
> >>      the type extractor to distinguish between primitive types and
> >>      their boxed versions.
> >>    - It could be used for PojoComparator as well (which could
> >>      greatly increase the performance of sorting).
> >> 2. Annotations on POJOs (by the users)
> >>    - Concretely:
> >>      - annotate fields that will never be null -> no null tag needed
> >>        before every field!
> >>      - make a POJO final -> no subclass tag needed
> >>      - annotate a POJO that it will never be null -> no top-level
> >>        null tag needed
> >>    - These would also help with the getLength problem (6.), because
> >>      the length is often unknown only because currently anything can
> >>      be null or a subclass can appear anywhere.
> >>    - These annotations could be implemented without code generation,
> >>      but then they would add some overhead when no annotations are
> >>      present, so this would work better together with the code
> >>      generation.
> >>    - Tuples would become a special case of POJOs, where nothing can
> >>      be null and no subclass can appear, so maybe we could eliminate
> >>      the TupleSerializer.
> >>    - We could annotate some internal types in Flink libraries (Gelly
> >>      (Vertex, Edge), FlinkML).
> >>
> >> TODO: what is the situation with Scala case classes? Run-time code
> >> generation is probably easier in Scala (with quasiquotes).
> >>
> >> About me
> >>
> >> I am in the last year of my Computer Science MSc studies at Eötvös
> >> Loránd University in Budapest, and I am planning to start a PhD in
> >> the autumn. I have been working for almost three years at Ericsson on
> >> static analysis tools for C++. In 2014 I participated in GSoC,
> >> working on the LLVM project, and I have been a frequent contributor
> >> ever since. The following summer I interned at Apple.
> >>
> >> I learned about the Flink project not too long ago and I like it so
> >> far. Over the last few weeks I have been working on some tickets to
> >> familiarize myself with the codebase:
> >>
> >> https://issues.apache.org/jira/browse/FLINK-3422
> >>
> >> https://issues.apache.org/jira/browse/FLINK-3322
> >>
> >> https://issues.apache.org/jira/browse/FLINK-3457
> >>
> >> My CV is available here: http://xazax.web.elte.hu/files/resume.pdf
> >> References
> >>
> >> [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html
> >>
> >> [2] https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/typeutils/runtime/PojoSerializer.java#L369
> >>
> >> [3] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/base/IntSerializer.java#L73
> >>
> >> [4] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/TypeSerializer.java#L98
> >>
> >> [5] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/sort/FixedLengthRecordSorter.java
> >>
> >> [6] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/hash/CompactingHashTable.java#L861
> >>
> >> [7] https://issues.apache.org/jira/browse/FLINK-3277
> >>
> >>
> >> Best Regards,
> >>
> >> Gábor
> >>
> >
> >
>
