For JSON serialization of objects such as FileScanTask, I have been experimenting about the best way to achieve that. So far the best way I see is actually still to use a fully customized JSON parser like what we did for TableMetadata. This is because there are many sub-classes of the FileScanTask interface, and serialization has to be defined all the way down to some very nested objects like Expression. I don't see how that could be easily done by Immutables or Jackson annotations, but I am not very experienced in Java annotation processor frameworks. I can publish a PR for more discussions around that area.
-Jack On Wed, Dec 8, 2021 at 3:08 PM Ryan Blue <b...@tabular.io> wrote: > This could be valuable for some things, but I'm skeptical that we would > want to rewrite significant pieces of existing code this way. > > For example, you mentioned being able to "get rid of hand-written JSON > serializers/deserializers". Automatic serialization is fiddly. You end up > needing to scatter customizations around the code to annotate fields and > add conversions. Then those annotations become the source of truth for > serialization. If someone doesn't know what annotation to add to a field or > just doesn't consider serialization you can serialize either more than you > intended or something incorrect. This doesn't fit well with Iceberg, where > the serialization to JSON is documented in the spec and is strict. > > In addition to making it hard to maintain a JSON format that aligns with > the spec, this change isn't really needed. The existing parsers are already > written and don't change much. We don't have to worry about updating > dependencies and the implications to runtime Jars. But this would introduce > new (and unnecessary) risk to integrations. > > To make that risk more concrete, consider object serialization in Spark. > Today, we support Java, JSON, and Kryo serialization. There are places > where we've needed to adapt the implementations and stop storing immutable > collections for Kryo, or else Kryo can't reconstruct objects. So instead of > storing an ImmutableMap, we store a HashMap and wrap it to be unmodifiable. > If we were to rewrite classes that get serialized, we'd need to ensure > we're not causing regressions where classes become incompatible with > serialization that currently works. > > I don't think I would opt to spend time converting existing classes that > we already know work, when those already have working JSON parsers, Java > and Kryo serialization, and builders. Introducing big changes to achieve > the same functionality is a pretty significant risk and should have a > bigger upside than reducing lines of code, in my opinion. > > There may be uses other than converting existing code and there could be > low risk existing classes that we can convert. Using this to support JSON > serialization for `FileScanTask` is a good example because we need JSON > serialization for Trino, but it doesn't need to conform to a spec. > > Ryan > > On Wed, Dec 8, 2021 at 4:37 AM Eduard Tudenhoefner <edu...@dremio.com> > wrote: > >> Hello everyone, >> >> I was curious what people think about introducing/using the Immutables >> <https://immutables.github.io/> library in Iceberg. The Immutables >> library (Apache License 2.0) works via annotation processing, where an >> abstract value class is annotated with *@Value.Immutable* to generate an >> *immutable >> implementation class*. >> >> The main motivation for the Immutables library is outlined below: >> >> - Use true immutable objects that are type-safe, thread-safe, >> null-safe >> - Get builder classes for free >> - reduce amount of boiler-plate code >> - Get JSON serialization/deserialization for free (this will also be >> helpful for the REST-based Catalog >> <https://github.com/apache/iceberg/pull/3424>) since Immutable >> objects are serialization ready (including JSON and its binary forms) >> - Supports lazy, derived and optional attributes >> - Immutable objects are constructed once, in a consistent state, and >> can be safely shared >> - Will fail if mandatory attributes are missing >> - Cannot be sneakily modified when passed to other code >> - Immutable objects are naturally thread-safe and can therefore be >> safely shared among threads >> - No excessive copying >> - No excessive synchronization >> - Object definitions are pleasant to write and read >> - No boilerplate setter and getters >> - No IDE-generated hashCode, equals and toString methods required. >> However, custom implementation for these methods can still be written >> >> >> I did a small PoC in #3688 <https://github.com/apache/iceberg/pull/3688> >> (TableIdentifier/Namespace) >> and #3580 <https://github.com/apache/iceberg/pull/3580> (TableMetadata) >> but notice that both PRs still don't show the full potential of Immutables >> as they only convert objects to being Immutable + using the generated >> builders + doing some fairly easy validation checks. >> The real benefit gets more obvious when we can get rid of hand-written >> JSON serializers/deserializers. >> >> >> Eduard >> > > > -- > Ryan Blue > Tabular >