For JSON serialization of objects such as FileScanTask, I have been
experimenting about the best way to achieve that. So far the best way I see
is actually still to use a fully customized JSON parser like what we did
for TableMetadata. This is because there are many sub-classes of the
FileScanTask interface, and serialization has to be defined all the way
down to some very nested objects like Expression. I don't see how that
could be easily done by Immutables or Jackson annotations, but I am not
very experienced in Java annotation processor frameworks. I can publish a
PR for more discussions around that area.

-Jack

On Wed, Dec 8, 2021 at 3:08 PM Ryan Blue <b...@tabular.io> wrote:

> This could be valuable for some things, but I'm skeptical that we would
> want to rewrite significant pieces of existing code this way.
>
> For example, you mentioned being able to "get rid of hand-written JSON
> serializers/deserializers". Automatic serialization is fiddly. You end up
> needing to scatter customizations around the code to annotate fields and
> add conversions. Then those annotations become the source of truth for
> serialization. If someone doesn't know what annotation to add to a field or
> just doesn't consider serialization you can serialize either more than you
> intended or something incorrect. This doesn't fit well with Iceberg, where
> the serialization to JSON is documented in the spec and is strict.
>
> In addition to making it hard to maintain a JSON format that aligns with
> the spec, this change isn't really needed. The existing parsers are already
> written and don't change much. We don't have to worry about updating
> dependencies and the implications to runtime Jars. But this would introduce
> new (and unnecessary) risk to integrations.
>
> To make that risk more concrete, consider object serialization in Spark.
> Today, we support Java, JSON, and Kryo serialization. There are places
> where we've needed to adapt the implementations and stop storing immutable
> collections for Kryo, or else Kryo can't reconstruct objects. So instead of
> storing an ImmutableMap, we store a HashMap and wrap it to be unmodifiable.
> If we were to rewrite classes that get serialized, we'd need to ensure
> we're not causing regressions where classes become incompatible with
> serialization that currently works.
>
> I don't think I would opt to spend time converting existing classes that
> we already know work, when those already have working JSON parsers, Java
> and Kryo serialization, and builders. Introducing big changes to achieve
> the same functionality is a pretty significant risk and should have a
> bigger upside than reducing lines of code, in my opinion.
>
> There may be uses other than converting existing code and there could be
> low risk existing classes that we can convert. Using this to support JSON
> serialization for `FileScanTask` is a good example because we need JSON
> serialization for Trino, but it doesn't need to conform to a spec.
>
> Ryan
>
> On Wed, Dec 8, 2021 at 4:37 AM Eduard Tudenhoefner <edu...@dremio.com>
> wrote:
>
>> Hello everyone,
>>
>> I was curious what people think about introducing/using the Immutables
>> <https://immutables.github.io/> library in Iceberg. The Immutables
>> library (Apache License 2.0) works via annotation processing, where an
>> abstract value class is annotated with *@Value.Immutable* to generate an 
>> *immutable
>> implementation class*.
>>
>> The main motivation for the Immutables library is outlined below:
>>
>>    - Use true immutable objects that are type-safe, thread-safe,
>>    null-safe
>>    - Get builder classes for free
>>    - reduce amount of boiler-plate code
>>    - Get JSON serialization/deserialization for free (this will also be
>>    helpful for the REST-based Catalog
>>    <https://github.com/apache/iceberg/pull/3424>) since Immutable
>>    objects are serialization ready (including JSON and its binary forms)
>>    - Supports lazy, derived and optional attributes
>>    - Immutable objects are constructed once, in a consistent state, and
>>    can be safely shared
>>       - Will fail if mandatory attributes are missing
>>       - Cannot be sneakily modified when passed to other code
>>    - Immutable objects are naturally thread-safe and can therefore be
>>    safely shared among threads
>>       - No excessive copying
>>       - No excessive synchronization
>>    - Object definitions are pleasant to write and read
>>       - No boilerplate setter and getters
>>       - No IDE-generated hashCode, equals and toString methods required.
>>       However, custom implementation for these methods can still be written
>>
>>
>> I did a small PoC in #3688 <https://github.com/apache/iceberg/pull/3688> 
>> (TableIdentifier/Namespace)
>> and #3580 <https://github.com/apache/iceberg/pull/3580> (TableMetadata)
>> but notice that both PRs still don't show the full potential of Immutables
>> as they only convert objects to being Immutable + using the generated
>> builders + doing some fairly easy validation checks.
>> The real benefit gets more obvious when we can get rid of hand-written
>> JSON serializers/deserializers.
>>
>>
>> Eduard
>>
>
>
> --
> Ryan Blue
> Tabular
>

Reply via email to