Re: Should a Schema be serializable in Java?

2019-07-16 Thread Ryan Skraba
Hello!  Thanks for the reference to AVRO-1852. It's exactly what I was
looking for.

I agree that Java serialization shouldn't be used for anything
cross-platform, or (in my opinion) used for any *data* persistence at all.
Especially not for an Avro container file or sending binary data through a
messaging system...

But Java serialization is definitely useful and used for sending instances
of "distributed work" implemented in Java from node to node in a cluster.
I'm not too worried about existing connectors -- we can see that each
framework has "solved" the problem one at a time.  In addition to Flink,
there's
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroUtils.java#L29
 and
https://github.com/apache/spark/blob/3663dbe541826949cecf5e1ea205fe35c163d147/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOutputWriterFactory.scala#L35
.

Specifically, I see the advantage for user-defined distributed functions
that happen to carry along an Avro Schema -- and I can personally say that
I've encountered this a lot in our code!
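
For reference, the workaround usually ends up looking roughly like the
sketch below -- a made-up holder class (the names are mine, this is not the
actual Beam/Flink/Spark code) that keeps the schema as its JSON string,
which *is* Serializable, and re-parses it lazily on the other side:

import java.io.Serializable;

import org.apache.avro.Schema;

// Hypothetical "Avro holder": the JSON string travels through Java
// serialization, and the Schema is rebuilt on first access after
// deserialization.
public class SerializableSchemaHolder implements Serializable {
  private static final long serialVersionUID = 1L;

  private final String schemaJson;   // serialized with the holder
  private transient Schema schema;   // re-parsed lazily on the worker

  public SerializableSchemaHolder(Schema schema) {
    this.schemaJson = schema.toString();
    this.schema = schema;
  }

  public Schema getSchema() {
    if (schema == null) {
      schema = new Schema.Parser().parse(schemaJson);
    }
    return schema;
  }
}

Every framework ends up rewriting some variant of this boilerplate, which
is exactly what a Serializable Schema would remove.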

That being said, I think it's probably overkill to warn the user about the
perils of Java serialization (not being cross-language and requiring
consistent JDKs and libraries across JVMs).  If an error occurs for one of
those reasons, there's a larger problem for the dev to address, and it's
just as likely to occur for any Java library in the job if the environment
is bad.  Relatedly, we've encountered similar issues with logical types
existing in Avro 1.8 in the driver but not in Avro 1.7 on the cluster...
the solution is "make sure you don't do that".  (Looking at you, Guava and
Jackson!)

The patch in question delegates serialization to the string form of the
schema, so it's basically doing what all of the above Avro "holders" are
doing -- I wouldn't object to having a sample schema available that fully
exercises what a schema can hold, but I also think that Schema.Parser (used
underneath) is currently pretty well tested and mature!
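
To make that concrete, delegating to the string form could look something
like this rough sketch (my own illustration of the approach, not the
actual AVRO-1852 patch; the real hook would have to live inside Schema
itself):

import java.io.ObjectStreamException;
import java.io.Serializable;

import org.apache.avro.Schema;

// Sketch of a serialization proxy: during Java serialization the Schema
// would be replaced by this small holder of its JSON string, and
// readResolve() re-parses it on the receiving side.
final class SerializedSchemaForm implements Serializable {
  private static final long serialVersionUID = 1L;

  private final String json;

  SerializedSchemaForm(Schema schema) {
    this.json = schema.toString();
  }

  private Object readResolve() throws ObjectStreamException {
    return new Schema.Parser().parse(json);
  }
}

// Hypothetically, inside Schema the corresponding hook would be:
//   private Object writeReplace() throws ObjectStreamException {
//     return new SerializedSchemaForm(this);
//   }

Using readResolve() also means anything already holding a Schema field
deserializes straight back into a real Schema, with no holder class
leaking into user code.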

Do you think this could be a candidate for 1.9.1 as a minor improvement?  I
can't think of any reason that this wouldn't be backwards compatible.

Ryan

Side note: I wrote java.lang.Serializable earlier (the interface is actually
java.io.Serializable), which probably didn't help my search for prior
discussion... :/

On Tue, Jul 16, 2019 at 9:59 AM Ismaël Mejía  wrote:

> This is a good idea even if it may have some issues that we should
> probably document and warn users about:
>
> 1. Java-based serialization is really practical for JVM-based systems,
> but we should probably add a warning or documentation because Java
> serialization is not deterministic between JVMs, so this could be a
> source of issues (usually companies use the same version of the JVM,
> so this is less critical, but it can still happen, especially now with
> all the different versions of Java and OpenJDK-based flavors).
>
> 2. This is not cross-language compatible; the String-based
> representation (or even an Avro-based representation of Schema) can be
> used in every language.
>
> Even with these caveats, I think it is worth making Schema Serializable
> just for ease of use. Is the plan to fully serialize it, or just to
> make it a String and serialize the String as done in the issue Doug
> mentioned?
> If we take the first approach, we need to test with a Schema that
> contains elements of the full specification to verify that
> (de)serialization works correctly. Does anyone know if we already have
> a test schema that covers the full ‘schema’ specification that we
> could reuse?
>
> On Mon, Jul 15, 2019 at 11:46 PM Driesprong, Fokko 
> wrote:
> >
> > Correct me if I'm wrong here, but as far as I understand, the way of
> > serializing the schema is using Avro, as it is part of the file. To avoid
> > confusion there should be one way of serializing.
> >
> > However, I'm not sure if this is worth the hassle of not simply
> > implementing Serializable. Flink, too, has a rather far-from-optimal
> > implementation:
> >
> https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/ParquetAvroWriters.java#L72
> > This converts it to JSON and back while distributing the schema to the
> > executors.
> >
> > Cheers, Fokko
> >
> > On Mon, Jul 15, 2019 at 11:03 PM Doug Cutting  wrote:
> >
> > > I can't think of a reason Schema should not implement Serializable.
> > >
> > > There's actually already an issue & patch for this:
> > >
> > > https://issues.apache.org/jira/browse/AVRO-1852
> > >
> > > Doug
> > >
> > > On Mon, Jul 15, 2019 at 6:49 AM Ismaël Mejía 
> wrote:
> > >
> > > > +dev@avro.apache.org
> > > >
> > > > On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba  wrote:
> > > > >
> > > > > Hello!
> > > > >
> > > > > I'm looking for any discussion or reference why the Schema object
> isn't
> > > > serializable -- I'm pretty sure this must have already been discussed
> > > (but
> > > > the keywords +avro +serializable +schema have MANY results in 

Re: Should a Schema be serializable in Java?

2019-07-16 Thread Ismaël Mejía
This is a good idea even if it may have some issues that we should
probably document and warn users about:

1. Java-based serialization is really practical for JVM-based systems,
but we should probably add a warning or documentation because Java
serialization is not deterministic between JVMs, so this could be a
source of issues (usually companies use the same version of the JVM,
so this is less critical, but it can still happen, especially now with
all the different versions of Java and OpenJDK-based flavors).

2. This is not cross-language compatible; the String-based
representation (or even an Avro-based representation of Schema) can be
used in every language.

Even with these caveats, I think it is worth making Schema Serializable
just for ease of use. Is the plan to fully serialize it, or just to
make it a String and serialize the String as done in the issue Doug
mentioned?
If we take the first approach, we need to test with a Schema that
contains elements of the full specification to verify that
(de)serialization works correctly. Does anyone know if we already have
a test schema that covers the full ‘schema’ specification that we could
reuse?
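
A round-trip test for the first approach could look roughly like the
sketch below (assuming Schema ends up implementing Serializable; the
sample schema is only an illustration touching a few corners of the spec
-- record, enum, fixed, array, map, union, logical type -- not an
exhaustive test schema):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

import org.apache.avro.Schema;

public class SchemaSerializationRoundTrip {

  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Everything\",\"namespace\":\"test\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"color\",\"type\":{\"type\":\"enum\",\"name\":\"Color\",\"symbols\":[\"RED\",\"GREEN\"]}},"
          + "{\"name\":\"hash\",\"type\":{\"type\":\"fixed\",\"name\":\"Md5\",\"size\":16}},"
          + "{\"name\":\"tags\",\"type\":{\"type\":\"array\",\"items\":\"string\"}},"
          + "{\"name\":\"props\",\"type\":{\"type\":\"map\",\"values\":\"string\"}},"
          + "{\"name\":\"maybe\",\"type\":[\"null\",\"double\"],\"default\":null},"
          + "{\"name\":\"ts\",\"type\":{\"type\":\"long\",\"logicalType\":\"timestamp-millis\"}}"
          + "]}";

  public static void main(String[] args) throws Exception {
    Schema original = new Schema.Parser().parse(SCHEMA_JSON);

    // Java-serialize the schema to a byte array...
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(original);
    }

    // ...and read it back.
    Schema copy;
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      copy = (Schema) in.readObject();
    }

    // Schema.equals compares the full structure, so this checks the round trip.
    if (!original.equals(copy)) {
      throw new AssertionError("Schema changed across Java serialization");
    }
  }
}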

On Mon, Jul 15, 2019 at 11:46 PM Driesprong, Fokko  wrote:
>
> Correct me if I'm wrong here, but as far as I understand, the way of
> serializing the schema is using Avro, as it is part of the file. To avoid
> confusion there should be one way of serializing.
>
> However, I'm not sure if this is worth the hassle of not simply
> implementing Serializable. Flink, too, has a rather far-from-optimal
> implementation:
> https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/ParquetAvroWriters.java#L72
> This converts it to JSON and back while distributing the schema to the
> executors.
>
> Cheers, Fokko
>
> On Mon, Jul 15, 2019 at 11:03 PM Doug Cutting  wrote:
>
> > I can't think of a reason Schema should not implement Serializable.
> >
> > There's actually already an issue & patch for this:
> >
> > https://issues.apache.org/jira/browse/AVRO-1852
> >
> > Doug
> >
> > On Mon, Jul 15, 2019 at 6:49 AM Ismaël Mejía  wrote:
> >
> > > +dev@avro.apache.org
> > >
> > > On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba  wrote:
> > > >
> > > > Hello!
> > > >
> > > > I'm looking for any discussion or reference why the Schema object isn't
> > > serializable -- I'm pretty sure this must have already been discussed
> > (but
> > > the keywords +avro +serializable +schema have MANY results in all the
> > > searches I did: JIRA, stack overflow, mailing list, web)
> > > >
> > > > In particular, I was at a demo today where we were asked why Schemas
> > > needed to be passed as strings to run in distributed tasks.  I remember
> > > running into this problem years ago with MapReduce, and again in Spark,
> > and
> > > again in Beam...
> > > >
> > > > Is there any downside to making a Schema implement
> > > java.lang.Serializable?  The only thing I can think of is that the schema
> > > _should not_ be serialized with the data, and making it non-serializable
> > > loosely enforces this (at the cost of continually writing different
> > > flavours of "Avro holders" for when you really do want to serialize it).
> > > >
> > > > Willing to create a JIRA and work on the implementation, of course!
> > > >
> > > > All my best, Ryan
> > >
> >


Re: Should a Schema be serializable in Java?

2019-07-15 Thread Driesprong, Fokko
Correct me if I'm wrong here, but as far as I understand, the way of
serializing the schema is using Avro, as it is part of the file. To avoid
confusion there should be one way of serializing.

However, I'm not sure if this is worth the hassle of not simply
implementing Serializable. Flink, too, has a rather far-from-optimal
implementation:
https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/ParquetAvroWriters.java#L72
This converts it to JSON and back while distributing the schema to the
executors.

Cheers, Fokko

On Mon, Jul 15, 2019 at 11:03 PM Doug Cutting  wrote:

> I can't think of a reason Schema should not implement Serializable.
>
> There's actually already an issue & patch for this:
>
> https://issues.apache.org/jira/browse/AVRO-1852
>
> Doug
>
> On Mon, Jul 15, 2019 at 6:49 AM Ismaël Mejía  wrote:
>
> > +dev@avro.apache.org
> >
> > On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba  wrote:
> > >
> > > Hello!
> > >
> > > I'm looking for any discussion or reference why the Schema object isn't
> > serializable -- I'm pretty sure this must have already been discussed
> (but
> > the keywords +avro +serializable +schema have MANY results in all the
> > searches I did: JIRA, stack overflow, mailing list, web)
> > >
> > > In particular, I was at a demo today where we were asked why Schemas
> > needed to be passed as strings to run in distributed tasks.  I remember
> > running into this problem years ago with MapReduce, and again in Spark,
> and
> > again in Beam...
> > >
> > > Is there any downside to making a Schema implement
> > java.lang.Serializable?  The only thing I can think of is that the schema
> > _should not_ be serialized with the data, and making it non-serializable
> > loosely enforces this (at the cost of continually writing different
> > flavours of "Avro holders" for when you really do want to serialize it).
> > >
> > > Willing to create a JIRA and work on the implementation, of course!
> > >
> > > All my best, Ryan
> >
>


Re: Should a Schema be serializable in Java?

2019-07-15 Thread Doug Cutting
I can't think of a reason Schema should not implement Serializable.

There's actually already an issue & patch for this:

https://issues.apache.org/jira/browse/AVRO-1852

Doug

On Mon, Jul 15, 2019 at 6:49 AM Ismaël Mejía  wrote:

> +dev@avro.apache.org
>
> On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba  wrote:
> >
> > Hello!
> >
> > I'm looking for any discussion or reference why the Schema object isn't
> serializable -- I'm pretty sure this must have already been discussed (but
> the keywords +avro +serializable +schema have MANY results in all the
> searches I did: JIRA, stack overflow, mailing list, web)
> >
> > In particular, I was at a demo today where we were asked why Schemas
> needed to be passed as strings to run in distributed tasks.  I remember
> running into this problem years ago with MapReduce, and again in Spark, and
> again in Beam...
> >
> > Is there any downside to making a Schema implement
> java.lang.Serializable?  The only thing I can think of is that the schema
> _should not_ be serialized with the data, and making it non-serializable
> loosely enforces this (at the cost of continually writing different
> flavours of "Avro holders" for when you really do want to serialize it).
> >
> > Willing to create a JIRA and work on the implementation, of course!
> >
> > All my best, Ryan
>


Re: Should a Schema be serializable in Java?

2019-07-15 Thread Ismaël Mejía
+dev@avro.apache.org

On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba  wrote:
>
> Hello!
>
> I'm looking for any discussion or reference why the Schema object isn't 
> serializable -- I'm pretty sure this must have already been discussed (but 
> the keywords +avro +serializable +schema have MANY results in all the 
> searches I did: JIRA, stack overflow, mailing list, web)
>
> In particular, I was at a demo today where we were asked why Schemas needed 
> to be passed as strings to run in distributed tasks.  I remember running into 
> this problem years ago with MapReduce, and again in Spark, and again in 
> Beam...
>
> Is there any downside to making a Schema implement java.lang.Serializable?  
> The only thing I can think of is that the schema _should not_ be serialized 
> with the data, and making it non-serializable loosely enforces this (at the 
> cost of continually writing different flavours of "Avro holders" for when you 
> really do want to serialize it).
>
> Willing to create a JIRA and work on the implementation, of course!
>
> All my best, Ryan