Re: supporting a "unit" field for avro schema

2019-07-15 Thread Erik Erlandson
If I'm interpreting the situation correctly, there is an "Avro Enhancement
Proposal" process, but none have been filed in nearly a decade:
https://cwiki.apache.org/confluence/display/AVRO/Avro+Enhancement+Proposals

As a start, I submitted a jira to track this idea:
https://issues.apache.org/jira/browse/AVRO-2474



On Mon, Jul 8, 2019 at 10:42 AM Erik Erlandson  wrote:

>
> What should I do to move this forward? Does Avro have a PIP process?
>
>
> On Sat, Jun 29, 2019 at 3:26 PM Erik Erlandson 
> wrote:
>
>>
>> Regarding schema, my proposal for fingerprints would be that units are
>> fingerprinted based on their canonical form, as defined here:
>> http://erikerlandson.github.io/blog/2019/05/03/algorithmic-unit-analysis/
>> Any two unit expressions having the same canonical form (including the
>> corresponding coefficients) are exactly equivalent, and so their
>> fingerprints can be the same. Possibly the unit could be stored on the
>> schema in canonical form by convention, although canonical forms are
>> frequently not as intuitive to humans and so in that case the documentation
>> value of the unit might be reduced for humans examining the schema.
>>
>> For schema evolution, a unit change such that the previous and new unit
>> are convertible (also defined at the above link) would be well defined,
>> and automatic transformation would just be the correct unit conversion
>> (e.g. seconds to milliseconds). If the unit changes to a non-convertible
>> unit (e.g. seconds to bytes) then no automatic transformation exists, and
>> attempting to resolve the old and new schema would be an error. Note that
>> establishing the conversion assumes that both original and new schemas are
>> available at read time.
>>
>>
>> On Sat, Jun 29, 2019 at 11:55 AM Niels Basjes  wrote:
>>
>>> I think we should approach this idea in two parts:
>>>
>>> 1) The schema. Things like: does a different unit mean a different schema
>>> fingerprint even though the bytes remain the same? What does a different
>>> unit mean for schema evolution?
>>>
>>> 2) Language specifics. Scala has different possibilities than Java.
>>>
>>> On Sat, Jun 29, 2019, 18:59 Erik Erlandson  wrote:
>>>
>>> > I've been puzzling over what can be done to support this in more
>>> > widely-used languages. The dilemma relative to the current language
>>> > ecosystem is that languages with "modern" type systems (Haskell, Rust,
>>> > Scala, etc) capable of supporting compile-time unit checking, in the
>>> > particular style I've been exploring, are not yet widely used.
>>> >
>>> > With respect to Java, a couple of approaches are plausible. One is to
>>> > enhance the language, for example with Java 8 compiler plugins. Another
>>> > might be to implement a unit type system similar to squants. This style
>>> > of unit type system is not as flexible or intuitive as what can be done
>>> > with Scala's latest type system sorcery, but it would allow the community
>>> > to build out a Java-native type system that supports compile-time unit
>>> > analysis. And its coverage of standard units could be made very good, as
>>> > squants itself demonstrates.
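As a rough illustration of what such a Java-native approach could look like (hypothetical names only, not squants or any existing library), the unit can be carried as a phantom type parameter so that mixing units fails at compile time:

    // Hypothetical sketch: the unit is a phantom type parameter, so combining
    // quantities of different units is rejected by the compiler.
    interface Unit {}
    final class Second implements Unit {}
    final class Meter implements Unit {}

    final class Quantity<U extends Unit> {
        private final double value;
        private Quantity(double value) { this.value = value; }
        static <U extends Unit> Quantity<U> of(double value) { return new Quantity<>(value); }
        // Only quantities carrying the same unit type can be combined.
        Quantity<U> plus(Quantity<U> that) { return new Quantity<>(this.value + that.value); }
        double value() { return value; }
    }

    class Demo {
        public static void main(String[] args) {
            Quantity<Second> t1 = Quantity.of(3.0);
            Quantity<Second> t2 = Quantity.of(4.5);
            System.out.println(t1.plus(t2).value());   // 7.5
            // Quantity<Meter> d = Quantity.of(10.0);
            // t1.plus(d);  // does not compile: Second != Meter
        }
    }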
>>> >
>>> > Python would also be a high-coverage target. I'm even less sure what to
>>> > do for Python, as it has no compile-time type checking, but perhaps a
>>> > squants-like Python class system would add value. Maybe Python's new
>>> > type-hints feature could be leveraged?
>>> >
>>> > Regarding unit expression representation, I'm not unhappy with what I've
>>> > prototyped in `coulomb-avro`, in broad strokes. It has deficiencies that
>>> > would need addressing. It doesn't yet support standard unit
>>> > abbreviations, nor does it understand plurals (e.g. it can parse
>>> > "second" but not "seconds"). Since its "unit" field is just a custom
>>> > metadata key, there is no enforcement. Parsers are currently
>>> > instantiated via explicit lists of types, which is a property I like,
>>> > but that may not work well in a world where multiple language bindings
>>> > must be supported in a portable manner.
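For reference, a small sketch of how such a "unit" field behaves as plain custom metadata in Avro's Java API today (the property name and value are purely illustrative, and nothing validates them):

    import org.apache.avro.Schema;

    public class UnitPropDemo {
        public static void main(String[] args) {
            // A record field carrying a custom "unit" property.
            String json = "{\"type\":\"record\",\"name\":\"Sample\",\"fields\":["
                + "{\"name\":\"duration\",\"type\":\"double\",\"unit\":\"second\"}]}";
            Schema schema = new Schema.Parser().parse(json);
            // Unknown attributes are preserved as props, but not interpreted or enforced.
            String unit = schema.getField("duration").getProp("unit");
            System.out.println(unit);  // prints: second
        }
    }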
>>> >
>>> >
>>> >
>>> > On Sat, Jun 29, 2019 at 1:46 AM Niels Basjes  wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > > I attended your talk in Berlin and at the end I thought "too bad this
>>> > > is only Scala".
>>> > >
>>> > > I think it's a good idea to have this in Avro.
>>> > >
>>> > > The details will be tricky: how to encode the units in the schema, for
>>> > > example.
>>> > > Especially because of the automatic conversion you spoke about.
>>> > >
>>> > > Niels
>>> > >
>>> > > On Fri, Jun 28, 2019, 23:58 Erik Erlandson  wrote:
>>> > >
>>> > > > Hi Avro community,
>>> > > >
>>> > > > Recently I have been experimenting with Avro schemas that are
>>> > > > extended with a "unit" field. By "unit" I mean expressions like
>>> > > > "second", or "megabyte" - that is, "units of measure".
>>> > > >
>>> > > > I delivered a 

[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields

2019-07-15 Thread Erik Erlandson (JIRA)


[ https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885691#comment-16885691 ]

Erik Erlandson commented on AVRO-2474:
--

(copied from email thread)

 
Regarding schema, my proposal for fingerprints would be that units are 
fingerprinted based on their canonical form, as [defined 
here|http://erikerlandson.github.io/blog/2019/05/03/algorithmic-unit-analysis/].
 Any two unit expressions having the same canonical form (including the 
corresponding coefficients) are exactly equivalent, and so their fingerprints 
can be the same. Possibly the unit could be stored on the schema in canonical 
form by convention, although canonical forms are frequently not as intuitive to 
humans and so in that case the documentation value of the unit might be reduced 
for humans examining the schema.
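As a rough sketch of that idea (illustrative types only, not part of Avro or coulomb), a canonical form can be modeled as a coefficient plus a map from base units to exponents, and any two unit expressions that canonicalize to the same value would contribute identically to a fingerprint:

    import java.util.Map;
    import java.util.Objects;

    // Illustrative canonical form: coefficient * (product of base units raised to exponents).
    final class CanonicalUnit {
        final double coefficient;
        final Map<String, Integer> baseUnits;   // e.g. {"meter": 1, "second": -1}

        CanonicalUnit(double coefficient, Map<String, Integer> baseUnits) {
            this.coefficient = coefficient;
            this.baseUnits = Map.copyOf(baseUnits);
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof CanonicalUnit)) return false;
            CanonicalUnit that = (CanonicalUnit) o;
            return coefficient == that.coefficient && baseUnits.equals(that.baseUnits);
        }

        @Override public int hashCode() { return Objects.hash(coefficient, baseUnits); }
    }

    // "kilometer per hour" and "(1000/3600) meter per second" reduce to the same
    // canonical form, so they would fingerprint identically:
    //   new CanonicalUnit(1000.0 / 3600.0, Map.of("meter", 1, "second", -1))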
 
For schema evolution, a unit change such that the previous and new unit are 
convertible (also defined at the above link) would be well defined, and 
automatic transformation would just be the correct unit conversion (e.g. 
seconds to milliseconds). If the unit changes to a non-convertible unit (e.g. 
seconds to bytes) then no automatic transformation exists, and attempting to 
resolve the old and new schema would be an error. Note that establishing the 
conversion assumes that both original and new schemas are available at read 
time.
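A minimal sketch of what that resolution step could look like (a hypothetical helper, assuming a "unit" property on both writer and reader fields and some externally supplied table of conversion factors):

    import org.apache.avro.Schema;

    // Illustrative only: how a reader might apply a unit conversion when the
    // writer and reader schemas declare different but convertible units.
    final class UnitResolution {
        static double resolve(Schema.Field writerField, Schema.Field readerField, double value) {
            String writerUnit = writerField.getProp("unit");
            String readerUnit = readerField.getProp("unit");
            if (writerUnit == null || readerUnit == null || writerUnit.equals(readerUnit)) {
                return value;   // no unit change, nothing to do
            }
            Double factor = conversionFactor(writerUnit, readerUnit);
            if (factor == null) {
                // e.g. "second" -> "byte": not convertible, resolution should fail
                throw new IllegalArgumentException(
                    "incompatible units: " + writerUnit + " -> " + readerUnit);
            }
            return value * factor;   // e.g. "second" -> "millisecond": factor 1000
        }

        // Stand-in for a real conversion table derived from canonical forms.
        static Double conversionFactor(String from, String to) {
            if (from.equals("second") && to.equals("millisecond")) return 1000.0;
            if (from.equals("millisecond") && to.equals("second")) return 0.001;
            return null;
        }
    }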
 

> Support a "unit" property of schema fields
> --
>
> Key: AVRO-2474
> URL: https://issues.apache.org/jira/browse/AVRO-2474
> Project: Apache Avro
>  Issue Type: Improvement
>  Components: spec
>Affects Versions: 1.9.0
>Reporter: Erik Erlandson
>Priority: Major
>
> Recently I have been experimenting with Avro schemas that are extended with a 
> "unit" field. By "unit" I mean expressions like "second", or "megabyte" - 
> that is "units of measure".
>  
> I received some community interest in making this concept "first class" for 
> avro; I'm filing this JIRA to track the idea. 
>  
> I delivered a short talk on my experiments at Berlin Buzzwords, which can be 
> viewed here:
> [https://www.youtube.com/watch?v=qrQmB2KFKE8]
>  
> I also wrote a short blog post that may be faster to ingest:
> [http://erikerlandson.github.io/blog/2019/05/23/unit-types-for-avro-schema-integrating-avro-with-coulomb/]
>  
> The project itself is here:
> [https://github.com/erikerlandson/coulomb]
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (AVRO-2474) Support a "unit" property of schema fields

2019-07-15 Thread Erik Erlandson (JIRA)
Erik Erlandson created AVRO-2474:


 Summary: Support a "unit" property of schema fields
 Key: AVRO-2474
 URL: https://issues.apache.org/jira/browse/AVRO-2474
 Project: Apache Avro
  Issue Type: Improvement
  Components: spec
Affects Versions: 1.9.0
Reporter: Erik Erlandson


Recently I have been experimenting with Avro schemas that are extended with a 
"unit" field. By "unit" I mean expressions like "second", or "megabyte" - that 
is "units of measure".
 
I received some community interest in making this concept "first class" for 
avro; I'm filing this JIRA to track the idea. 
 
I delivered a short talk on my experiments at Berlin Buzzwords, which can be 
viewed here:
[https://www.youtube.com/watch?v=qrQmB2KFKE8]
 
I also wrote a short blog post that may be faster to ingest:
[http://erikerlandson.github.io/blog/2019/05/23/unit-types-for-avro-schema-integrating-avro-with-coulomb/]
 
The project itself is here:
[https://github.com/erikerlandson/coulomb]
 
 
 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [jira] [Created] (AVRO-2473) C#: Fix documentation warnings

2019-07-15 Thread Brian Lachniet
Hey Patrick, thank you! I actually have a draft PR up for this now:
https://github.com/apache/avro/pull/586. I could certainly use a second
pair of eyes on my changes, if you're willing to review them.

I want to get your Reflect changes in before we try to merge these changes
in, though. I started to merge your reflect changes this past weekend but
screwed up the rebase. Check out my latest comments on your PR if you
haven't seen them already.

On Sun, Jul 14, 2019 at 7:14 PM Patrick Farry 
wrote:

> want some help with this?
>
> On Sun, Jul 14, 2019, 4:56 AM Brian Lachniet (JIRA) 
> wrote:
>
> > Brian Lachniet created AVRO-2473:
> > 
> >
> >  Summary: C#: Fix documentation warnings
> >  Key: AVRO-2473
> >  URL: https://issues.apache.org/jira/browse/AVRO-2473
> >  Project: Apache Avro
> >   Issue Type: Improvement
> >   Components: csharp
> > Affects Versions: 1.9.0
> > Reporter: Brian Lachniet
> > Assignee: Brian Lachniet
> >  Fix For: 1.10.0, 1.9.1
> >
> >
> > Fix the hundreds of documentation warnings in the C# project. These
> > warnings include malformed documentation as well as missing documentation
> > on public members.
> >
> >
> >
> > --
> > This message was sent by Atlassian JIRA
> > (v7.6.14#76016)
> >
>


-- 


Brian Lachniet

Software Engineer

E: blachn...@gmail.com | blachniet.com 

 


Re: Should a Schema be serializable in Java?

2019-07-15 Thread Driesprong, Fokko
Correct me if I'm wrong here, but as far as I understood, the way of
serializing the schema is using Avro itself, as it is part of the file. To
avoid confusion there should be one way of serializing.

However, I'm not sure if this is worth the hassle of not simply
implementing Serializable. Also, in Flink there is a rather far-from-optimal
implementation:
https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/ParquetAvroWriters.java#L72
This converts the schema to JSON and back while distributing it to the
executors.
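For context, the usual workaround looks roughly like this (a sketch of the "Avro holder" pattern mentioned later in this thread: round-tripping the schema through its JSON text in custom serialization hooks):

    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import org.apache.avro.Schema;

    // Sketch of the workaround discussed here: wrap the non-serializable Schema
    // and round-trip it through its JSON text in writeObject/readObject.
    public class SerializableSchemaHolder implements Serializable {
        private static final long serialVersionUID = 1L;

        private transient Schema schema;

        public SerializableSchemaHolder(Schema schema) {
            this.schema = schema;
        }

        public Schema getSchema() {
            return schema;
        }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            out.writeObject(schema.toString());   // schema as JSON text
        }

        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            schema = new Schema.Parser().parse((String) in.readObject());
        }
    }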

Cheers, Fokko

Op ma 15 jul. 2019 om 23:03 schreef Doug Cutting :

> I can't think of a reason Schema should not implement Serializable.
>
> There's actually already an issue & patch for this:
>
> https://issues.apache.org/jira/browse/AVRO-1852
>
> Doug
>
> On Mon, Jul 15, 2019 at 6:49 AM Ismaël Mejía  wrote:
>
> > +dev@avro.apache.org
> >
> > On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba  wrote:
> > >
> > > Hello!
> > >
> > > I'm looking for any discussion or reference why the Schema object isn't
> > serializable -- I'm pretty sure this must have already been discussed
> (but
> > the keywords +avro +serializable +schema have MANY results in all the
> > searches I did: JIRA, stack overflow, mailing list, web)
> > >
> > > In particular, I was at a demo today where we were asked why Schemas
> > needed to be passed as strings to run in distributed tasks.  I remember
> > running into this problem years ago with MapReduce, and again in Spark,
> and
> > again in Beam...
> > >
> > > Is there any downside to making a Schema implement
> > java.lang.Serializable?  The only thing I can think of is that the schema
> > _should not_ be serialized with the data, and making it non-serializable
> > loosely enforces this (at the cost of continually writing different
> > flavours of "Avro holders" for when you really do want to serialize it).
> > >
> > > Willing to create a JIRA and work on the implementation, of course!
> > >
> > > All my best, Ryan
> >
>


Re: Should a Schema be serializable in Java?

2019-07-15 Thread Doug Cutting
I can't think of a reason Schema should not implement Serializable.

There's actually already an issue & patch for this:

https://issues.apache.org/jira/browse/AVRO-1852

Doug

On Mon, Jul 15, 2019 at 6:49 AM Ismaël Mejía  wrote:

> +dev@avro.apache.org
>
> On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba  wrote:
> >
> > Hello!
> >
> > I'm looking for any discussion or reference why the Schema object isn't
> serializable -- I'm pretty sure this must have already been discussed (but
> the keywords +avro +serializable +schema have MANY results in all the
> searches I did: JIRA, stack overflow, mailing list, web)
> >
> > In particular, I was at a demo today where we were asked why Schemas
> needed to be passed as strings to run in distributed tasks.  I remember
> running into this problem years ago with MapReduce, and again in Spark, and
> again in Beam...
> >
> > Is there any downside to making a Schema implement
> java.lang.Serializable?  The only thing I can think of is that the schema
> _should not_ be serialized with the data, and making it non-serializable
> loosely enforces this (at the cost of continually writing different
> flavours of "Avro holders" for when you really do want to serialize it).
> >
> > Willing to create a JIRA and work on the implementation, of course!
> >
> > All my best, Ryan
>


Re: Should a Schema be serializable in Java?

2019-07-15 Thread Ismaël Mejía
+dev@avro.apache.org

On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba  wrote:
>
> Hello!
>
> I'm looking for any discussion or reference why the Schema object isn't 
> serializable -- I'm pretty sure this must have already been discussed (but 
> the keywords +avro +serializable +schema have MANY results in all the 
> searches I did: JIRA, stack overflow, mailing list, web)
>
> In particular, I was at a demo today where we were asked why Schemas needed 
> to be passed as strings to run in distributed tasks.  I remember running into 
> this problem years ago with MapReduce, and again in Spark, and again in 
> Beam...
>
> Is there any downside to making a Schema implement java.lang.Serializable?  
> The only thing I can think of is that the schema _should not_ be serialized 
> with the data, and making it non-serializable loosely enforces this (at the 
> cost of continually writing different flavours of "Avro holders" for when you 
> really do want to serialize it).
>
> Willing to create a JIRA and work on the implementation, of course!
>
> All my best, Ryan