Re: Alternative use of record field default values?
If you use record builders, you currently get this behavior in the Java implementation [1]. AFAICT, there's no builder equivalent in the Python implementation yet.

In Python, maybe we can skip having a builder, because we can distinguish "key maps to None" from "dict does not contain key". Does that sound reasonable? Care to file a ticket and maybe propose a patch?

[1]: http://avro.apache.org/docs/1.7.7/api/java/org/apache/avro/data/RecordBuilder.html
or more likely the generic implementation: http://avro.apache.org/docs/1.7.7/api/java/org/apache/avro/generic/GenericRecordBuilder.html
or an example of the builder in generated specific code: http://avro.apache.org/docs/current/gettingstartedjava.html#Creating+Users

On Mon, Aug 11, 2014 at 2:33 PM, Jeno I. Hajdu jeno.i.ha...@gmail.com wrote:

Hi,

my understanding of the field default values (based on the spec) is that they are solely for filling in fields present in the reader schema but missing in the writer schema, thus defaults only make sense in reader schemas.

In addition to that, couldn't defaults be used on the writer side (defined in the writer schema) to fill in fields with missing values? So if I have a record schema with 100 fields, all having defaults, I could specify only 5 field values for a record to be serialized and the Avro lib would fill in the rest.

This does not impact the serialization (format) itself; the spec would only allow using defaults for this purpose (and, for example, adding this support to the Python implementation takes 2 extra lines based on a quick trial).

What do you think? Would this go against Avro's philosophy?

Thanks and Regards,
Jeno

--
Sean
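To make the "key maps to None" versus "dict does not contain key" distinction concrete, here is a minimal Python sketch. It is not the avro package's API and not a proposed patch: `fill_defaults` and the `(name, has_default, default)` tuples are hypothetical stand-ins for a parsed record schema, and the field names are made up for illustration.

```python
def fill_defaults(datum, fields):
    """Return a copy of datum with absent fields filled from their defaults.

    fields is a list of (name, has_default, default) tuples, a hypothetical
    stand-in for the fields of a parsed record schema.
    """
    filled = dict(datum)
    for name, has_default, default in fields:
        if name in filled:
            # Key is present: keep the caller's value, even if it is None.
            continue
        if has_default:
            # Key is absent and the schema declares a default: use it.
            filled[name] = default
    return filled


# Hypothetical schema: one required field and two fields with defaults.
fields = [
    ("name", False, None),
    ("favorite_number", True, 0),
    ("favorite_color", True, "none"),
]

# favorite_color is explicitly None, favorite_number is simply missing.
record = fill_defaults({"name": "Alyssa", "favorite_color": None}, fields)
print(record)
```

Only the missing key picks up its default here; the explicit None is left alone, which is why a separate builder object may not be needed in Python.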
[jira] [Commented] (AVRO-1047) Generated Java classes for specific records contain unchecked casts
[ https://issues.apache.org/jira/browse/AVRO-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094081#comment-14094081 ]

Jurriaan Mous commented on AVRO-1047:
-------------------------------------

Can this patch be added to the next release? I have an Avro schema builder in my Gradle build chain, and now I can't distinguish unchecked warnings in my own code, which I want to keep free of warnings, from those caused by the schema generation.

Many javac versions don't work with "all" but do with "unchecked". Maybe, to be entirely sure, @SuppressWarnings({"all", "unchecked"}) can be used.

Generated Java classes for specific records contain unchecked casts
-------------------------------------------------------------------
Key: AVRO-1047
URL: https://issues.apache.org/jira/browse/AVRO-1047
Project: Avro
Issue Type: Bug
Components: java
Affects Versions: 1.6.3
Reporter: Garrett Wu
Attachments: AVRO-1047.patch, suppress-warnings.tar.gz

The generated Java classes for specific records cause compiler warnings using Oracle/Sun Java 1.6, since it doesn't support @SuppressWarnings("all"). Instead, could we change it to @SuppressWarnings("unchecked")? Only "unchecked" and "deprecation" are mentioned in the Java Language Specification -- the rest are specific to compiler vendors.

-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (AVRO-739) Add Date/Time data types
[ https://issues.apache.org/jira/browse/AVRO-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094121#comment-14094121 ]

Dmitry Kovalev commented on AVRO-739:
-------------------------------------

bq. But what I'm trying to get at is whether your IPC use case for the string representations could be solved another way.

In short - of course it could, just in a more laborious way.

bq. Maybe I'm wrong about this, but it seems like using strings would probably be most helpful in debugging the application. And if that's the case, we can provide a few simple tools for working with these types rather than changing the representation to avoid the conversion. What about adding a set of helpers...

Having these in the Avro distribution would certainly encourage more people to go with binary representations if they are going to be the only standard, although this level of support is of course not the same - e.g. when you use toString() to dump the object as JSON, or inspect an object in a debugger, you will still see just a byte sequence. Other binary-encoded types are mostly first-class primitives which get translated to strings by standard tools, so this is not an issue for them.

However, I used debugging as just one illustration of why I thought it would be worth having standardised string representations where compactness and performance are not absolutely critical. Another reason, which we have also touched on above, is that there is a real lack of common _binary_ representations (and platform support) for anything beyond simple timestamps and dates, and this is what made people misuse e.g. Date and confuse UTC/local/zoned time, fixed duration vs duration in months/days, etc. Even in this spec we don't have a separate type/binary representation for local date+time - only separate types for each component - so undoubtedly some people will decide to use timestamp-millis for it, despite the spec saying explicitly that it represents UTC date-time. And the representation specified for Duration may be the most efficient, but it is not something that can be called commonly used or easy to interpret. If you remember the issue of higher-precision time, which we have omitted from the spec - is it going to have a separate binary representation as well?

ISO-8601 provides a basis to represent all of these naturally, in a way instantly understandable by a human, and makes it easy to standardise different types of date-time information and promote their correct usage, and also to provide a bridge to binary representations.

Having said that, I absolutely don't insist on including these in the spec - I just attempted to explain the reasons I am using them currently and initially suggested them.

Add Date/Time data types
------------------------
Key: AVRO-739
URL: https://issues.apache.org/jira/browse/AVRO-739
Project: Avro
Issue Type: New Feature
Components: spec
Reporter: Jeff Hammerbacher
Fix For: 1.7.8
Attachments: AVRO-739-datetime-spec.xml.patch, AVRO-739.patch
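The "set of helpers" idea quoted above can be sketched in a few lines of Python. This is only an illustration, assuming timestamp-millis means milliseconds since the Unix epoch in UTC, as the comment describes; the function names `millis_to_iso` and `iso_to_millis` are hypothetical and not part of any Avro distribution.

```python
from datetime import datetime, timedelta, timezone

# Reference point for timestamp-millis values (assumed: Unix epoch, UTC).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_iso(ts_millis):
    """Render a timestamp-millis value as an ISO-8601 string for debugging."""
    moment = EPOCH + timedelta(milliseconds=ts_millis)
    return moment.isoformat(timespec="milliseconds")


def iso_to_millis(text):
    """Parse an ISO-8601 string with a UTC offset back into timestamp-millis."""
    delta = datetime.fromisoformat(text) - EPOCH
    # Integer division by one millisecond avoids float rounding.
    return delta // timedelta(milliseconds=1)


stamp = 1407801600123
text = millis_to_iso(stamp)
print(text)                  # 2014-08-12T00:00:00.123+00:00
print(iso_to_millis(text))   # 1407801600123
```

Helpers like these keep the compact binary form on the wire and only produce the readable string on demand, which is the trade-off being debated in this thread.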
[jira] [Commented] (AVRO-739) Add Date/Time data types
[ https://issues.apache.org/jira/browse/AVRO-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094128#comment-14094128 ] Dmitry Kovalev commented on AVRO-739: - bq. I think the right way to handle this is to use the zone-independent date/time types and an application-level zone implementation. These cases aren't very common, as you noted, and I think having a timestamp with zone logical type allows people to get around best practices and doesn't deliver a better solution for people that actually need to represent the zone. It may be slightly easier to represent the type in a single field, but size is significantly larger and the value only has significance when interpreted at the application layer anyway. In environments providing rich support for date-time related types (such as Joda Time / Noda time), this actually translates directly into the likes of ZonedDateTime, and can be handled on Avro level, e.g. using Specific the generated objects can expose ZonedDateTime properties instead of strings. This is what I do so it does deliver a better solution for me. Happy to drop it from the spec anyway. Add Date/Time data types Key: AVRO-739 URL: https://issues.apache.org/jira/browse/AVRO-739 Project: Avro Issue Type: New Feature Components: spec Reporter: Jeff Hammerbacher Fix For: 1.7.8 Attachments: AVRO-739-datetime-spec.xml.patch, AVRO-739.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (AVRO-739) Add Date/Time data types
[ https://issues.apache.org/jira/browse/AVRO-739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Kovalev updated AVRO-739: Attachment: AVRO-739-datetime-spec.xml.patch Attaching a revised patch which fixes timestamp sorting and duration endianness issues. With regard to keeping string representations/zoned types - if I'm not missing anyone, so far we have basically 1 vote for keeping them and 2 votes against. If nobody else votes, all that needs to be done is to remove the bits about string representations from this patch. Comments about adding local date-time and High-precision time in addition to timestamp-millis are welcome. Add Date/Time data types Key: AVRO-739 URL: https://issues.apache.org/jira/browse/AVRO-739 Project: Avro Issue Type: New Feature Components: spec Reporter: Jeff Hammerbacher Fix For: 1.7.8 Attachments: AVRO-739-datetime-spec.xml.patch, AVRO-739-datetime-spec.xml.patch, AVRO-739.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (AVRO-739) Add Date/Time data types
[ https://issues.apache.org/jira/browse/AVRO-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094341#comment-14094341 ] Ryan Blue commented on AVRO-739: Good point about not being able to use conversion methods in situations like debugging. But, I think I'd rather not have those limitations dictate the possible representations when we'll end up with more to support and wasteful formats. You also mention using specific objects with ZonedDateTime fields -- that addresses this problem by deserializing to a form that has a meaningful toString representation, right? Maybe we should encourage that approach. bq. Having said that, I absolutely don't insist on including these into spec - just attempted to explain the reasons I am using them currently and have initially suggested them. Same here, I don't want to insist on anything. I just want to find a good solution. bq. Comments about adding local date-time and High-precision time in addition to timestamp-millis are welcome. For high-precision, what granularity do you think needs to be supported? Nanos? Micros? We didn't have a clear answer on the Parquet side, which is why we pushed high-precision from the original spec -- better to get some of the types in and expand later. Maybe we should open a follow-up issue to discuss these? Add Date/Time data types Key: AVRO-739 URL: https://issues.apache.org/jira/browse/AVRO-739 Project: Avro Issue Type: New Feature Components: spec Reporter: Jeff Hammerbacher Fix For: 1.7.8 Attachments: AVRO-739-datetime-spec.xml.patch, AVRO-739-datetime-spec.xml.patch, AVRO-739.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (AVRO-680) Allow for non-string keys
[ https://issues.apache.org/jira/browse/AVRO-680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094384#comment-14094384 ]

Sachin Goyal commented on AVRO-680:
-----------------------------------

Also, I tried a schema1.equals(schema2) check for non-string map keys and it is working well.

{code}
// Company2 is a test class with non-string map keys (from the attached patch).
@Test
public void testSchemaEquality() throws Exception {
  Schema s1 = (new ReflectData()).getSchema(Company2.class);
  Schema s2 = (new ReflectData()).getSchema(Company2.class);
  assertEquals(s1, s2);
}
{code}

Do you see a use case where isMap should return false for a non-string map key and isArray should return true?

Allow for non-string keys
-------------------------
Key: AVRO-680
URL: https://issues.apache.org/jira/browse/AVRO-680
Project: Avro
Issue Type: Improvement
Affects Versions: 1.7.6, 1.7.7
Reporter: Jeremy Hanna
Attachments: AVRO-680.patch, non_string_map_keys.zip, non_string_map_keys2.zip, non_string_map_keys3.zip, non_string_map_keys4.patch, non_string_map_keys5.patch, non_string_map_keys6.patch

Based on an email thread back in April, Doug Cutting proposed a possible solution for having non-string keys:

Stu Hood wrote:
I can understand the reasoning behind AVRO-9, but now I need to look for an alternative to a 'map' that will allow me to store an association of bytes keys to values.

A map of Foo has the same binary format as an array of records, each with a string field and a Foo field. So an application can use an array schema similar to this to represent map-like structures with, e.g., non-string keys.

Perhaps we could establish standard properties that indicate that a given array of records should be represented in a map-like way if possible? E.g.:

{"type": "array", "isMap": true, "items": {"type": "record", ...}}

Doug
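Doug's array-of-records suggestion can be illustrated with plain Python data. This is a sketch under stated assumptions, not the behaviour of any attached patch: "isMap" is the property proposed in the thread rather than an established Avro attribute, and the record name "Pair", the field names "key" and "value", and the two helper functions are made up for illustration.

```python
# Schema in the shape proposed in the thread, written as a Python dict.
# "isMap" is the proposed hint; "Pair", "key" and "value" are hypothetical names.
PAIR_ARRAY_SCHEMA = {
    "type": "array",
    "isMap": True,
    "items": {
        "type": "record",
        "name": "Pair",
        "fields": [
            {"name": "key", "type": "bytes"},
            {"name": "value", "type": "long"},
        ],
    },
}


def map_to_pairs(mapping):
    """Turn a dict with arbitrary keys into a list of key/value records."""
    return [{"key": key, "value": value} for key, value in mapping.items()]


def pairs_to_map(pairs):
    """Rebuild the dict from a list of key/value records."""
    return {pair["key"]: pair["value"] for pair in pairs}


counts = {b"\x00\x01": 3, b"\xff": 7}
pairs = map_to_pairs(counts)
print(pairs)
print(pairs_to_map(pairs) == counts)
```

The list of records is what an array schema would serialize; the hint only tells an implementation that it may hand the data back to the application as a map.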
Re: Release Avro 1.8.0 soon?
I have posted questions on AVRO-680 (non-string map keys) and AVRO-1554 (support for UUID and date types). I would appreciate it if someone could answer them.

Thanks,
Sachin

On Tue, Aug 5, 2014 at 7:13 AM, Christophe Taton ta...@wibidata.com wrote:

I'd be happy to see AVRO-1334 and AVRO-1550 checked in.

C.

On Tue, Aug 5, 2014 at 6:40 AM, Willem van Bergen wil...@vanbergen.org wrote:

We also previously discussed dropping support for Ruby 1.8, as it is EOL. I created a patch that removes some of the hacks needed to support string encodings in Ruby 1.8 and 1.9+ simultaneously: https://issues.apache.org/jira/browse/AVRO-1559

Willem

On Aug 4, 2014, at 6:31 PM, Doug Cutting cutt...@apache.org wrote:

This introduces a minor incompatibility, so it needs to go into 1.8.0, not 1.7.8.

Perhaps we should identify other changes that also introduce minor incompatibilities and push out a 1.8.0 release soon? I don't want to open the gates to major incompatibilities, but a few changes that are high value and whose incompatibilities are relatively easy to diagnose might be reasonable. Other obvious candidates might be:
- AVRO-1334 (update java dependencies)
- AVRO-1550 (update protobuf dependency)
- AVRO-1514 (update perl dependencies)

What do others think?

Doug

On Mon, Aug 4, 2014 at 2:50 PM, Sean Busbey (JIRA) j...@apache.org wrote:

[ https://issues.apache.org/jira/browse/AVRO-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Busbey updated AVRO-997:
-----------------------------
Status: Patch Available (was: In Progress)

Union of enum and null cannot be serialized
-------------------------------------------
Key: AVRO-997
URL: https://issues.apache.org/jira/browse/AVRO-997
Project: Avro
Issue Type: Bug
Affects Versions: 1.5.1
Reporter: Aaron Kimball
Assignee: Sean Busbey
Fix For: 1.8.0
Attachments: AVRO-997.patch, AVRO-997.patch, AVRO-997.patch, AVRO-997.permissive-generic-api.patch

I have a schema like:
{code}
[
  { "type": "enum", "name": "Gender", "symbols": ["M", "F"] },
  { "type": "record", "name": "Foo", "fields": [
      { "type": ["Gender", "null"], "name": "gender" },
      ...
  ] }
]
{code}
I build a record like {{Foo foo = new Foo(); foo.gender = Gender.M;}}

When I go to serialize this, I get:
{code}
Not in union [{"type":"enum","name":"Gender","symbols":["M","F"]},"null"]: M
  at org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:482)
  at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:70)
  at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104)
  at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:65)
  at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:57)
{code}
[jira] [Created] (AVRO-1566) Use field default value to fill in missing values when serializing records
Jeno I. Hajdu created AVRO-1566: --- Summary: Use field default value to fill in missing values when serializing records Key: AVRO-1566 URL: https://issues.apache.org/jira/browse/AVRO-1566 Project: Avro Issue Type: Improvement Components: python Reporter: Jeno I. Hajdu Priority: Minor Field default values according to the spec are meant to fill in fields present in the reader schema but missing from the writer schema during deserialization. In addition to that default values could be used in the writer schema to fill in missing values when serializing. This is already supported in the Java implementation through record builders. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (AVRO-1566) Use field default value to fill in missing values when serializing records
[ https://issues.apache.org/jira/browse/AVRO-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeno I. Hajdu updated AVRO-1566: Attachment: AVRO-1566.patch Attached proposed solution Use field default value to fill in missing values when serializing records -- Key: AVRO-1566 URL: https://issues.apache.org/jira/browse/AVRO-1566 Project: Avro Issue Type: Improvement Components: python Reporter: Jeno I. Hajdu Priority: Minor Attachments: AVRO-1566.patch Field default values according to the spec are meant to fill in fields present in the reader schema but missing from the writer schema during deserialization. In addition to that default values could be used in the writer schema to fill in missing values when serializing. This is already supported in the Java implementation through record builders. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: Alternative use of record field default values?
Thanks Sean, that's exactly what I was looking for. I have opened AVRO-1566 to cover this for python, also attached the patch. Regards, Jeno On Tue, Aug 12, 2014 at 3:51 PM, Sean Busbey bus...@cloudera.com wrote: If you use Record builders you will currently get this behavior in the java implementation[1]. AFAICT, there's no builder equivalent in the python implementation yet. In python maybe we can skip having a builder because we can distinguish between key maps to None from Dict does not contain key. Does that sound reasonable? Care to file a ticket and maybe propose a patch? [1]: http://avro.apache.org/docs/1.7.7/api/java/org/apache/avro/data/RecordBuilder.html or more likely the generic implementation: http://avro.apache.org/docs/1.7.7/api/java/org/apache/avro/generic/GenericRecordBuilder.html or an example of the builder in generated specific code: http://avro.apache.org/docs/current/gettingstartedjava.html#Creating+Users On Mon, Aug 11, 2014 at 2:33 PM, Jeno I. Hajdu jeno.i.ha...@gmail.com wrote: Hi, my understanding of the field default values (based on the spec) is that it is solely for filling in fields present in the reader schema, but missing in the writer schema, thus defaults only make sense in reader schemas. In addition to that couldn't defaults be used on the writer side (defined in the writer schema) to fill in fields with missing values? So if I have a record schema with 100 fields, all having defaults, I could specify only 5 field values for a record to be serialized and the Avro lib would fill in the rest. This does not impact the serialization (format) itself, the spec would only allow using defaults for this purpose (and for example adding this support to the python implementation takes 2 extra lines based on a quick trial). What do you think? Would this go against Avro's philosophy? Thanks and Regards, Jeno -- Sean
[jira] [Commented] (AVRO-1566) Use field default value to fill in missing values when serializing records
[ https://issues.apache.org/jira/browse/AVRO-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094602#comment-14094602 ]

Sean Busbey commented on AVRO-1566:
-----------------------------------

{noformat}
+ else:
+   self.write_data(field.type, None, encoder)
{noformat}

Worth a note that this maintains the behavior of allowing optional null fields that don't specify a default? (Even though this was probably incorrect behavior, [our getting started guide mentions relying on this behavior|http://avro.apache.org/docs/1.7.7/gettingstartedpython.html#Serializing+and+deserializing+without+code+generation].)

Actually, looking at io.py, did defaulting to None actually work, or did it throw?

Use field default value to fill in missing values when serializing records
--------------------------------------------------------------------------
Key: AVRO-1566
URL: https://issues.apache.org/jira/browse/AVRO-1566
Project: Avro
Issue Type: Improvement
Components: python
Reporter: Jeno I. Hajdu
Priority: Minor
Attachments: AVRO-1566.patch

Field default values according to the spec are meant to fill in fields present in the reader schema but missing from the writer schema during deserialization. In addition to that default values could be used in the writer schema to fill in missing values when serializing. This is already supported in the Java implementation through record builders.
[jira] [Updated] (AVRO-1566) Use field default value to fill in missing values when serializing records
[ https://issues.apache.org/jira/browse/AVRO-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeno I. Hajdu updated AVRO-1566: Attachment: (was: AVRO-1566.patch) Use field default value to fill in missing values when serializing records -- Key: AVRO-1566 URL: https://issues.apache.org/jira/browse/AVRO-1566 Project: Avro Issue Type: Improvement Components: python Reporter: Jeno I. Hajdu Priority: Minor Attachments: AVRO-1566.patch Field default values according to the spec are meant to fill in fields present in the reader schema but missing from the writer schema during deserialization. In addition to that default values could be used in the writer schema to fill in missing values when serializing. This is already supported in the Java implementation through record builders. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (AVRO-1566) Use field default value to fill in missing values when serializing records
[ https://issues.apache.org/jira/browse/AVRO-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeno I. Hajdu updated AVRO-1566: Attachment: AVRO-1566.patch Fixed proposal Use field default value to fill in missing values when serializing records -- Key: AVRO-1566 URL: https://issues.apache.org/jira/browse/AVRO-1566 Project: Avro Issue Type: Improvement Components: python Reporter: Jeno I. Hajdu Priority: Minor Attachments: AVRO-1566.patch Field default values according to the spec are meant to fill in fields present in the reader schema but missing from the writer schema during deserialization. In addition to that default values could be used in the writer schema to fill in missing values when serializing. This is already supported in the Java implementation through record builders. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (AVRO-1566) Use field default value to fill in missing values when serializing records
[ https://issues.apache.org/jira/browse/AVRO-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094848#comment-14094848 ]

Jeno I. Hajdu commented on AVRO-1566:
-------------------------------------

The patch I attached first was not complete; I have attached a new one which also covers the validation code.

I have checked what you mentioned, Sean. The original behaviour, which is also advised in the guide, worked perfectly: during writing the value is first validated, and for a null type None is expected, so when the field type is a union with null, the null branch gets picked up. But when the field is not optional, the value validation fails, resulting in an avro.io.AvroTypeException. Actually, write_record itself is not reached in that case.

I have extended validation to follow the extended logic of using defaults: if a value is available, validate it; if not and a default is available, we are OK (assuming the default value is validated on schema loading, which is not checked at the moment, there's only a TODO comment in the code ... perhaps I should add validating defaults); otherwise validate None (which keeps the optional field behaviour).

The else branch of write_record is not used in practice; I have only included it for the sake of completeness. I have tried both the Python 2 and 3 changes with default and optional fields.

Use field default value to fill in missing values when serializing records
--------------------------------------------------------------------------
Key: AVRO-1566
URL: https://issues.apache.org/jira/browse/AVRO-1566
Project: Avro
Issue Type: Improvement
Components: python
Reporter: Jeno I. Hajdu
Priority: Minor
Attachments: AVRO-1566.patch

Field default values according to the spec are meant to fill in fields present in the reader schema but missing from the writer schema during deserialization. In addition to that default values could be used in the writer schema to fill in missing values when serializing. This is already supported in the Java implementation through record builders.
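The three-way validation rule described in the last comment can be written out as a small Python sketch. It does not reproduce the attached patch or the real avro.io code: `toy_validate` is a deliberately tiny, hypothetical validator covering only "null", "string", "int" and unions, and fields are plain dicts rather than schema objects.

```python
def toy_validate(schema_type, value):
    """Hypothetical mini-validator: null, string, int and unions (lists) only."""
    if isinstance(schema_type, list):
        # A union is valid if any of its branches accepts the value.
        return any(toy_validate(branch, value) for branch in schema_type)
    if schema_type == "null":
        return value is None
    if schema_type == "string":
        return isinstance(value, str)
    if schema_type == "int":
        return isinstance(value, int) and not isinstance(value, bool)
    return False


def validate_field(field, datum, validate=toy_validate):
    """Apply the rule from the comment: value, then default, then None."""
    name = field["name"]
    if name in datum:
        # A value was supplied: validate it against the field type.
        return validate(field["type"], datum[name])
    if "default" in field:
        # No value but a default exists. Assumes the default itself was
        # checked when the schema was loaded, which this sketch does not do.
        return True
    # No value and no default: validate None, so optional fields
    # (unions containing null) still pass and required fields fail.
    return validate(field["type"], None)


required = {"name": "name", "type": "string"}
optional = {"name": "nickname", "type": ["string", "null"]}
defaulted = {"name": "age", "type": "int", "default": 0}

datum = {"name": "Alyssa"}
print([validate_field(f, datum) for f in (required, optional, defaulted)])
print(validate_field(required, {}))
```

With this ordering, an omitted optional field and an omitted field with a default are both accepted, while an omitted required field is rejected before any writing would happen.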