[
https://issues.apache.org/jira/browse/PIG-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662431#comment-13662431
]
Scott Carey commented on PIG-3323:
----------------------------------
Avro uses a field's default value while reading, in order to resolve
differences between a "writer's" schema and a "reader's" schema.
If your reader schema declares a field (say, a union of int and null) and you
read a file whose schema does not contain that field, Avro substitutes the
default value for the *missing field* in the written data, if the default
exists. If the default does not exist in the reader's schema, an error
is thrown.
The only way that a default value in the persisted file itself will matter is
if you derive the reader's schema from the contents of a file. Consider:
Two files.
The first has a schema that is one record with two fields:
* field "f1" of type int
* field "f2" of type union of int and null
The second file has a schema that is one record with two fields:
* field "f1" of type int
* field "f3" of type union of null and float, default null
You can read the first file using the schema of the second one. All "f2"'s
will be ignored, and all "f3"'s will use the default value of null since "f3"
does not exist in the first file.
You cannot use the schema from the first file to read the second file, since
the second file does not contain "f2" and there is no default value to
substitute.
In both cases, "f1" is compatible.
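The two-file case can be written out as Avro JSON schemas and checked with a
small Python sketch. The record name "rec" and the helper unreadable_fields are
assumptions made for this example; the check only looks at field presence and
defaults and does not validate type compatibility the way Avro itself does.

```python
import json

# Schema of the first file ("rec" is an assumed record name).
FILE1_SCHEMA = json.loads("""
{"type": "record", "name": "rec", "fields": [
  {"name": "f1", "type": "int"},
  {"name": "f2", "type": ["int", "null"]}
]}
""")

# Schema of the second file; "f3" carries a default of null.
FILE2_SCHEMA = json.loads("""
{"type": "record", "name": "rec", "fields": [
  {"name": "f1", "type": "int"},
  {"name": "f3", "type": ["null", "float"], "default": null}
]}
""")


def unreadable_fields(reader, writer):
    """Return reader fields that the writer lacks and that have no default."""
    written = {field["name"] for field in writer["fields"]}
    return [
        field["name"]
        for field in reader["fields"]
        if field["name"] not in written and "default" not in field
    ]


# Reading the first file with the second file's schema: nothing is blocked.
print(unreadable_fields(FILE2_SCHEMA, FILE1_SCHEMA))  # []
# Reading the second file with the first file's schema: "f2" cannot be filled.
print(unreadable_fields(FILE1_SCHEMA, FILE2_SCHEMA))  # ['f2']
```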
Otherwise, if all files being read together by the loader have the same schema,
and that schema matches the reader's schema, then defaults are not used.
There is one other place that defaults are used, but it is not part of the spec
or involved in serialization -- the Avro SpecificRecord and GenericRecord
builder objects can use the default values when creating objects.
> AVRO: default value not stored in file when given as parameter to AvroStorage
> -----------------------------------------------------------------------------
>
> Key: PIG-3323
> URL: https://issues.apache.org/jira/browse/PIG-3323
> Project: Pig
> Issue Type: Bug
> Components: piggybank
> Affects Versions: 0.11.2
> Reporter: Egil Sorensen
> Assignee: Viraj Bhat
> Labels: patch
> Fix For: 0.12, 0.11.2
>
>
> A Pig script like the one below succeeds, but when I inspect the resulting
> file, I find that the schema is stripped of the default value specification.
> {code}
> a = load ':INPATH:/types/numbers.txt' using PigStorage(':') as (intnum1000:
> int,id: int,intnum5: int,intnum100: int,intnum: int,longnum: long,floatnum:
> float,doublenum: double);
> b2 = foreach a generate id, intnum5, intnum100;
> c2 = filter b2 by 110 <= id and id < 120;
> describe c2;
> dump c2;
> store c2 into ':OUTPATH:.intermediate_2' USING
> org.apache.pig.piggybank.storage.avro.AvroStorage('
> {
> "debug" : 5,
> "schema" : {
> "name" : "schema_2",
> "type" : "record",
> "fields" : [
> {
> "name" : "id",
> "type" : [
> "null",
> "int"
> ]
> },
> {
> "name" : "intnum5",
> "type" : [
> "null",
> "int"
> ]
> },
> {
> "name" : "intnum100",
> "type" : [
> "null",
> "int"
> ],
> "default" : 0
> }
> ]
> }
> }
> ');
> {code}
> BTW, the documentation on https://cwiki.apache.org/PIG/avrostorage.html is
> silent on the subject of defaults, so my first question is: is my expectation
> that the default will be written to the file incorrect?