Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-06 Thread Chris Miller
For anyone running into this same issue: it looks like Avro deserialization
is simply broken when used with SparkSQL and manually partitioned tables. I
created a bug report with details and a simplified example of how to reproduce it:
https://issues.apache.org/jira/browse/SPARK-13709


--
Chris Miller

On Fri, Mar 4, 2016 at 12:11 AM, Chris Miller wrote:

> One more thing -- just to set aside any question about my specific schema
> or data, I used the sample schema and data record from Oracle's
> documentation on Avro support. It's a pretty simple schema:
> https://docs.oracle.com/cd/E26161_02/html/GettingStartedGuide/jsonbinding-overview.html
>
> When I create a table with this schema and then try to query the
> Avro-encoded record, I get the same type of error:
>
> 
>  org.apache.avro.AvroTypeException: Found avro.FullName, expecting union
> at
> org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
> at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
> at
> org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
> at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
> at
> org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
> at
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
> at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
> at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
> at
> org.apache.hadoop.hive.serde2.avro.AvroDeserializer$SchemaReEncoder.reencode(AvroDeserializer.java:111)
> at
> org.apache.hadoop.hive.serde2.avro.AvroDeserializer.deserialize(AvroDeserializer.java:175)
> at
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.deserialize(AvroSerDe.java:201)
> at
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:409)
> at
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:408)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to
> (TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 
>
> To me, this "feels" like a bug -- I just can't tell whether it's a Spark
> issue or an Avro issue. Decoding the same files works fine with Hive, and I
> imagine the same deserializer code is used there too.
>
> Thoughts?
>
> --
> Chris Miller
>
> On Thu, Mar 3, 2016 at 9:38 PM, Igor Berman  wrote:
>
>> your field name is
>> *enum1_values*
>>
>> but you have data
>> { "foo1": "test123", *"enum1"*: "BLUE" }
>>
>> i.e. since you defined enum and not union(null, enum)
>> it tries to find value for enum1_values and doesn't find one...
>>
>> On 3 March 2016 at 11:30, Chris Miller  wrote:
>>
>>> I've been digging into this a little deeper. Here's what I've found:
>>>
>>> test1.avsc:
>>> 
>>> {
>>>   "namespace": "com.cmiller",
>>>   "name": "test1",
>>>   "type": "record",
>>>   "fields": [
>>> { "name":"foo1", "type":"string" }
>>>   ]
>>> }
>>> 
>>>
>>> test2.avsc:
>>> 
>>> {
>>>   "namespace": 

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Chris Miller
One more thing -- just to set aside any question about my specific schema
or data, I used the sample schema and data record from Oracle's
documentation on Avro support. It's a pretty simple schema:
https://docs.oracle.com/cd/E26161_02/html/GettingStartedGuide/jsonbinding-overview.html

When I create a table with this schema and then try to query the
Avro-encoded record, I get the same type of error:


 org.apache.avro.AvroTypeException: Found avro.FullName, expecting union
at
org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
at
org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
at
org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
at
org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
at
org.apache.hadoop.hive.serde2.avro.AvroDeserializer$SchemaReEncoder.reencode(AvroDeserializer.java:111)
at
org.apache.hadoop.hive.serde2.avro.AvroDeserializer.deserialize(AvroDeserializer.java:175)
at
org.apache.hadoop.hive.serde2.avro.AvroSerDe.deserialize(AvroSerDe.java:201)
at
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:409)
at
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:408)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


To me, this "feels" like a bug -- I just can't tell whether it's a Spark
issue or an Avro issue. Decoding the same files works fine with Hive, and I
imagine the same deserializer code is used there too.

Thoughts?

--
Chris Miller

On Thu, Mar 3, 2016 at 9:38 PM, Igor Berman  wrote:

> your field name is
> *enum1_values*
>
> but you have data
> { "foo1": "test123", *"enum1"*: "BLUE" }
>
> i.e. since you defined enum and not union(null, enum)
> it tries to find value for enum1_values and doesn't find one...
>
> On 3 March 2016 at 11:30, Chris Miller  wrote:
>
>> I've been digging into this a little deeper. Here's what I've found:
>>
>> test1.avsc:
>> 
>> {
>>   "namespace": "com.cmiller",
>>   "name": "test1",
>>   "type": "record",
>>   "fields": [
>> { "name":"foo1", "type":"string" }
>>   ]
>> }
>> 
>>
>> test2.avsc:
>> 
>> {
>>   "namespace": "com.cmiller",
>>   "name": "test1",
>>   "type": "record",
>>   "fields": [
>> { "name":"foo1", "type":"string" },
>> { "name":"enum1", "type": { "type":"enum", "name":"enum1_values",
>> "symbols":["BLUE","RED", "GREEN"]} }
>>   ]
>> }
>> 
>>
>> test1.json (encoded and saved to test/test1.avro):
>> 
>> { "foo1": "test123" }
>> 
>>
>> test2.json (encoded and saved to test/test2.avro):
>> 
>> { "foo1": "test123", "enum1": "BLUE" }
>> 
>>
>> Here is how 

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Chris Miller
No, the name of the field is *enum1* -- *enum1_values* is the name of the
field's type. It should not be looking for enum1_values -- that's not how the
Avro specification says schema resolution works, and it's not how any other
implementation reads Avro data.

For what it's worth, if I change enum1 to enum1_values, the data fails to
encode (as it should):


$ avro-tools fromjson --schema-file=test.avsc test.json > test.avro
Exception in thread "main" org.apache.avro.AvroTypeException: Expected
field name not found: enum1
at org.apache.avro.io.JsonDecoder.doAction(JsonDecoder.java:477)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
at org.apache.avro.io.JsonDecoder.advance(JsonDecoder.java:139)
at org.apache.avro.io.JsonDecoder.readEnum(JsonDecoder.java:332)
at
org.apache.avro.io.ResolvingDecoder.readEnum(ResolvingDecoder.java:256)
at
org.apache.avro.generic.GenericDatumReader.readEnum(GenericDatumReader.java:199)
at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
at
org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
at
org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:99)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)


Any other ideas?


--
Chris Miller

On Thu, Mar 3, 2016 at 9:38 PM, Igor Berman  wrote:

> your field name is
> *enum1_values*
>
> but you have data
> { "foo1": "test123", *"enum1"*: "BLUE" }
>
> i.e. since you defined enum and not union(null, enum)
> it tries to find value for enum1_values and doesn't find one...
>
> On 3 March 2016 at 11:30, Chris Miller  wrote:
>
>> I've been digging into this a little deeper. Here's what I've found:
>>
>> test1.avsc:
>> 
>> {
>>   "namespace": "com.cmiller",
>>   "name": "test1",
>>   "type": "record",
>>   "fields": [
>> { "name":"foo1", "type":"string" }
>>   ]
>> }
>> 
>>
>> test2.avsc:
>> 
>> {
>>   "namespace": "com.cmiller",
>>   "name": "test1",
>>   "type": "record",
>>   "fields": [
>> { "name":"foo1", "type":"string" },
>> { "name":"enum1", "type": { "type":"enum", "name":"enum1_values",
>> "symbols":["BLUE","RED", "GREEN"]} }
>>   ]
>> }
>> 
>>
>> test1.json (encoded and saved to test/test1.avro):
>> 
>> { "foo1": "test123" }
>> 
>>
>> test2.json (encoded and saved to test/test2.avro):
>> 
>> { "foo1": "test123", "enum1": "BLUE" }
>> 
>>
>> Here is how I create the tables and add the data:
>>
>> 
>> CREATE TABLE test1
>> PARTITIONED BY (ds STRING)
>> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>> STORED AS INPUTFORMAT
>> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>> TBLPROPERTIES ('avro.schema.url'='hdfs:///path/to/schemas/test1.avsc');
>>
>> ALTER TABLE test1 ADD PARTITION (ds='1') LOCATION
>> 's3://spark-data/dev/test1';
>>
>>
>> CREATE TABLE test2
>> PARTITIONED BY (ds STRING)
>> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>> STORED AS INPUTFORMAT
>> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>> TBLPROPERTIES ('avro.schema.url'='hdfs:///path/to/schemas/test2.avsc');
>>
>> ALTER TABLE test2 ADD PARTITION (ds='1') LOCATION
>> 's3://spark-data/dev/test2';
>> 
>>
>> And here's what I get:
>>
>> 
>> SELECT * FROM test1;
>> -- works fine, shows data
>>
>> SELECT * FROM test2;
>>
>> org.apache.avro.AvroTypeException: Found com.cmiller.enum1_values,
>> expecting union
>> at
>> org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
>> at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>> at
>> org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
>> at
>> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
>> at
>> org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
>> at
>> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
>> at
>> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
>> at
>> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
>> at
>> org.apache.hadoop.hive.serde2.avro.AvroDeserializer$SchemaReEncoder.reencode(AvroDeserializer.java:111)
>>   

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Igor Berman
your field name is
*enum1_values*

but you have data
{ "foo1": "test123", *"enum1"*: "BLUE" }

i.e. since you defined enum and not union(null, enum)
it tries to find value for enum1_values and doesn't find one...
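
For reference, a nullable variant of that field would wrap the enum inside a
union with null -- this is only an illustration of what union(null, enum)
means, not what the schema in this thread actually declares:

{ "name": "enum1",
  "type": [ "null",
    { "type": "enum", "name": "enum1_values",
      "symbols": ["BLUE", "RED", "GREEN"] } ],
  "default": null }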

On 3 March 2016 at 11:30, Chris Miller  wrote:

> I've been digging into this a little deeper. Here's what I've found:
>
> test1.avsc:
> 
> {
>   "namespace": "com.cmiller",
>   "name": "test1",
>   "type": "record",
>   "fields": [
> { "name":"foo1", "type":"string" }
>   ]
> }
> 
>
> test2.avsc:
> 
> {
>   "namespace": "com.cmiller",
>   "name": "test1",
>   "type": "record",
>   "fields": [
> { "name":"foo1", "type":"string" },
> { "name":"enum1", "type": { "type":"enum", "name":"enum1_values",
> "symbols":["BLUE","RED", "GREEN"]} }
>   ]
> }
> 
>
> test1.json (encoded and saved to test/test1.avro):
> 
> { "foo1": "test123" }
> 
>
> test2.json (encoded and saved to test/test2.avro):
> 
> { "foo1": "test123", "enum1": "BLUE" }
> 
>
> Here is how I create the tables and add the data:
>
> 
> CREATE TABLE test1
> PARTITIONED BY (ds STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS INPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> TBLPROPERTIES ('avro.schema.url'='hdfs:///path/to/schemas/test1.avsc');
>
> ALTER TABLE test1 ADD PARTITION (ds='1') LOCATION
> 's3://spark-data/dev/test1';
>
>
> CREATE TABLE test2
> PARTITIONED BY (ds STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS INPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> TBLPROPERTIES ('avro.schema.url'='hdfs:///path/to/schemas/test2.avsc');
>
> ALTER TABLE test2 ADD PARTITION (ds='1') LOCATION
> 's3://spark-data/dev/test2';
> 
>
> And here's what I get:
>
> 
> SELECT * FROM test1;
> -- works fine, shows data
>
> SELECT * FROM test2;
>
> org.apache.avro.AvroTypeException: Found com.cmiller.enum1_values,
> expecting union
> at
> org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
> at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
> at
> org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
> at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
> at
> org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
> at
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
> at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
> at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
> at
> org.apache.hadoop.hive.serde2.avro.AvroDeserializer$SchemaReEncoder.reencode(AvroDeserializer.java:111)
> at
> org.apache.hadoop.hive.serde2.avro.AvroDeserializer.deserialize(AvroDeserializer.java:175)
> at
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.deserialize(AvroSerDe.java:201)
> at
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:409)
> at
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:408)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to
> (TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at 

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Chris Miller
I've been digging into this a little deeper. Here's what I've found:

test1.avsc:

{
  "namespace": "com.cmiller",
  "name": "test1",
  "type": "record",
  "fields": [
{ "name":"foo1", "type":"string" }
  ]
}


test2.avsc:

{
  "namespace": "com.cmiller",
  "name": "test1",
  "type": "record",
  "fields": [
{ "name":"foo1", "type":"string" },
{ "name":"enum1", "type": { "type":"enum", "name":"enum1_values",
"symbols":["BLUE","RED", "GREEN"]} }
  ]
}


test1.json (encoded and saved to test/test1.avro):

{ "foo1": "test123" }


test2.json (encoded and saved to test/test2.avro):

{ "foo1": "test123", "enum1": "BLUE" }


Here is how I create the tables and add the data:


CREATE TABLE test1
PARTITIONED BY (ds STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///path/to/schemas/test1.avsc');

ALTER TABLE test1 ADD PARTITION (ds='1') LOCATION
's3://spark-data/dev/test1';


CREATE TABLE test2
PARTITIONED BY (ds STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///path/to/schemas/test2.avsc');

ALTER TABLE test2 ADD PARTITION (ds='1') LOCATION
's3://spark-data/dev/test2';


And here's what I get:


SELECT * FROM test1;
-- works fine, shows data

SELECT * FROM test2;

org.apache.avro.AvroTypeException: Found com.cmiller.enum1_values,
expecting union
at
org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
at
org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
at
org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
at
org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
at
org.apache.hadoop.hive.serde2.avro.AvroDeserializer$SchemaReEncoder.reencode(AvroDeserializer.java:111)
at
org.apache.hadoop.hive.serde2.avro.AvroDeserializer.deserialize(AvroDeserializer.java:175)
at
org.apache.hadoop.hive.serde2.avro.AvroSerDe.deserialize(AvroSerDe.java:201)
at
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:409)
at
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:408)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


In addition to the above, I also tried putting the test Avro files on HDFS
instead of S3 -- the 

Avro SerDe Issue w/ Manual Partitions?

2016-03-02 Thread Chris Miller
Hi,

I have a strange issue occurring when I use manual partitions.

If I create a table as follows, I am able to query the data with no problem:


CREATE TABLE test1
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://analytics-bucket/prod/logs/avro/2016/03/02/'
TBLPROPERTIES ('avro.schema.url'='hdfs:///data/schemas/schema.avsc');


If I create the table like this, however, and then add a partition with a
LOCATION specified, I am unable to query:


CREATE TABLE test2
PARTITIONED BY (ds STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///data/schemas/schema.avsc');

ALTER TABLE test2 ADD PARTITION (ds='1') LOCATION
's3://analytics-bucket/prod/logs/avro/2016/03/02/';


This is what happens:


SELECT * FROM test2 LIMIT 1;

org.apache.avro.AvroTypeException: Found ActionEnum, expecting union
at
org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
at
org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
at
org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
at
org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
at
org.apache.hadoop.hive.serde2.avro.AvroDeserializer$SchemaReEncoder.reencode(AvroDeserializer.java:111)
at
org.apache.hadoop.hive.serde2.avro.AvroDeserializer.deserialize(AvroDeserializer.java:175)
at
org.apache.hadoop.hive.serde2.avro.AvroSerDe.deserialize(AvroSerDe.java:201)
at
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:409)
at
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:408)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


The data is exactly the same, and I can still go back and query the test1
table without issue. I don't have control over the directory structure, so
I need to add the partitions manually so that I can specify a location.

For what it's worth, "ActionEnum" is the first field in my schema. This
same table and query structure works fine with Hive. When I try to run this
with SparkSQL, however, I get the above error.
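
In case it helps, here is roughly how such a query would be issued from
SparkSQL (Spark 1.x-era API). This driver is only an illustrative sketch --
the application name and setup are assumptions, not my actual job:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object AvroPartitionTest {
  def main(args: Array[String]): Unit = {
    // Illustrative driver only -- the real job setup is not shown in this thread.
    val sc = new SparkContext(new SparkConf().setAppName("avro-partition-test"))
    val hiveContext = new HiveContext(sc)

    // The unpartitioned table reads without issue...
    hiveContext.sql("SELECT * FROM test1 LIMIT 1").show()

    // ...but the manually partitioned table fails with
    // org.apache.avro.AvroTypeException: Found ActionEnum, expecting union
    hiveContext.sql("SELECT * FROM test2 LIMIT 1").show()

    sc.stop()
  }
}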

Anyone have any idea what the problem is here? Thanks!

--
Chris Miller