Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-06 Thread Rex Xiong
spark.sql.hive.convertMetastoreParquet is true. I can't repro the issue of
scanning all partitions now. :P

Anyway, I found another email thread, "Re: Spark Sql behaves strangely with
tables with a lot of partitions".

I observe the same issue as Jerrick: the Spark driver calls listStatus on
the whole table folder even when I load a pure Hive table (not created by
Spark), and that takes noticeable time for a large partitioned table.

Even though spark.sql.parquet.cacheMetadata is true, this listStatus runs
before every query; it is not cached.
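
A minimal spark-shell sketch of what I am checking (Spark 1.5-era API; the
table name "events" and the partition value are hypothetical):

    // Confirm the two settings discussed in this thread:
    println(sqlContext.getConf("spark.sql.hive.convertMetastoreParquet", "<unset>"))
    println(sqlContext.getConf("spark.sql.parquet.cacheMetadata", "<unset>"))

    // Even with cacheMetadata = true, the driver still runs listStatus on the
    // table folder before each query; the cached metadata covers Parquet
    // footers/schema, not the directory listing itself.
    sqlContext.sql("SELECT count(*) FROM events WHERE day = '2015-11-05'").show()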



2015-11-05 8:50 GMT+08:00 Cheng Lian:

> Is there any chance that "spark.sql.hive.convertMetastoreParquet" is
> turned off?
>
> Cheng
>
> On 11/4/15 5:15 PM, Rex Xiong wrote:
>
> Thanks Cheng Lian.
> I found that in 1.5, if I use Spark to create this table with partition
> discovery, partition pruning can be performed; but for my old table
> defined in pure Hive, the execution plan does a Parquet scan across all
> partitions, and it runs very slowly.
> Looks like the execution plan optimization is different.
>
> 2015-11-03 23:10 GMT+08:00 Cheng Lian:
>
>> SPARK-11153 should be irrelevant because you are filtering on a partition
>> key while SPARK-11153 is about Parquet filter push-down and doesn't affect
>> partition pruning.
>>
>> Cheng
>>
>>
>> On 11/3/15 7:14 PM, Rex Xiong wrote:
>>
>> We found the query performance is very poor due to this issue:
>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-11153
>> We usually filter on the partition key (a date stored as a string); in
>> 1.3.1 this works great.
>> But in 1.5, it does a Parquet scan across all partitions.
>> 2015-10-31 7:38 PM, "Rex Xiong" wrote:
>>
>>> Added this thread back to the email list; forgot to reply-all.
>>> 2015-10-31 7:23 PM, "Michael Armbrust" <mich...@databricks.com> wrote:
>>>
 Not that I know of.

 On Sat, Oct 31, 2015 at 12:22 PM, Rex Xiong <bycha...@gmail.com> wrote:

> Good to know that, will have a try.
> So there is no easy way to achieve it with a pure Hive approach?
> 2015-10-31 7:17 PM, "Michael Armbrust" wrote:
>
>> Yeah, this was rewritten to be faster in Spark 1.5.  We use it with
>> 10,000s of partitions.
>>
>> On Sat, Oct 31, 2015 at 7:17 AM, Rex Xiong <bycha...@gmail.com> wrote:
>>
>>> 1.3.1
>>> Is there a lot of improvement in 1.5+?
>>>
>>> 2015-10-30 19:23 GMT+08:00 Michael Armbrust <mich...@databricks.com>:
>>>
 > We have tried the schema merging feature, but it's too slow; there are
 > hundreds of partitions.

 Which version of Spark?



Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-04 Thread Cheng Lian
Is there any chance that "spark.sql.hive.convertMetastoreParquet" is
turned off?


Cheng


Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-03 Thread Cheng Lian
SPARK-11153 should be irrelevant because you are filtering on a 
partition key while SPARK-11153 is about Parquet filter push-down and 
doesn't affect partition pruning.
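
A hedged sketch of the distinction (the table "events", its string partition
key "day", and the data column "user_id" are hypothetical names):

    // Filtering on the partition key is resolved by partition pruning at
    // planning time, before any Parquet file is opened:
    sqlContext.sql("SELECT * FROM events WHERE day = '2015-11-03'").explain(true)

    // Filtering on a data column relies on Parquet filter push-down instead,
    // which is what SPARK-11153 and spark.sql.parquet.filterPushdown concern:
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
    sqlContext.sql("SELECT * FROM events WHERE user_id = 42").explain(true)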


Cheng


Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-03 Thread Rex Xiong
We found the query performance is very poor due to this issue:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-11153
We usually filter on the partition key (a date stored as a string); in
1.3.1 this works great.
But in 1.5, it does a Parquet scan across all partitions.
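
A hedged sketch of how we observe this (the table name is hypothetical); in
1.3.1 the plan touched only the matching partition folder, while in 1.5 it
scans them all:

    // EXPLAIN shows which partition paths the Parquet scan will read; with
    // pruning working, only the day='2015-10-30' directory should appear.
    val df = sqlContext.sql("SELECT count(*) FROM events WHERE day = '2015-10-30'")
    df.explain(true)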

Re: Issue of Hive parquet partitioned table schema mismatch

2015-10-31 Thread Rex Xiong
Added this thread back to the email list; forgot to reply-all.

Re: Issue of Hive parquet partitioned table schema mismatch

2015-10-30 Thread Jörn Franke
What Storage Format?





Issue of Hive parquet partitioned table schema mismatch

2015-10-30 Thread Rex Xiong
Hi folks,

I have a Hive external table with partitions.
Every day, an app generates a new partition (day=yyyy-MM-dd) stored as
Parquet and runs an add-partition Hive command.
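
Roughly like this hedged sketch (table and path names are hypothetical),
run through a HiveContext:

    // Register the day's new Parquet folder as a partition of the external table:
    sqlContext.sql("""
      ALTER TABLE events ADD IF NOT EXISTS PARTITION (day = '2015-10-30')
      LOCATION '/data/events/day=2015-10-30'
    """)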
In some cases, we add an additional column to new partitions and update the
Hive table schema; then a query across new and old partitions fails with
this exception:

org.apache.hive.service.cli.HiveSQLException:
org.apache.spark.sql.AnalysisException: cannot resolve 'newcolumn' given
input columns 

We have tried the schema merging feature, but it's too slow; there are
hundreds of partitions.
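
What we tried is, roughly, the Parquet mergeSchema option (Spark 1.5 syntax;
the path is hypothetical):

    // Schema merging reconciles the footers of Parquet files across all
    // partitions, which is why it slows down with hundreds of partitions.
    val merged = sqlContext.read
      .option("mergeSchema", "true")
      .parquet("/data/events")
    merged.printSchema()  // rows from old partitions yield null for the new column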
Is it possible to bypass this schema check and return a default value (such
as null) for missing columns?

Thank you


Re: Issue of Hive parquet partitioned table schema mismatch

2015-10-30 Thread Michael Armbrust
>
> We have tried the schema merging feature, but it's too slow; there are
> hundreds of partitions.
>
Which version of Spark?