Re: How to control the number of files for dynamic partition in Spark SQL?

2016-02-01 Thread Benyi Wang
Thanks Deenar, both methods work.

I actually tried the second method in spark-shell earlier, but it didn't work
at the time. The reason might be that I registered the data frame eventwk as a
temporary table first, then repartitioned it and registered the table again.
Unfortunately I could not reproduce the problem.

Thanks again.

On Sat, Jan 30, 2016 at 1:25 AM, Deenar Toraskar 
wrote:

> The following should work as long as your tables are created using Spark
> SQL
>
> event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").insertInto("event")
>
> If you want to stick to using "insert overwrite" for Hive compatibility,
> then you can repartition the DataFrame explicitly instead of setting the
> global spark.sql.shuffle.partitions parameter:
>
> // the joins run with the global shuffle partitions setting
> val eventwk = sqlContext.sql("some joins")
> val eventwkRepartitioned = eventwk.repartition(2)
> eventwkRepartitioned.registerTempTable("event_wk_repartitioned")
> and use this in your insert statement.
>
> Registering a temp table is cheap.
>
> HTH
>
>
> On 29 January 2016 at 20:26, Benyi Wang  wrote:
>
>> I want to insert into a partition table using dynamic partition, but I
>> don’t want to have 200 files for a partition because the files will be
>> small for my case.
>>
>> sqlContext.sql(  """
>> |insert overwrite table event
>> |partition(eventDate)
>> |select
>> | user,
>> | detail,
>> | eventDate
>> |from event_wk
>>   """.stripMargin)
>>
>> the table “event_wk” is registered from a dataframe (built with some joins)
>> by registerTempTable. If I set spark.sql.shuffle.partitions=2, the joins'
>> performance will be bad because that property seems to be global.
>>
>> I can do something like this:
>>
>> event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").save(path)
>>
>> but I have to handle adding partitions by myself.
>>
>> Is there a way you can control the number of files just for this last
>> insert step?
>>
>
>


Re: How to control the number of files for dynamic partition in Spark SQL?

2016-01-30 Thread Deenar Toraskar
The following should work as long as your tables are created using Spark SQL

event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").insertInto("event")

If you want to stick to using "insert overwrite" for Hive compatibility,
then you can repartition the DataFrame explicitly instead of setting the
global spark.sql.shuffle.partitions parameter:

// the joins run with the global shuffle partitions setting
val eventwk = sqlContext.sql("some joins")
val eventwkRepartitioned = eventwk.repartition(2)
eventwkRepartitioned.registerTempTable("event_wk_repartitioned")
and use this in your insert statement, for example:
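Something along these lines — just a sketch that reuses the table and column
names from your original mail:

sqlContext.sql(
  """
    |insert overwrite table event
    |partition(eventDate)
    |select
    |  user,
    |  detail,
    |  eventDate
    |from event_wk_repartitioned
  """.stripMargin)

Since event_wk_repartitioned has only two partitions, each eventDate value
should end up with at most two files.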

Registering a temp table is cheap.

HTH


On 29 January 2016 at 20:26, Benyi Wang  wrote:

> I want to insert into a partition table using dynamic partition, but I
> don’t want to have 200 files for a partition because the files will be
> small for my case.
>
> sqlContext.sql(  """
> |insert overwrite table event
> |partition(eventDate)
> |select
> | user,
> | detail,
> | eventDate
> |from event_wk
>   """.stripMargin)
>
> the table “event_wk” is registered from a dataframe (built with some joins)
> by registerTempTable. If I set spark.sql.shuffle.partitions=2, the joins'
> performance will be bad because that property seems to be global.
>
> I can do something like this:
>
> event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").save(path)
>
> but I have to handle adding partitions by myself.
>
> Is there a way you can control the number of files just for this last
> insert step?
>


How to control the number of files for dynamic partition in Spark SQL?

2016-01-29 Thread Benyi Wang
I want to insert into a partition table using dynamic partition, but I
don’t want to have 200 files for a partition because the files will be
small for my case.

sqlContext.sql(  """
|insert overwrite table event
|partition(eventDate)
|select
| user,
| detail,
| eventDate
|from event_wk
  """.stripMargin)

the table “event_wk” is registered from a dataframe (built with some joins)
by registerTempTable. If I set spark.sql.shuffle.partitions=2, the joins'
performance will be bad because that property seems to be global.

I can do something like this:

event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").save(path)

but I have to handle adding partitions by myself.
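
For reference, "handling it myself" after the save would mean registering each
new partition with the metastore by hand, roughly like this (assuming
sqlContext is a HiveContext; the eventDate value and location are only
illustrative):

// add each newly written partition to the Hive metastore manually
sqlContext.sql(
  "alter table event add if not exists partition (eventDate='2016-01-29') " +
  s"location '$path/eventDate=2016-01-29'")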

Is there a way you can control the number of files just for this last
insert step?