Re: How to control the number of files for dynamic partition in Spark SQL?
Thanks Deenar, both methods work. I actually tried the second method in spark-shell, but it didn't work at that time. The reason might be that I registered the data frame eventwk as a temporary table, repartitioned, and then registered the table again. Unfortunately I could not reproduce it. Thanks again.

On Sat, Jan 30, 2016 at 1:25 AM, Deenar Toraskar wrote:

> The following should work as long as your tables are created using Spark SQL:
>
>     event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").insertInto("event")
>
> If you want to stick to using "insert overwrite" for Hive compatibility,
> then you can repartition twice, instead of setting the global
> spark.sql.shuffle.partitions parameter:
>
>     val eventwk = sqlContext.sql("some joins") // this uses the global shuffle partition parameter
>     val eventwkRepartitioned = eventwk.repartition(2)
>     eventwkRepartitioned.registerTempTable("event_wk_repartitioned")
>
> and use this in your insert statement. Registering a temp table is cheap.
>
> HTH
>
> On 29 January 2016 at 20:26, Benyi Wang wrote:
>
>> I want to insert into a partitioned table using dynamic partitions, but I
>> don't want to end up with 200 files per partition, because the files will be
>> small in my case.
>>
>>     sqlContext.sql(
>>       """
>>         |insert overwrite table event
>>         |partition(eventDate)
>>         |select
>>         |  user,
>>         |  detail,
>>         |  eventDate
>>         |from event_wk
>>       """.stripMargin)
>>
>> The table "event_wk" is created from a dataframe by registerTempTable,
>> and is built with some joins. If I set spark.sql.shuffle.partitions=2, the
>> joins' performance will be bad because that property is global.
>>
>> I can do something like this:
>>
>>     event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").save(path)
>>
>> but then I have to handle adding partitions myself.
>>
>> Is there a way to control the number of files just for this last insert step?
Re: How to control the number of files for dynamic partition in Spark SQL?
The following should work as long as your tables are created using Spark SQL:

    event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").insertInto("event")

If you want to stick to using "insert overwrite" for Hive compatibility, then you can repartition twice, instead of setting the global spark.sql.shuffle.partitions parameter:

    val eventwk = sqlContext.sql("some joins") // this uses the global shuffle partition parameter
    val eventwkRepartitioned = eventwk.repartition(2)
    eventwkRepartitioned.registerTempTable("event_wk_repartitioned")

and use this in your insert statement. Registering a temp table is cheap.

HTH

On 29 January 2016 at 20:26, Benyi Wang wrote:

> I want to insert into a partitioned table using dynamic partitions, but I
> don't want to end up with 200 files per partition, because the files will be
> small in my case.
>
>     sqlContext.sql(
>       """
>         |insert overwrite table event
>         |partition(eventDate)
>         |select
>         |  user,
>         |  detail,
>         |  eventDate
>         |from event_wk
>       """.stripMargin)
>
> The table "event_wk" is created from a dataframe by registerTempTable,
> and is built with some joins. If I set spark.sql.shuffle.partitions=2, the
> joins' performance will be bad because that property is global.
>
> I can do something like this:
>
>     event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").save(path)
>
> but then I have to handle adding partitions myself.
>
> Is there a way to control the number of files just for this last insert step?
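[Editor's note: the "repartition twice" approach above can be sketched end to end as follows. This is a sketch for the Spark 1.x API used in this thread (SQLContext, registerTempTable); the join query is left elided as in the original, and it assumes a Hive table `event` partitioned by `eventDate` already exists. It needs a running Spark cluster with Hive support, so it is not independently runnable here.]

```scala
// Sketch: keep the joins parallel, but shrink only the final insert.
// Assumes an existing SQLContext `sqlContext` (Spark 1.x) and a Hive table
// `event(user, detail)` partitioned by `eventDate`.

// 1. Run the joins under the default spark.sql.shuffle.partitions (200),
//    so join parallelism is unaffected.
val eventwk = sqlContext.sql("some joins") // elided, as in the original post

// 2. Repartition only the final result down to 2 partitions, so each
//    dynamic partition gets at most 2 files.
val eventwkRepartitioned = eventwk.repartition(2)

// 3. Registering a temp table only records the logical plan, so it is cheap.
eventwkRepartitioned.registerTempTable("event_wk_repartitioned")

// 4. Insert from the repartitioned temp table instead of event_wk.
sqlContext.sql(
  """
    |insert overwrite table event
    |partition(eventDate)
    |select
    |  user,
    |  detail,
    |  eventDate
    |from event_wk_repartitioned
  """.stripMargin)
```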
How to control the number of files for dynamic partition in Spark SQL?
I want to insert into a partitioned table using dynamic partitions, but I don't want to end up with 200 files per partition, because the files will be small in my case.

    sqlContext.sql(
      """
        |insert overwrite table event
        |partition(eventDate)
        |select
        |  user,
        |  detail,
        |  eventDate
        |from event_wk
      """.stripMargin)

The table "event_wk" is created from a dataframe by registerTempTable, and is built with some joins. If I set spark.sql.shuffle.partitions=2, the joins' performance will be bad because that property is global.

I can do something like this:

    event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").save(path)

but then I have to handle adding partitions myself.

Is there a way to control the number of files just for this last insert step?
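[Editor's note: for reference, the writer-based workaround mentioned in the question would look roughly like this. It is a sketch only: the output path and the example `eventDate` value are illustrative, and the ALTER TABLE step is exactly the manual partition registration the poster wants to avoid. It requires a running Spark cluster with a Hive metastore.]

```scala
// Sketch: write Parquet directly with the DataFrame writer (Spark 1.x),
// then register each new partition directory with the Hive metastore by hand.

val path = "/warehouse/event" // illustrative output path

event_wk.repartition(2)        // 2 shuffle partitions => few files per eventDate
  .write
  .partitionBy("eventDate")    // creates eventDate=<value>/ subdirectories
  .format("parquet")
  .save(path)

// The manual step that "insert overwrite ... partition(eventDate)" would
// otherwise do for you: one ALTER TABLE per new eventDate value.
sqlContext.sql(
  s"alter table event add if not exists " +
  s"partition (eventDate='2016-01-29') " +      // illustrative partition value
  s"location '$path/eventDate=2016-01-29'")
```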