[
https://issues.apache.org/jira/browse/SPARK-18077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
J.P Feng closed SPARK-18077.
----------------------------
Resolution: Won't Fix
I will open another issue, since there are some mistakes in this one.
> Running insert overwrite statements in Spark to overwrite a partitioned table
> is very slow
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-18077
> URL: https://issues.apache.org/jira/browse/SPARK-18077
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Environment: spark 2.0
> hive 2.0.1
> driver memory: 4g
> total executors: 4
> executor memory: 10g
> total cores: 13
> Reporter: J.P Feng
> Labels: hive, insert, sparkSQL
> Original Estimate: 120h
> Remaining Estimate: 120h
>
> Hello, all. I am facing a strange issue in my project.
> There is a table:
> CREATE TABLE `login4game`(`account_name` string, `role_id` string,
> `server_id` string, `recdate` string)
> PARTITIONED BY (`pt` string, `dt` string) stored as orc;
> another table:
> CREATE TABLE `tbllog_login`(`server` string,`role_id` bigint, `account_name`
> string, `happened_time` int)
> PARTITIONED BY (`pt` string, `dt` string)
> ----------------------------------------------------------------------------------------------
> Test-1:
> I executed the following SQL in spark-shell or spark-sql (before running it,
> there is already a lot of data in partition (pt='mix_en', dt='2016-10-21') of
> table login4game):
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
> select distinct account_name,role_id,server,'1476979200' as recdate from
> tbllog_login where pt='mix_en' and dt='2016-10-21'
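> (For reference, this is roughly how I issue the statement from spark-shell; a
> minimal sketch, assuming the session is created with Hive support enabled:)
>
> import org.apache.spark.sql.SparkSession
>
> // spark-shell already provides a Hive-enabled session named spark;
> // the builder is shown here only to make the sketch self-contained.
> val spark = SparkSession.builder()
>   .appName("insert-overwrite-test")
>   .enableHiveSupport()
>   .getOrCreate()
>
> spark.sql(
>   """INSERT OVERWRITE TABLE login4game PARTITION (pt='mix_en', dt='2016-10-21')
>     |SELECT DISTINCT account_name, role_id, server, '1476979200' AS recdate
>     |FROM tbllog_login
>     |WHERE pt='mix_en' AND dt='2016-10-21'""".stripMargin)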
> It takes a lot of time; below is part of the logs:
> /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> [Stage 5:=======================================> (144 + 8) /
> 200]15127.974: [GC [PSYoungGen: 587153K->103638K(572416K)]
> 893021K->412112K(1259008K), 0.0740800 secs] [Times: user=0.18 sys=0.00,
> real=0.08 secs]
> [Stage 5:=========================================> (152 + 8) /
> 200]15128.441: [GC [PSYoungGen: 564438K->82692K(580096K)]
> 872912K->393836K(1266688K), 0.0808380 secs] [Times: user=0.16 sys=0.00,
> real=0.08 secs]
> [Stage 5:============================================> (160 + 8) /
> 200]15128.854: [GC [PSYoungGen: 543297K->28369K(573952K)]
> 854441K->341282K(1260544K), 0.0674920 secs] [Times: user=0.12 sys=0.00,
> real=0.07 secs]
> [Stage 5:================================================> (176 + 8) /
> 200]15129.152: [GC [PSYoungGen: 485073K->40441K(497152K)]
> 797986K->353651K(1183744K), 0.0588420 secs] [Times: user=0.15 sys=0.00,
> real=0.06 secs]
> [Stage 5:================================================> (177 + 8) /
> 200]15129.460: [GC [PSYoungGen: 496966K->50692K(579584K)]
> 810176K->364126K(1266176K), 0.0555160 secs] [Times: user=0.15 sys=0.00,
> real=0.06 secs]
> [Stage 5:====================================================> (192 + 8) /
> 200]15129.777: [GC [PSYoungGen: 508420K->57213K(515072K)]
> 821854K->371717K(1201664K), 0.0641580 secs] [Times: user=0.16 sys=0.00,
> real=0.06 secs]
> Moved:
> 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000'
> to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved:
> 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001'
> to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved:
> 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002'
> to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved:
> 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003'
> to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved:
> 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004'
> to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> .......
> Moved:
> 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199'
> to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> I can see that the original data is moved to .Trash.
> Then there is no log output, and after about 10 minutes the logs appear
> again:
> /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> 16/10/24 17:24:15 INFO Hive: Replacing
> src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00000,
> dest:
> hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000,
> Status:true
> 16/10/24 17:24:15 INFO Hive: Replacing
> src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00001,
> dest:
> hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001,
> Status:true
> 16/10/24 17:24:15 INFO Hive: Replacing
> src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00002,
> dest:
> hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002,
> Status:true
> 16/10/24 17:24:15 INFO Hive: Replacing
> src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00003,
> dest:
> hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003,
> Status:true
> 16/10/24 17:24:15 INFO Hive: Replacing
> src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00004,
> dest:
> hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004,
> Status:true
> ....
> 16/10/24 17:24:15 INFO Hive: Replacing
> src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00199,
> dest:
> hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199,
> Status:true
> /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> After that, the SQL execution finishes, but it takes over 10 minutes to
> handle 257,128 rows of data.
> ---------------------------------------------------------------------------------------------
> Test-2:
> If I remove all data from partition (pt='mix_en', dt='2016-10-21') of table
> login4game and run this SQL:
> ALTER TABLE login4game DROP IF EXISTS PARTITION (pt='mix_en',
> dt='2016-10-21');
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
> select distinct account_name,role_id,server,'1476979200' as recdate from
> tbllog_login where pt='mix_en' and dt='2016-10-21'
> The logs are the same as in Test-1: the GC output, followed by the same
> "Moved ... to trash at: hdfs://master.com/user/hadoop/.Trash/Current" lines
> for part-00000 through part-00199.
> But after the files are moved to trash, I do not need to wait 10 minutes; the
> next log appears immediately.
> The SQL finishes in about 20 seconds, and I get the same result as the run
> above (the drop-partition statement takes about 5 seconds).
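> For comparison, this is roughly how the Test-2 sequence can be issued and
> timed from spark-shell (again just a sketch, reusing the spark session from
> above; the timing is only a rough wall-clock measurement):
>
> // Rough wall-clock timing helper for comparing Test-1 and Test-2.
> def timed[T](label: String)(body: => T): T = {
>   val start = System.nanoTime()
>   val result = body
>   println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
>   result
> }
>
> // Test-2: drop the partition first, then run the same insert overwrite.
> timed("drop partition") {
>   spark.sql("ALTER TABLE login4game DROP IF EXISTS PARTITION (pt='mix_en', dt='2016-10-21')")
> }
> timed("insert overwrite") {
>   spark.sql(
>     """INSERT OVERWRITE TABLE login4game PARTITION (pt='mix_en', dt='2016-10-21')
>       |SELECT DISTINCT account_name, role_id, server, '1476979200' AS recdate
>       |FROM tbllog_login
>       |WHERE pt='mix_en' AND dt='2016-10-21'""".stripMargin)
> }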
> ----------------------------------------------------------------------------------------------
> Test-3:
> Now, if I run the same SQL against the Hive server (not through spark-sql or
> spark-shell, but in beeline or the hive client):
> Before running, I load the same data into login4game in
> partition (pt='mix_en', dt='2016-10-21'), and then I run this SQL:
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
> select distinct account_name,role_id,server,'1476979200' as recdate from
> tbllog_login where pt='mix_en' and dt='2016-10-21'
> It finishes in about 20 seconds and I get the same result.
> ==========================================================
> Now, the question is: why is it so slow when I use spark-sql to execute an
> insert overwrite statement against a partitioned table that already contains
> data (if there is no data in the selected partitions, it runs fast)?
> Important points:
> 1. A lot of time is spent on the "move to trash" step when there is much data
> (see the diagnostic sketch after this list).
> 2. Running the same SQL in beeline is fast.
> 3. Dropping the data from the selected partitions before executing the insert
> overwrite statement speeds up the execution.
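> Related to point 1: the "Moved ... to trash at: .../.Trash/Current" lines show
> that the HDFS trash feature is involved. As a diagnostic only (a minimal
> sketch; whether the slow move actually depends on this setting is an
> assumption I have not verified), the effective trash interval can be read
> from spark-shell:
>
> // fs.trash.interval is the standard Hadoop trash setting, in minutes (0 disables trash).
> // Assumption, not verified: that the slow "move to trash" step observed above honors this setting.
> val trashInterval = spark.sparkContext.hadoopConfiguration.get("fs.trash.interval", "0")
> println(s"fs.trash.interval = $trashInterval")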
> Can anyone help me? Thanks a lot!
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)