rshanmugam1 opened a new issue, #7244:
URL: https://github.com/apache/hudi/issues/7244
This is my dbt model. If I run it twice, it creates duplicates. Am I missing
any obvious configuration?
```
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    file_format = 'hudi',
    options = {
        'type': 'cow',
        'primaryKey': 'id',
    },
    unique_key = 'id',
) }}
{% if not is_incremental() %}
select cast(1 as bigint) as id, 'yo' as msg
union all
select cast(2 as bigint) as id, 'anyway' as msg
union all
select cast(3 as bigint) as id, 'yo' as msg
{% else %}
select cast(1 as bigint) as id, 'yo2' as msg
union all
select cast(2 as bigint) as id, 'anyway1' as msg
union all
select cast(3 as bigint) as id, 'yo3' as msg
{% endif %}
```
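One setting worth checking is a precombine field: Hudi uses it to pick a single record per key when merging, and without it duplicate keys may survive an upsert. A sketch under that assumption (the `precombineKey` option and the `ts` column are hypothetical additions, not part of the original model):

```
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    file_format = 'hudi',
    options = {
        'type': 'cow',
        'primaryKey': 'id',
        'precombineKey': 'ts',
    },
    unique_key = 'id',
) }}
```

Here `ts` would need to be a column emitted by the model (e.g. an event timestamp) so Hudi can order versions of the same key.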
<img width="1257" alt="Screen Shot 2022-11-18 at 3 38 33 PM"
src="https://user-images.githubusercontent.com/42749351/202820863-67721eb4-8c4d-4736-9fae-966f5dbf51c4.png">
The dbt merge model produces duplicate rows on every incremental run.
**To Reproduce**
Steps to reproduce the behavior:
1. Run the model twice.
**Expected behavior**
Duplicates should not be present; the second run should update the existing rows in place.
**Environment Description**
* Hudi version : 0.10.1
* Spark version : 3.2.0
* Hive version : 2.3.3
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
**Additional context**
The cluster is EMR (emr-6.6.0) with the Spark Thrift Server running on it:
```
/usr/lib/spark/sbin/start-thriftserver.sh \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.hive.convertMetastoreParquet=false' \
  --conf 'spark.jars=hdfs:///apps/hudi/lib/hudi-spark-bundle.jar' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
```
Table properties:
```
hoodie.table.partition.fields=
hoodie.table.type=COPY_ON_WRITE
hoodie.archivelog.folder=archived
hoodie.timeline.layout.version=1
hoodie.table.version=3
hoodie.table.recordkey.fields=id
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.table.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.table.name=test_merge_2
hoodie.datasource.write.hive_style_partitioning=true
hoodie.table.create.schema={"type"\:"record","name"\:"topLevelRecord","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"id","type"\:"long"},{"name"\:"msg","type"\:"string"}]}
```
Create query:
```
create table dbt_tmp.test_merge_2
using hudi
options (type "cow", primaryKey "id")
as
select cast(1 as bigint) as id, 'yo' as msg
union all
select cast(2 as bigint) as id, 'anyway' as msg
union all
select cast(3 as bigint) as id, 'yo' as msg
```
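To confirm the duplication after the second run, a simple aggregate over the record key can be used (a diagnostic query added here for illustration, not part of the original report):

```
select id, count(*) as cnt
from dbt_tmp.test_merge_2
group by id
having count(*) > 1
```

With the model above, each `id` appearing twice after the second run would indicate the merge inserted new rows instead of updating the matched ones.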
Thrift server log from the second (merge) run:
```
22/11/18 23:35:18 INFO SparkExecuteStatementOperation: Submitting query
'/* {"app": "dbt", "dbt_version": "1.1.0", "profile_name": "spark_emr",
"target_name": "prod_test", "node_id": "model.transform_spark.test_merge_2"} */
create temporary view test_merge_2__dbt_tmp as
select cast(1 as bigint) as id, 'yo2' as msg
union all
select cast(2 as bigint) as id, 'anyway1' as msg
union all
select cast(3 as bigint) as id, 'yo3' as msg
' with 897bf6f7-352f-4762-a135-e693c133c5bc
22/11/18 23:35:18 INFO SparkExecuteStatementOperation: Running query with
897bf6f7-352f-4762-a135-e693c133c5bc
22/11/18 23:35:18 INFO SparkExecuteStatementOperation: Received
getNextRowSet request order=FETCH_NEXT and maxRowsL=1000 with
897bf6f7-352f-4762-a135-e693c133c5bc
22/11/18 23:35:19 INFO SparkExecuteStatementOperation: Submitting query '/*
{"app": "dbt", "dbt_version": "1.1.0", "profile_name": "spark_emr",
"target_name": "prod_test", "node_id": "model.transform_spark.test_merge_2"
merge into dbt_tmp.test_merge_2 as DBT_INTERNAL_DEST
using test_merge_2__dbt_tmp as DBT_INTERNAL_SOURCE
on
DBT_INTERNAL_SOURCE.id = DBT_INTERNAL_DEST.id
when matched then update set
*
when not matched then insert *
' with bda690e6-a81b-4cc2-830a-0654e4f43f14
22/11/18 23:35:19 INFO SparkExecuteStatementOperation: Running query with
bda690e6-a81b-4cc2-830a-0654e4f43f14
22/11/18 23:35:19 INFO SparkContext: Starting job: collect at
HoodieSparkEngineContext.java:100
22/11/18 23:35:19 INFO DAGScheduler: Got job 165 (collect at
HoodieSparkEngineContext.java:100) with 1 output partitions
22/11/18 23:35:19 INFO DAGScheduler: Final stage: ResultStage 216 (collect
at HoodieSparkEngineContext.java:100)
22/11/18 23:35:19 INFO DAGScheduler: Parents of final stage: List()
22/11/18 23:35:19 INFO DAGScheduler: Missing parents: List()
22/11/18 23:35:19 INFO DAGScheduler: Submitting ResultStage 216
(MapPartitionsRDD[562] at map at HoodieSparkEngineContext.java:100), which has
no missing parents
22/11/18 23:35:19 INFO MemoryStore: Block broadcast_192 stored as values in
memory (estimated size 96.2 KiB, free 1048.7 MiB)
22/11/18 23:35:19 INFO MemoryStore: Block broadcast_192_piece0 stored as
bytes in memory (estimated size 35.5 KiB, free 1048.7 MiB)
22/11/18 23:35:19 INFO BlockManagerInfo: Added broadcast_192_piece0 in
memory on ip-172-31-22-180.us-west-1.compute.internal:44351 (size: 35.5 KiB,
free: 1048.8 MiB)
22/11/18 23:35:19 INFO SparkContext: Created broadcast 192 from broadcast at
DAGScheduler.scala:1467
22/11/18 23:35:19 INFO DAGScheduler: Submitting 1 missing tasks from
ResultStage 216 (MapPartitionsRDD[562] at map at
HoodieSparkEngineContext.java:100) (first 15 tasks are for partitions Vector(0))
22/11/18 23:35:19 INFO YarnScheduler: Adding task set 216.0 with 1 tasks
resource profile 0
22/11/18 23:35:19 INFO TaskSetManager: Starting task 0.0 in stage 216.0 (TID
330) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 0,
PROCESS_LOCAL, 4417 bytes) taskResourceAssignments Map()
22/11/18 23:35:19 INFO BlockManagerInfo: Added broadcast_192_piece0 in
memory on ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 35.5 KiB,
free: 10.9 GiB)
22/11/18 23:35:19 INFO TaskSetManager: Finished task 0.0 in stage 216.0 (TID
330) in 99 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (1/1)
22/11/18 23:35:19 INFO YarnScheduler: Removed TaskSet 216.0, whose tasks
have all completed, from pool
22/11/18 23:35:19 INFO DAGScheduler: ResultStage 216 (collect at
HoodieSparkEngineContext.java:100) finished in 0.109 s
22/11/18 23:35:19 INFO DAGScheduler: Job 165 is finished. Cancelling
potential speculative or zombie tasks for this job
22/11/18 23:35:19 INFO YarnScheduler: Killing all running tasks in stage
216: Stage finished
22/11/18 23:35:19 INFO DAGScheduler: Job 165 finished: collect at
HoodieSparkEngineContext.java:100, took 0.109971 s
22/11/18 23:35:19 INFO SparkContext: Starting job: collect at
HoodieSparkEngineContext.java:100
22/11/18 23:35:19 INFO DAGScheduler: Got job 166 (collect at
HoodieSparkEngineContext.java:100) with 1 output partitions
22/11/18 23:35:19 INFO DAGScheduler: Final stage: ResultStage 217 (collect
at HoodieSparkEngineContext.java:100)
22/11/18 23:35:19 INFO DAGScheduler: Parents of final stage: List()
22/11/18 23:35:19 INFO DAGScheduler: Missing parents: List()
22/11/18 23:35:19 INFO DAGScheduler: Submitting ResultStage 217
(MapPartitionsRDD[564] at map at HoodieSparkEngineContext.java:100), which has
no missing parents
22/11/18 23:35:19 INFO MemoryStore: Block broadcast_193 stored as values in
memory (estimated size 96.2 KiB, free 1048.6 MiB)
22/11/18 23:35:19 INFO MemoryStore: Block broadcast_193_piece0 stored as
bytes in memory (estimated size 35.5 KiB, free 1048.5 MiB)
22/11/18 23:35:19 INFO BlockManagerInfo: Added broadcast_193_piece0 in
memory on ip-172-31-22-180.us-west-1.compute.internal:44351 (size: 35.5 KiB,
free: 1048.7 MiB)
22/11/18 23:35:19 INFO SparkContext: Created broadcast 193 from broadcast at
DAGScheduler.scala:1467
22/11/18 23:35:19 INFO DAGScheduler: Submitting 1 missing tasks from
ResultStage 217 (MapPartitionsRDD[564] at map at
HoodieSparkEngineContext.java:100) (first 15 tasks are for partitions Vector(0))
22/11/18 23:35:19 INFO YarnScheduler: Adding task set 217.0 with 1 tasks
resource profile 0
22/11/18 23:35:19 INFO TaskSetManager: Starting task 0.0 in stage 217.0 (TID
331) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 0,
PROCESS_LOCAL, 4390 bytes) taskResourceAssignments Map()
22/11/18 23:35:19 INFO BlockManagerInfo: Added broadcast_193_piece0 in
memory on ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 35.5 KiB,
free: 10.9 GiB)
22/11/18 23:35:19 INFO TaskSetManager: Finished task 0.0 in stage 217.0 (TID
331) in 84 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (1/1)
22/11/18 23:35:19 INFO YarnScheduler: Removed TaskSet 217.0, whose tasks
have all completed, from pool
22/11/18 23:35:19 INFO DAGScheduler: ResultStage 217 (collect at
HoodieSparkEngineContext.java:100) finished in 0.092 s
22/11/18 23:35:19 INFO DAGScheduler: Job 166 is finished. Cancelling
potential speculative or zombie tasks for this job
22/11/18 23:35:19 INFO YarnScheduler: Killing all running tasks in stage
217: Stage finished
22/11/18 23:35:19 INFO DAGScheduler: Job 166 finished: collect at
HoodieSparkEngineContext.java:100, took 0.093236 s
22/11/18 23:35:20 INFO Javalin:
__ __ _
/ /____ _ _ __ ____ _ / /(_)____
__ / // __ `/| | / // __ `// // // __ \
/ /_/ // /_/ / | |/ // /_/ // // // / / /
\____/ \__,_/ |___/ \__,_//_//_//_/ /_/
https://javalin.io/documentation
22/11/18 23:35:20 INFO Javalin: Starting Javalin ...
22/11/18 23:35:20 INFO Javalin: Listening on http://localhost:44599/
22/11/18 23:35:20 INFO Javalin: Javalin started in 5ms \o/
22/11/18 23:35:20 INFO SparkContext: Starting job: countByKey at
BaseSparkCommitActionExecutor.java:193
22/11/18 23:35:20 INFO DAGScheduler: Registering RDD 575 (countByKey at
BaseSparkCommitActionExecutor.java:193) as input to shuffle 27
22/11/18 23:35:20 INFO DAGScheduler: Got job 167 (countByKey at
BaseSparkCommitActionExecutor.java:193) with 3 output partitions
22/11/18 23:35:20 INFO DAGScheduler: Final stage: ResultStage 219
(countByKey at BaseSparkCommitActionExecutor.java:193)
22/11/18 23:35:20 INFO DAGScheduler: Parents of final stage:
List(ShuffleMapStage 218)
22/11/18 23:35:20 INFO DAGScheduler: Missing parents: List(ShuffleMapStage
218)
22/11/18 23:35:20 INFO DAGScheduler: Submitting ShuffleMapStage 218
(MapPartitionsRDD[575] at countByKey at
BaseSparkCommitActionExecutor.java:193), which has no missing parents
22/11/18 23:35:20 INFO MemoryStore: Block broadcast_194 stored as values in
memory (estimated size 45.8 KiB, free 1048.5 MiB)
22/11/18 23:35:20 INFO MemoryStore: Block broadcast_194_piece0 stored as
bytes in memory (estimated size 19.6 KiB, free 1048.5 MiB)
22/11/18 23:35:20 INFO BlockManagerInfo: Added broadcast_194_piece0 in
memory on ip-172-31-22-180.us-west-1.compute.internal:44351 (size: 19.6 KiB,
free: 1048.7 MiB)
22/11/18 23:35:20 INFO SparkContext: Created broadcast 194 from broadcast at
DAGScheduler.scala:1467
22/11/18 23:35:20 INFO DAGScheduler: Submitting 3 missing tasks from
ShuffleMapStage 218 (MapPartitionsRDD[575] at countByKey at
BaseSparkCommitActionExecutor.java:193) (first 15 tasks are for partitions
Vector(0, 1, 2))
22/11/18 23:35:20 INFO YarnScheduler: Adding task set 218.0 with 3 tasks
resource profile 0
22/11/18 23:35:20 INFO TaskSetManager: Starting task 0.0 in stage 218.0 (TID
332) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 0,
PROCESS_LOCAL, 4592 bytes) taskResourceAssignments Map()
22/11/18 23:35:20 INFO TaskSetManager: Starting task 1.0 in stage 218.0 (TID
333) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 1,
PROCESS_LOCAL, 4592 bytes) taskResourceAssignments Map()
22/11/18 23:35:20 INFO TaskSetManager: Starting task 2.0 in stage 218.0 (TID
334) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 2,
PROCESS_LOCAL, 4592 bytes) taskResourceAssignments Map()
22/11/18 23:35:20 INFO BlockManagerInfo: Added broadcast_194_piece0 in
memory on ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 19.6 KiB,
free: 10.9 GiB)
22/11/18 23:35:20 INFO BlockManagerInfo: Added rdd_573_2 in memory on
ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 170.0 B, free: 10.9
GiB)
22/11/18 23:35:20 INFO BlockManagerInfo: Added rdd_573_1 in memory on
ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 174.0 B, free: 10.9
GiB)
22/11/18 23:35:20 INFO BlockManagerInfo: Added rdd_573_0 in memory on
ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 170.0 B, free: 10.9
GiB)
22/11/18 23:35:20 INFO TaskSetManager: Finished task 0.0 in stage 218.0 (TID
332) in 53 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (1/3)
22/11/18 23:35:20 INFO TaskSetManager: Finished task 1.0 in stage 218.0 (TID
333) in 53 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (2/3)
22/11/18 23:35:20 INFO TaskSetManager: Finished task 2.0 in stage 218.0 (TID
334) in 54 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (3/3)
22/11/18 23:35:20 INFO YarnScheduler: Removed TaskSet 218.0, whose tasks
have all completed, from pool
22/11/18 23:35:20 INFO DAGScheduler: ShuffleMapStage 218 (countByKey at
BaseSparkCommitActionExecutor.java:193) finished in 0.057 s
22/11/18 23:35:20 INFO DAGScheduler: looking for newly runnable stages
22/11/18 23:35:20 INFO DAGScheduler: running: Set()
22/11/18 23:35:20 INFO DAGScheduler: waiting: Set(ResultStage 219)
22/11/18 23:35:20 INFO DAGScheduler: failed: Set()
22/11/18 23:35:20 INFO DAGScheduler: Submitting ResultStage 219
(ShuffledRDD[576] at countByKey at BaseSparkCommitActionExecutor.java:193),
which has no missing parents
22/11/18 23:35:20 INFO MemoryStore: Block broadcast_195 stored as values in
memory (estimated size 5.7 KiB, free 1048.5 MiB)
22/11/18 23:35:20 INFO MemoryStore: Block broadcast_195_piece0 stored as
bytes in memory (estimated size 3.3 KiB, free 1048.5 MiB)
22/11/18 23:35:20 INFO BlockManagerInfo: Added broadcast_195_piece0 in
memory on ip-172-31-22-180.us-west-1.compute.internal:44351 (size: 3.3 KiB,
free: 1048.7 MiB)
22/11/18 23:35:20 INFO SparkContext: Created broadcast 195 from broadcast at
DAGScheduler.scala:1467
22/11/18 23:35:20 INFO DAGScheduler: Submitting 3 missing tasks from
ResultStage 219 (ShuffledRDD[576] at countByKey at
BaseSparkCommitActionExecutor.java:193) (first 15 tasks are for partitions
Vector(0, 1, 2))
22/11/18 23:35:20 INFO YarnScheduler: Adding task set 219.0 with 3 tasks
resource profile 0
22/11/18 23:35:20 INFO TaskSetManager: Starting task 1.0 in stage 219.0 (TID
335) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 1,
NODE_LOCAL, 4282 bytes) taskResourceAssignments Map()
22/11/18 23:35:20 INFO TaskSetManager: Starting task 0.0 in stage 219.0 (TID
336) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 0,
PROCESS_LOCAL, 4282 bytes) taskResourceAssignments Map()
22/11/18 23:35:20 INFO TaskSetManager: Starting task 2.0 in stage 219.0 (TID
337) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 2,
PROCESS_LOCAL, 4282 bytes) taskResourceAssignments Map()
22/11/18 23:35:20 INFO BlockManagerInfo: Added broadcast_195_piece0 in
memory on ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 3.3 KiB,
free: 10.9 GiB)
22/11/18 23:35:20 INFO MapOutputTrackerMasterEndpoint: Asked to send map
output locations for shuffle 27 to 172.31.22.163:53582
22/11/18 23:35:20 INFO TaskSetManager: Finished task 0.0 in stage 219.0 (TID
336) in 20 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (1/3)
22/11/18 23:35:20 INFO TaskSetManager: Finished task 1.0 in stage 219.0 (TID
335) in 20 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (2/3)
22/11/18 23:35:20 INFO TaskSetManager: Finished task 2.0 in stage 219.0 (TID
337) in 20 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (3/3)
22/11/18 23:35:20 INFO YarnScheduler: Removed TaskSet 219.0, whose tasks
have all completed, from pool
22/11/18 23:35:20 INFO DAGScheduler: ResultStage 219 (countByKey at
BaseSparkCommitActionExecutor.java:193) finished in 0.022 s
22/11/18 23:35:20 INFO DAGScheduler: Job 167 is finished. Cancelling
potential speculative or zombie tasks for this job
22/11/18 23:35:20 INFO YarnScheduler: Killing all running tasks in stage
219: Stage finished
22/11/18 23:35:20 INFO DAGScheduler: Job 167 finished: countByKey at
BaseSparkCommitActionExecutor.java:193, took 0.081495 s
22/11/18 23:35:21 INFO SparkContext: Starting job: collect at
SparkRejectUpdateStrategy.java:52
22/11/18 23:35:21 INFO DAGScheduler: Registering RDD 579 (distinct at
SparkRejectUpdateStrategy.java:52) as input to shuffle 28
22/11/18 23:35:21 INFO DAGScheduler: Got job 168 (collect at
SparkRejectUpdateStrategy.java:52) with 3 output partitions
22/11/18 23:35:21 INFO DAGScheduler: Final stage: ResultStage 221 (collect
at SparkRejectUpdateStrategy.java:52)
22/11/18 23:35:21 INFO DAGScheduler: Parents of final stage:
List(ShuffleMapStage 220)
22/11/18 23:35:21 INFO DAGScheduler: Missing parents: List(ShuffleMapStage
220)
22/11/18 23:35:21 INFO DAGScheduler: Submitting ShuffleMapStage 220
(MapPartitionsRDD[579] at distinct at SparkRejectUpdateStrategy.java:52), which
has no missing parents
22/11/18 23:35:21 INFO MemoryStore: Block broadcast_196 stored as values in
memory (estimated size 46.1 KiB, free 1048.4 MiB)
22/11/18 23:35:21 INFO MemoryStore: Block broadcast_196_piece0 stored as
bytes in memory (estimated size 19.7 KiB, free 1048.4 MiB)
22/11/18 23:35:21 INFO BlockManagerInfo: Added broadcast_196_piece0 in
memory on ip-172-31-22-180.us-west-1.compute.internal:44351 (size: 19.7 KiB,
free: 1048.7 MiB)
22/11/18 23:35:21 INFO SparkContext: Created broadcast 196 from broadcast at
DAGScheduler.scala:1467
22/11/18 23:35:21 INFO DAGScheduler: Submitting 3 missing tasks from
ShuffleMapStage 220 (MapPartitionsRDD[579] at distinct at
SparkRejectUpdateStrategy.java:52) (first 15 tasks are for partitions Vector(0,
1, 2))
22/11/18 23:35:21 INFO YarnScheduler: Adding task set 220.0 with 3 tasks
resource profile 0
22/11/18 23:35:21 INFO TaskSetManager: Starting task 0.0 in stage 220.0 (TID
338) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 0,
PROCESS_LOCAL, 4592 bytes) taskResourceAssignments Map()
22/11/18 23:35:21 INFO TaskSetManager: Starting task 1.0 in stage 220.0 (TID
339) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 1,
PROCESS_LOCAL, 4592 bytes) taskResourceAssignments Map()
22/11/18 23:35:21 INFO TaskSetManager: Starting task 2.0 in stage 220.0 (TID
340) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 2,
PROCESS_LOCAL, 4592 bytes) taskResourceAssignments Map()
22/11/18 23:35:21 INFO BlockManagerInfo: Added broadcast_196_piece0 in
memory on ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 19.7 KiB,
free: 10.9 GiB)
22/11/18 23:35:21 INFO TaskSetManager: Finished task 0.0 in stage 220.0 (TID
338) in 19 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (1/3)
22/11/18 23:35:21 INFO TaskSetManager: Finished task 1.0 in stage 220.0 (TID
339) in 19 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (2/3)
22/11/18 23:35:21 INFO TaskSetManager: Finished task 2.0 in stage 220.0 (TID
340) in 19 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (3/3)
22/11/18 23:35:21 INFO YarnScheduler: Removed TaskSet 220.0, whose tasks
have all completed, from pool
22/11/18 23:35:21 INFO DAGScheduler: ShuffleMapStage 220 (distinct at
SparkRejectUpdateStrategy.java:52) finished in 0.023 s
22/11/18 23:35:21 INFO DAGScheduler: looking for newly runnable stages
22/11/18 23:35:21 INFO DAGScheduler: running: Set()
22/11/18 23:35:21 INFO DAGScheduler: waiting: Set(ResultStage 221)
22/11/18 23:35:21 INFO DAGScheduler: failed: Set()
22/11/18 23:35:21 INFO DAGScheduler: Submitting ResultStage 221
(MapPartitionsRDD[581] at distinct at SparkRejectUpdateStrategy.java:52), which
has no missing parents
22/11/18 23:35:21 INFO MemoryStore: Block broadcast_197 stored as values in
memory (estimated size 6.5 KiB, free 1048.4 MiB)
22/11/18 23:35:21 INFO MemoryStore: Block broadcast_197_piece0 stored as
bytes in memory (estimated size 3.6 KiB, free 1048.4 MiB)
22/11/18 23:35:21 INFO BlockManagerInfo: Added broadcast_197_piece0 in
memory on ip-172-31-22-180.us-west-1.compute.internal:44351 (size: 3.6 KiB,
free: 1048.7 MiB)
22/11/18 23:35:21 INFO SparkContext: Created broadcast 197 from broadcast at
DAGScheduler.scala:1467
22/11/18 23:35:21 INFO DAGScheduler: Submitting 3 missing tasks from
ResultStage 221 (MapPartitionsRDD[581] at distinct at
SparkRejectUpdateStrategy.java:52) (first 15 tasks are for partitions Vector(0,
1, 2))
22/11/18 23:35:21 INFO YarnScheduler: Adding task set 221.0 with 3 tasks
resource profile 0
22/11/18 23:35:21 INFO TaskSetManager: Starting task 0.0 in stage 221.0 (TID
341) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 0,
PROCESS_LOCAL, 4282 bytes) taskResourceAssignments Map()
22/11/18 23:35:21 INFO TaskSetManager: Starting task 1.0 in stage 221.0 (TID
342) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 1,
PROCESS_LOCAL, 4282 bytes) taskResourceAssignments Map()
22/11/18 23:35:21 INFO TaskSetManager: Starting task 2.0 in stage 221.0 (TID
343) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 2,
PROCESS_LOCAL, 4282 bytes) taskResourceAssignments Map()
22/11/18 23:35:21 INFO BlockManagerInfo: Added broadcast_197_piece0 in
memory on ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 3.6 KiB,
free: 10.9 GiB)
22/11/18 23:35:21 INFO MapOutputTrackerMasterEndpoint: Asked to send map
output locations for shuffle 28 to 172.31.22.163:53582
22/11/18 23:35:21 INFO BlockManagerInfo: Removed broadcast_193_piece0 on
ip-172-31-22-180.us-west-1.compute.internal:44351 in memory (size: 35.5 KiB,
free: 1048.7 MiB)
22/11/18 23:35:21 INFO TaskSetManager: Finished task 0.0 in stage 221.0 (TID
341) in 25 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (1/3)
22/11/18 23:35:21 INFO TaskSetManager: Finished task 1.0 in stage 221.0 (TID
342) in 25 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (2/3)
22/11/18 23:35:21 INFO BlockManagerInfo: Removed broadcast_193_piece0 on
ip-172-31-22-163.us-west-1.compute.internal:35253 in memory (size: 35.5 KiB,
free: 10.9 GiB)
22/11/18 23:35:21 INFO TaskSetManager: Finished task 2.0 in stage 221.0 (TID
343) in 25 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (3/3)
22/11/18 23:35:21 INFO YarnScheduler: Removed TaskSet 221.0, whose tasks
have all completed, from pool
22/11/18 23:35:21 INFO DAGScheduler: ResultStage 221 (collect at
SparkRejectUpdateStrategy.java:52) finished in 0.028 s
22/11/18 23:35:21 INFO DAGScheduler: Job 168 is finished. Cancelling
potential speculative or zombie tasks for this job
22/11/18 23:35:21 INFO YarnScheduler: Killing all running tasks in stage
221: Stage finished
22/11/18 23:35:21 INFO DAGScheduler: Job 168 finished: collect at
SparkRejectUpdateStrategy.java:52, took 0.053273 s
22/11/18 23:35:21 INFO BlockManagerInfo: Removed broadcast_195_piece0 on
ip-172-31-22-180.us-west-1.compute.internal:44351 in memory (size: 3.3 KiB,
free: 1048.7 MiB)
22/11/18 23:35:21 INFO BlockManagerInfo: Removed broadcast_195_piece0 on
ip-172-31-22-163.us-west-1.compute.internal:35253 in memory (size: 3.3 KiB,
free: 10.9 GiB)
22/11/18 23:35:21 INFO BlockManagerInfo: Removed broadcast_194_piece0 on
ip-172-31-22-180.us-west-1.compute.internal:44351 in memory (size: 19.6 KiB,
free: 1048.7 MiB)
22/11/18 23:35:21 INFO BlockManagerInfo: Removed broadcast_194_piece0 on
ip-172-31-22-163.us-west-1.compute.internal:35253 in memory (size: 19.6 KiB,
free: 10.9 GiB)
22/11/18 23:35:21 INFO BlockManagerInfo: Removed broadcast_192_piece0 on
ip-172-31-22-180.us-west-1.compute.internal:44351 in memory (size: 35.5 KiB,
free: 1048.8 MiB)
22/11/18 23:35:21 INFO BlockManagerInfo: Removed broadcast_192_piece0 on
ip-172-31-22-163.us-west-1.compute.internal:35253 in memory (size: 35.5 KiB,
free: 10.9 GiB)
22/11/18 23:35:21 INFO BlockManagerInfo: Removed broadcast_196_piece0 on
ip-172-31-22-180.us-west-1.compute.internal:44351 in memory (size: 19.7 KiB,
free: 1048.8 MiB)
22/11/18 23:35:21 INFO BlockManagerInfo: Removed broadcast_196_piece0 on
ip-172-31-22-163.us-west-1.compute.internal:35253 in memory (size: 19.7 KiB,
free: 10.9 GiB)
22/11/18 23:35:21 INFO SparkContext: Starting job: collectAsMap at
UpsertPartitioner.java:256
22/11/18 23:35:21 INFO DAGScheduler: Got job 169 (collectAsMap at
UpsertPartitioner.java:256) with 1 output partitions
22/11/18 23:35:21 INFO DAGScheduler: Final stage: ResultStage 222
(collectAsMap at UpsertPartitioner.java:256)
22/11/18 23:35:21 INFO DAGScheduler: Parents of final stage: List()
22/11/18 23:35:21 INFO DAGScheduler: Missing parents: List()
22/11/18 23:35:21 INFO DAGScheduler: Submitting ResultStage 222
(MapPartitionsRDD[583] at mapToPair at UpsertPartitioner.java:255), which has
no missing parents
22/11/18 23:35:21 INFO MemoryStore: Block broadcast_198 stored as values in
memory (estimated size 314.3 KiB, free 1048.5 MiB)
22/11/18 23:35:21 INFO MemoryStore: Block broadcast_198_piece0 stored as
bytes in memory (estimated size 116.2 KiB, free 1048.4 MiB)
22/11/18 23:35:21 INFO BlockManagerInfo: Added broadcast_198_piece0 in
memory on ip-172-31-22-180.us-west-1.compute.internal:44351 (size: 116.2 KiB,
free: 1048.7 MiB)
22/11/18 23:35:21 INFO SparkContext: Created broadcast 198 from broadcast at
DAGScheduler.scala:1467
22/11/18 23:35:21 INFO DAGScheduler: Submitting 1 missing tasks from
ResultStage 222 (MapPartitionsRDD[583] at mapToPair at
UpsertPartitioner.java:255) (first 15 tasks are for partitions Vector(0))
22/11/18 23:35:21 INFO YarnScheduler: Adding task set 222.0 with 1 tasks
resource profile 0
22/11/18 23:35:21 INFO TaskSetManager: Starting task 0.0 in stage 222.0 (TID
344) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 0,
PROCESS_LOCAL, 4344 bytes) taskResourceAssignments Map()
22/11/18 23:35:21 INFO BlockManagerInfo: Added broadcast_198_piece0 in
memory on ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 116.2 KiB,
free: 10.9 GiB)
22/11/18 23:35:21 INFO TaskSetManager: Finished task 0.0 in stage 222.0 (TID
344) in 419 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22)
(1/1)
22/11/18 23:35:21 INFO YarnScheduler: Removed TaskSet 222.0, whose tasks
have all completed, from pool
22/11/18 23:35:21 INFO DAGScheduler: ResultStage 222 (collectAsMap at
UpsertPartitioner.java:256) finished in 0.440 s
22/11/18 23:35:21 INFO DAGScheduler: Job 169 is finished. Cancelling
potential speculative or zombie tasks for this job
22/11/18 23:35:21 INFO YarnScheduler: Killing all running tasks in stage
222: Stage finished
22/11/18 23:35:21 INFO DAGScheduler: Job 169 finished: collectAsMap at
UpsertPartitioner.java:256, took 0.440639 s
22/11/18 23:35:21 INFO SparkContext: Starting job: isEmpty at
HoodieSparkSqlWriter.scala:657
22/11/18 23:35:21 INFO DAGScheduler: Registering RDD 584 (mapToPair at
BaseSparkCommitActionExecutor.java:227) as input to shuffle 29
22/11/18 23:35:21 INFO DAGScheduler: Got job 170 (isEmpty at
HoodieSparkSqlWriter.scala:657) with 1 output partitions
22/11/18 23:35:21 INFO DAGScheduler: Final stage: ResultStage 224 (isEmpty
at HoodieSparkSqlWriter.scala:657)
22/11/18 23:35:21 INFO DAGScheduler: Parents of final stage:
List(ShuffleMapStage 223)
22/11/18 23:35:21 INFO DAGScheduler: Missing parents: List(ShuffleMapStage
223)
22/11/18 23:35:21 INFO DAGScheduler: Submitting ShuffleMapStage 223
(MapPartitionsRDD[584] at mapToPair at BaseSparkCommitActionExecutor.java:227),
which has no missing parents
22/11/18 23:35:21 INFO MemoryStore: Block broadcast_199 stored as values in
memory (estimated size 349.0 KiB, free 1048.0 MiB)
22/11/18 23:35:21 INFO MemoryStore: Block broadcast_199_piece0 stored as
bytes in memory (estimated size 130.1 KiB, free 1047.9 MiB)
22/11/18 23:35:21 INFO BlockManagerInfo: Added broadcast_199_piece0 in
memory on ip-172-31-22-180.us-west-1.compute.internal:44351 (size: 130.1 KiB,
free: 1048.6 MiB)
22/11/18 23:35:21 INFO SparkContext: Created broadcast 199 from broadcast at
DAGScheduler.scala:1467
22/11/18 23:35:21 INFO DAGScheduler: Submitting 3 missing tasks from
ShuffleMapStage 223 (MapPartitionsRDD[584] at mapToPair at
BaseSparkCommitActionExecutor.java:227) (first 15 tasks are for partitions
Vector(0, 1, 2))
22/11/18 23:35:21 INFO YarnScheduler: Adding task set 223.0 with 3 tasks
resource profile 0
22/11/18 23:35:21 INFO TaskSetManager: Starting task 0.0 in stage 223.0 (TID
345) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 0,
PROCESS_LOCAL, 4592 bytes) taskResourceAssignments Map()
22/11/18 23:35:21 INFO TaskSetManager: Starting task 1.0 in stage 223.0 (TID
346) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 1,
PROCESS_LOCAL, 4592 bytes) taskResourceAssignments Map()
22/11/18 23:35:21 INFO TaskSetManager: Starting task 2.0 in stage 223.0 (TID
347) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 2,
PROCESS_LOCAL, 4592 bytes) taskResourceAssignments Map()
22/11/18 23:35:21 INFO BlockManagerInfo: Added broadcast_199_piece0 in
memory on ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 130.1 KiB,
free: 10.9 GiB)
22/11/18 23:35:21 INFO TaskSetManager: Finished task 2.0 in stage 223.0 (TID
347) in 37 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (1/3)
22/11/18 23:35:21 INFO TaskSetManager: Finished task 1.0 in stage 223.0 (TID
346) in 38 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (2/3)
22/11/18 23:35:21 INFO TaskSetManager: Finished task 0.0 in stage 223.0 (TID
345) in 41 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (3/3)
22/11/18 23:35:21 INFO YarnScheduler: Removed TaskSet 223.0, whose tasks
have all completed, from pool
22/11/18 23:35:21 INFO DAGScheduler: ShuffleMapStage 223 (mapToPair at
BaseSparkCommitActionExecutor.java:227) finished in 0.063 s
22/11/18 23:35:21 INFO DAGScheduler: looking for newly runnable stages
22/11/18 23:35:21 INFO DAGScheduler: running: Set()
22/11/18 23:35:21 INFO DAGScheduler: waiting: Set(ResultStage 224)
22/11/18 23:35:21 INFO DAGScheduler: failed: Set()
22/11/18 23:35:21 INFO DAGScheduler: Submitting ResultStage 224
(MapPartitionsRDD[589] at filter at HoodieSparkSqlWriter.scala:657), which has
no missing parents
22/11/18 23:35:21 INFO MemoryStore: Block broadcast_200 stored as values in
memory (estimated size 456.3 KiB, free 1047.5 MiB)
22/11/18 23:35:21 INFO MemoryStore: Block broadcast_200_piece0 stored as
bytes in memory (estimated size 171.1 KiB, free 1047.3 MiB)
22/11/18 23:35:21 INFO BlockManagerInfo: Added broadcast_200_piece0 in
memory on ip-172-31-22-180.us-west-1.compute.internal:44351 (size: 171.1 KiB,
free: 1048.4 MiB)
22/11/18 23:35:21 INFO SparkContext: Created broadcast 200 from broadcast at
DAGScheduler.scala:1467
22/11/18 23:35:21 INFO DAGScheduler: Submitting 1 missing tasks from
ResultStage 224 (MapPartitionsRDD[589] at filter at
HoodieSparkSqlWriter.scala:657) (first 15 tasks are for partitions Vector(0))
22/11/18 23:35:21 INFO YarnScheduler: Adding task set 224.0 with 1 tasks
resource profile 0
22/11/18 23:35:21 INFO TaskSetManager: Starting task 0.0 in stage 224.0 (TID
348) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 0,
NODE_LOCAL, 4282 bytes) taskResourceAssignments Map()
22/11/18 23:35:21 INFO BlockManagerInfo: Added broadcast_200_piece0 in
memory on ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 171.1 KiB,
free: 10.9 GiB)
22/11/18 23:35:21 INFO MapOutputTrackerMasterEndpoint: Asked to send map
output locations for shuffle 29 to 172.31.22.163:53582
22/11/18 23:35:23 INFO BlockManagerInfo: Added rdd_588_0 in memory on
ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 367.0 B, free: 10.9
GiB)
22/11/18 23:35:23 INFO TaskSetManager: Finished task 0.0 in stage 224.0 (TID
348) in 1911 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22)
(1/1)
22/11/18 23:35:23 INFO YarnScheduler: Removed TaskSet 224.0, whose tasks
have all completed, from pool
22/11/18 23:35:23 INFO DAGScheduler: ResultStage 224 (isEmpty at
HoodieSparkSqlWriter.scala:657) finished in 1.938 s
22/11/18 23:35:23 INFO DAGScheduler: Job 170 is finished. Cancelling
potential speculative or zombie tasks for this job
22/11/18 23:35:23 INFO YarnScheduler: Killing all running tasks in stage
224: Stage finished
22/11/18 23:35:23 INFO DAGScheduler: Job 170 finished: isEmpty at
HoodieSparkSqlWriter.scala:657, took 2.003411 s
22/11/18 23:35:23 INFO SparkContext: Starting job: collect at
SparkRDDWriteClient.java:124
22/11/18 23:35:23 INFO DAGScheduler: Got job 171 (collect at
SparkRDDWriteClient.java:124) with 1 output partitions
22/11/18 23:35:23 INFO DAGScheduler: Final stage: ResultStage 226 (collect
at SparkRDDWriteClient.java:124)
22/11/18 23:35:23 INFO DAGScheduler: Parents of final stage:
List(ShuffleMapStage 225)
22/11/18 23:35:23 INFO DAGScheduler: Missing parents: List()
22/11/18 23:35:23 INFO DAGScheduler: Submitting ResultStage 226
(MapPartitionsRDD[590] at map at SparkRDDWriteClient.java:124), which has no
missing parents
22/11/18 23:35:23 INFO MemoryStore: Block broadcast_201 stored as values in
memory (estimated size 456.8 KiB, free 1046.8 MiB)
22/11/18 23:35:23 INFO MemoryStore: Block broadcast_201_piece0 stored as
bytes in memory (estimated size 171.3 KiB, free 1046.7 MiB)
22/11/18 23:35:23 INFO BlockManagerInfo: Added broadcast_201_piece0 in
memory on ip-172-31-22-180.us-west-1.compute.internal:44351 (size: 171.3 KiB,
free: 1048.2 MiB)
22/11/18 23:35:23 INFO SparkContext: Created broadcast 201 from broadcast at
DAGScheduler.scala:1467
22/11/18 23:35:23 INFO DAGScheduler: Submitting 1 missing tasks from
ResultStage 226 (MapPartitionsRDD[590] at map at SparkRDDWriteClient.java:124)
(first 15 tasks are for partitions Vector(0))
22/11/18 23:35:23 INFO YarnScheduler: Adding task set 226.0 with 1 tasks
resource profile 0
22/11/18 23:35:23 INFO TaskSetManager: Starting task 0.0 in stage 226.0 (TID
349) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 0,
PROCESS_LOCAL, 4282 bytes) taskResourceAssignments Map()
22/11/18 23:35:23 INFO BlockManagerInfo: Added broadcast_201_piece0 in
memory on ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 171.3 KiB,
free: 10.9 GiB)
22/11/18 23:35:23 INFO TaskSetManager: Finished task 0.0 in stage 226.0 (TID
349) in 30 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (1/1)
22/11/18 23:35:23 INFO YarnScheduler: Removed TaskSet 226.0, whose tasks
have all completed, from pool
22/11/18 23:35:23 INFO DAGScheduler: ResultStage 226 (collect at
SparkRDDWriteClient.java:124) finished in 0.058 s
22/11/18 23:35:23 INFO DAGScheduler: Job 171 is finished. Cancelling
potential speculative or zombie tasks for this job
22/11/18 23:35:23 INFO YarnScheduler: Killing all running tasks in stage
226: Stage finished
22/11/18 23:35:23 INFO DAGScheduler: Job 171 finished: collect at
SparkRDDWriteClient.java:124, took 0.059674 s
22/11/18 23:35:24 INFO SparkContext: Starting job: collectAsMap at
HoodieSparkEngineContext.java:148
22/11/18 23:35:24 INFO DAGScheduler: Got job 172 (collectAsMap at
HoodieSparkEngineContext.java:148) with 2 output partitions
22/11/18 23:35:24 INFO DAGScheduler: Final stage: ResultStage 227
(collectAsMap at HoodieSparkEngineContext.java:148)
22/11/18 23:35:24 INFO DAGScheduler: Parents of final stage: List()
22/11/18 23:35:24 INFO DAGScheduler: Missing parents: List()
22/11/18 23:35:24 INFO DAGScheduler: Submitting ResultStage 227
(MapPartitionsRDD[592] at mapToPair at HoodieSparkEngineContext.java:145),
which has no missing parents
22/11/18 23:35:24 INFO MemoryStore: Block broadcast_202 stored as values in
memory (estimated size 106.2 KiB, free 1046.6 MiB)
22/11/18 23:35:24 INFO MemoryStore: Block broadcast_202_piece0 stored as
bytes in memory (estimated size 39.7 KiB, free 1046.5 MiB)
22/11/18 23:35:24 INFO BlockManagerInfo: Added broadcast_202_piece0 in
memory on ip-172-31-22-180.us-west-1.compute.internal:44351 (size: 39.7 KiB,
free: 1048.2 MiB)
22/11/18 23:35:24 INFO SparkContext: Created broadcast 202 from broadcast at
DAGScheduler.scala:1467
22/11/18 23:35:24 INFO DAGScheduler: Submitting 2 missing tasks from
ResultStage 227 (MapPartitionsRDD[592] at mapToPair at
HoodieSparkEngineContext.java:145) (first 15 tasks are for partitions Vector(0,
1))
22/11/18 23:35:24 INFO YarnScheduler: Adding task set 227.0 with 2 tasks
resource profile 0
22/11/18 23:35:24 INFO TaskSetManager: Starting task 0.0 in stage 227.0 (TID
350) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 0,
PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
22/11/18 23:35:24 INFO TaskSetManager: Starting task 1.0 in stage 227.0 (TID
351) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 1,
PROCESS_LOCAL, 4433 bytes) taskResourceAssignments Map()
22/11/18 23:35:24 INFO BlockManagerInfo: Added broadcast_202_piece0 in
memory on ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 39.7 KiB,
free: 10.9 GiB)
22/11/18 23:35:24 INFO TaskSetManager: Finished task 0.0 in stage 227.0 (TID
350) in 240 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22)
(1/2)
22/11/18 23:35:24 INFO TaskSetManager: Finished task 1.0 in stage 227.0 (TID
351) in 240 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22)
(2/2)
22/11/18 23:35:24 INFO YarnScheduler: Removed TaskSet 227.0, whose tasks
have all completed, from pool
22/11/18 23:35:24 INFO DAGScheduler: ResultStage 227 (collectAsMap at
HoodieSparkEngineContext.java:148) finished in 0.249 s
22/11/18 23:35:24 INFO DAGScheduler: Job 172 is finished. Cancelling
potential speculative or zombie tasks for this job
22/11/18 23:35:24 INFO YarnScheduler: Killing all running tasks in stage
227: Stage finished
22/11/18 23:35:24 INFO DAGScheduler: Job 172 finished: collectAsMap at
HoodieSparkEngineContext.java:148, took 0.250474 s
22/11/18 23:35:25 INFO BlockManagerInfo: Removed broadcast_202_piece0 on
ip-172-31-22-180.us-west-1.compute.internal:44351 in memory (size: 39.7 KiB,
free: 1048.2 MiB)
22/11/18 23:35:25 INFO BlockManagerInfo: Removed broadcast_202_piece0 on
ip-172-31-22-163.us-west-1.compute.internal:35253 in memory (size: 39.7 KiB,
free: 10.9 GiB)
22/11/18 23:35:25 INFO BlockManagerInfo: Removed broadcast_197_piece0 on
ip-172-31-22-180.us-west-1.compute.internal:44351 in memory (size: 3.6 KiB,
free: 1048.2 MiB)
22/11/18 23:35:25 INFO BlockManagerInfo: Removed broadcast_197_piece0 on
ip-172-31-22-163.us-west-1.compute.internal:35253 in memory (size: 3.6 KiB,
free: 10.9 GiB)
22/11/18 23:35:25 INFO BlockManagerInfo: Removed broadcast_198_piece0 on
ip-172-31-22-180.us-west-1.compute.internal:44351 in memory (size: 116.2 KiB,
free: 1048.3 MiB)
22/11/18 23:35:25 INFO BlockManagerInfo: Removed broadcast_198_piece0 on
ip-172-31-22-163.us-west-1.compute.internal:35253 in memory (size: 116.2 KiB,
free: 10.9 GiB)
22/11/18 23:35:25 INFO BlockManagerInfo: Removed broadcast_199_piece0 on
ip-172-31-22-180.us-west-1.compute.internal:44351 in memory (size: 130.1 KiB,
free: 1048.5 MiB)
22/11/18 23:35:25 INFO BlockManagerInfo: Removed broadcast_199_piece0 on
ip-172-31-22-163.us-west-1.compute.internal:35253 in memory (size: 130.1 KiB,
free: 10.9 GiB)
22/11/18 23:35:25 INFO BlockManagerInfo: Removed broadcast_200_piece0 on
ip-172-31-22-180.us-west-1.compute.internal:44351 in memory (size: 171.1 KiB,
free: 1048.6 MiB)
22/11/18 23:35:25 INFO BlockManagerInfo: Removed broadcast_200_piece0 on
ip-172-31-22-163.us-west-1.compute.internal:35253 in memory (size: 171.1 KiB,
free: 10.9 GiB)
22/11/18 23:35:25 INFO BlockManagerInfo: Removed broadcast_201_piece0 on
ip-172-31-22-180.us-west-1.compute.internal:44351 in memory (size: 171.3 KiB,
free: 1048.8 MiB)
22/11/18 23:35:25 INFO BlockManagerInfo: Removed broadcast_201_piece0 on
ip-172-31-22-163.us-west-1.compute.internal:35253 in memory (size: 171.3 KiB,
free: 10.9 GiB)
22/11/18 23:35:25 INFO MapPartitionsRDD: Removing RDD 588 from persistence
list
22/11/18 23:35:25 INFO MapPartitionsRDD: Removing RDD 573 from persistence
list
22/11/18 23:35:25 INFO BlockManager: Removing RDD 588
22/11/18 23:35:25 INFO BlockManager: Removing RDD 573
22/11/18 23:35:25 INFO metastore: Trying to connect to metastore with URI
thrift://hive-writable-metastore.prod.branch.io:9083
22/11/18 23:35:25 INFO metastore: Opened a connection to metastore, current
connections: 1
22/11/18 23:35:25 INFO metastore: Connected to metastore.
22/11/18 23:35:26 INFO metastore: Closed a connection to metastore, current
connections: 0
22/11/18 23:35:26 INFO SparkContext: Starting job: collect at
HoodieSparkEngineContext.java:100
22/11/18 23:35:26 INFO DAGScheduler: Got job 173 (collect at
HoodieSparkEngineContext.java:100) with 1 output partitions
22/11/18 23:35:26 INFO DAGScheduler: Final stage: ResultStage 228 (collect
at HoodieSparkEngineContext.java:100)
22/11/18 23:35:26 INFO DAGScheduler: Parents of final stage: List()
22/11/18 23:35:26 INFO DAGScheduler: Missing parents: List()
22/11/18 23:35:26 INFO DAGScheduler: Submitting ResultStage 228
(MapPartitionsRDD[594] at map at HoodieSparkEngineContext.java:100), which has
no missing parents
22/11/18 23:35:26 INFO MemoryStore: Block broadcast_203 stored as values in
memory (estimated size 96.2 KiB, free 1048.7 MiB)
22/11/18 23:35:26 INFO MemoryStore: Block broadcast_203_piece0 stored as
bytes in memory (estimated size 35.5 KiB, free 1048.7 MiB)
22/11/18 23:35:26 INFO BlockManagerInfo: Added broadcast_203_piece0 in
memory on ip-172-31-22-180.us-west-1.compute.internal:44351 (size: 35.5 KiB,
free: 1048.8 MiB)
22/11/18 23:35:26 INFO SparkContext: Created broadcast 203 from broadcast at
DAGScheduler.scala:1467
22/11/18 23:35:26 INFO DAGScheduler: Submitting 1 missing tasks from
ResultStage 228 (MapPartitionsRDD[594] at map at
HoodieSparkEngineContext.java:100) (first 15 tasks are for partitions Vector(0))
22/11/18 23:35:26 INFO YarnScheduler: Adding task set 228.0 with 1 tasks
resource profile 0
22/11/18 23:35:26 INFO TaskSetManager: Starting task 0.0 in stage 228.0 (TID
352) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 0,
PROCESS_LOCAL, 4417 bytes) taskResourceAssignments Map()
22/11/18 23:35:26 INFO BlockManagerInfo: Added broadcast_203_piece0 in
memory on ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 35.5 KiB,
free: 10.9 GiB)
22/11/18 23:35:26 INFO TaskSetManager: Finished task 0.0 in stage 228.0 (TID
352) in 104 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22)
(1/1)
22/11/18 23:35:26 INFO YarnScheduler: Removed TaskSet 228.0, whose tasks
have all completed, from pool
22/11/18 23:35:26 INFO DAGScheduler: ResultStage 228 (collect at
HoodieSparkEngineContext.java:100) finished in 0.113 s
22/11/18 23:35:26 INFO DAGScheduler: Job 173 is finished. Cancelling
potential speculative or zombie tasks for this job
22/11/18 23:35:26 INFO YarnScheduler: Killing all running tasks in stage
228: Stage finished
22/11/18 23:35:26 INFO DAGScheduler: Job 173 finished: collect at
HoodieSparkEngineContext.java:100, took 0.113273 s
22/11/18 23:35:26 INFO SparkContext: Starting job: collect at
HoodieSparkEngineContext.java:100
22/11/18 23:35:26 INFO DAGScheduler: Got job 174 (collect at
HoodieSparkEngineContext.java:100) with 1 output partitions
22/11/18 23:35:26 INFO DAGScheduler: Final stage: ResultStage 229 (collect
at HoodieSparkEngineContext.java:100)
22/11/18 23:35:26 INFO DAGScheduler: Parents of final stage: List()
22/11/18 23:35:26 INFO DAGScheduler: Missing parents: List()
22/11/18 23:35:26 INFO DAGScheduler: Submitting ResultStage 229
(MapPartitionsRDD[596] at map at HoodieSparkEngineContext.java:100), which has
no missing parents
22/11/18 23:35:26 INFO MemoryStore: Block broadcast_204 stored as values in
memory (estimated size 96.2 KiB, free 1048.6 MiB)
22/11/18 23:35:26 INFO MemoryStore: Block broadcast_204_piece0 stored as
bytes in memory (estimated size 35.5 KiB, free 1048.5 MiB)
22/11/18 23:35:26 INFO BlockManagerInfo: Added broadcast_204_piece0 in
memory on ip-172-31-22-180.us-west-1.compute.internal:44351 (size: 35.5 KiB,
free: 1048.7 MiB)
22/11/18 23:35:26 INFO SparkContext: Created broadcast 204 from broadcast at
DAGScheduler.scala:1467
22/11/18 23:35:26 INFO DAGScheduler: Submitting 1 missing tasks from
ResultStage 229 (MapPartitionsRDD[596] at map at
HoodieSparkEngineContext.java:100) (first 15 tasks are for partitions Vector(0))
22/11/18 23:35:26 INFO YarnScheduler: Adding task set 229.0 with 1 tasks
resource profile 0
22/11/18 23:35:26 INFO TaskSetManager: Starting task 0.0 in stage 229.0 (TID
353) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 0,
PROCESS_LOCAL, 4390 bytes) taskResourceAssignments Map()
22/11/18 23:35:26 INFO BlockManagerInfo: Added broadcast_204_piece0 in
memory on ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 35.5 KiB,
free: 10.9 GiB)
22/11/18 23:35:26 INFO TaskSetManager: Finished task 0.0 in stage 229.0 (TID
353) in 79 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (1/1)
22/11/18 23:35:26 INFO YarnScheduler: Removed TaskSet 229.0, whose tasks
have all completed, from pool
22/11/18 23:35:26 INFO DAGScheduler: ResultStage 229 (collect at
HoodieSparkEngineContext.java:100) finished in 0.087 s
22/11/18 23:35:26 INFO DAGScheduler: Job 174 is finished. Cancelling
potential speculative or zombie tasks for this job
22/11/18 23:35:26 INFO YarnScheduler: Killing all running tasks in stage
229: Stage finished
22/11/18 23:35:26 INFO DAGScheduler: Job 174 finished: collect at
HoodieSparkEngineContext.java:100, took 0.088012 s
22/11/18 23:35:26 INFO Javalin: Stopping Javalin ...
22/11/18 23:35:26 INFO Javalin: Javalin has stopped
22/11/18 23:35:26 INFO SparkContext: Starting job: collect at
HoodieSparkEngineContext.java:100
22/11/18 23:35:26 INFO DAGScheduler: Got job 175 (collect at
HoodieSparkEngineContext.java:100) with 1 output partitions
22/11/18 23:35:26 INFO DAGScheduler: Final stage: ResultStage 230 (collect
at HoodieSparkEngineContext.java:100)
22/11/18 23:35:26 INFO DAGScheduler: Parents of final stage: List()
22/11/18 23:35:26 INFO DAGScheduler: Missing parents: List()
22/11/18 23:35:26 INFO DAGScheduler: Submitting ResultStage 230
(MapPartitionsRDD[598] at map at HoodieSparkEngineContext.java:100), which has
no missing parents
22/11/18 23:35:26 INFO BlockManagerInfo: Removed broadcast_204_piece0 on
ip-172-31-22-180.us-west-1.compute.internal:44351 in memory (size: 35.5 KiB,
free: 1048.8 MiB)
22/11/18 23:35:26 INFO BlockManagerInfo: Removed broadcast_204_piece0 on
ip-172-31-22-163.us-west-1.compute.internal:35253 in memory (size: 35.5 KiB,
free: 10.9 GiB)
22/11/18 23:35:26 INFO BlockManagerInfo: Removed broadcast_203_piece0 on
ip-172-31-22-180.us-west-1.compute.internal:44351 in memory (size: 35.5 KiB,
free: 1048.8 MiB)
22/11/18 23:35:26 INFO MemoryStore: Block broadcast_205 stored as values in
memory (estimated size 96.2 KiB, free 1048.7 MiB)
22/11/18 23:35:26 INFO BlockManagerInfo: Removed broadcast_203_piece0 on
ip-172-31-22-163.us-west-1.compute.internal:35253 in memory (size: 35.5 KiB,
free: 10.9 GiB)
22/11/18 23:35:26 INFO MemoryStore: Block broadcast_205_piece0 stored as
bytes in memory (estimated size 35.5 KiB, free 1048.7 MiB)
22/11/18 23:35:26 INFO BlockManagerInfo: Added broadcast_205_piece0 in
memory on ip-172-31-22-180.us-west-1.compute.internal:44351 (size: 35.5 KiB,
free: 1048.8 MiB)
22/11/18 23:35:26 INFO SparkContext: Created broadcast 205 from broadcast at
DAGScheduler.scala:1467
22/11/18 23:35:26 INFO DAGScheduler: Submitting 1 missing tasks from
ResultStage 230 (MapPartitionsRDD[598] at map at
HoodieSparkEngineContext.java:100) (first 15 tasks are for partitions Vector(0))
22/11/18 23:35:26 INFO YarnScheduler: Adding task set 230.0 with 1 tasks
resource profile 0
22/11/18 23:35:26 INFO TaskSetManager: Starting task 0.0 in stage 230.0 (TID
354) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 0,
PROCESS_LOCAL, 4417 bytes) taskResourceAssignments Map()
22/11/18 23:35:26 INFO BlockManagerInfo: Added broadcast_205_piece0 in
memory on ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 35.5 KiB,
free: 10.9 GiB)
22/11/18 23:35:26 INFO TaskSetManager: Finished task 0.0 in stage 230.0 (TID
354) in 73 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (1/1)
22/11/18 23:35:26 INFO YarnScheduler: Removed TaskSet 230.0, whose tasks
have all completed, from pool
22/11/18 23:35:26 INFO DAGScheduler: ResultStage 230 (collect at
HoodieSparkEngineContext.java:100) finished in 0.083 s
22/11/18 23:35:26 INFO DAGScheduler: Job 175 is finished. Cancelling
potential speculative or zombie tasks for this job
22/11/18 23:35:26 INFO YarnScheduler: Killing all running tasks in stage
230: Stage finished
22/11/18 23:35:26 INFO DAGScheduler: Job 175 finished: collect at
HoodieSparkEngineContext.java:100, took 0.096327 s
22/11/18 23:35:27 INFO SparkContext: Starting job: collect at
HoodieSparkEngineContext.java:100
22/11/18 23:35:27 INFO DAGScheduler: Got job 176 (collect at
HoodieSparkEngineContext.java:100) with 1 output partitions
22/11/18 23:35:27 INFO DAGScheduler: Final stage: ResultStage 231 (collect
at HoodieSparkEngineContext.java:100)
22/11/18 23:35:27 INFO DAGScheduler: Parents of final stage: List()
22/11/18 23:35:27 INFO DAGScheduler: Missing parents: List()
22/11/18 23:35:27 INFO DAGScheduler: Submitting ResultStage 231
(MapPartitionsRDD[600] at map at HoodieSparkEngineContext.java:100), which has
no missing parents
22/11/18 23:35:27 INFO MemoryStore: Block broadcast_206 stored as values in
memory (estimated size 96.2 KiB, free 1048.6 MiB)
22/11/18 23:35:27 INFO MemoryStore: Block broadcast_206_piece0 stored as
bytes in memory (estimated size 35.5 KiB, free 1048.5 MiB)
22/11/18 23:35:27 INFO BlockManagerInfo: Added broadcast_206_piece0 in
memory on ip-172-31-22-180.us-west-1.compute.internal:44351 (size: 35.5 KiB,
free: 1048.7 MiB)
22/11/18 23:35:27 INFO SparkContext: Created broadcast 206 from broadcast at
DAGScheduler.scala:1467
22/11/18 23:35:27 INFO DAGScheduler: Submitting 1 missing tasks from
ResultStage 231 (MapPartitionsRDD[600] at map at
HoodieSparkEngineContext.java:100) (first 15 tasks are for partitions Vector(0))
22/11/18 23:35:27 INFO YarnScheduler: Adding task set 231.0 with 1 tasks
resource profile 0
22/11/18 23:35:27 INFO TaskSetManager: Starting task 0.0 in stage 231.0 (TID
355) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 0,
PROCESS_LOCAL, 4390 bytes) taskResourceAssignments Map()
22/11/18 23:35:27 INFO BlockManagerInfo: Added broadcast_206_piece0 in
memory on ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 35.5 KiB,
free: 10.9 GiB)
22/11/18 23:35:27 INFO TaskSetManager: Finished task 0.0 in stage 231.0 (TID
355) in 76 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (1/1)
22/11/18 23:35:27 INFO YarnScheduler: Removed TaskSet 231.0, whose tasks
have all completed, from pool
22/11/18 23:35:27 INFO DAGScheduler: ResultStage 231 (collect at
HoodieSparkEngineContext.java:100) finished in 0.084 s
22/11/18 23:35:27 INFO DAGScheduler: Job 176 is finished. Cancelling
potential speculative or zombie tasks for this job
22/11/18 23:35:27 INFO YarnScheduler: Killing all running tasks in stage
231: Stage finished
22/11/18 23:35:27 INFO DAGScheduler: Job 176 finished: collect at
HoodieSparkEngineContext.java:100, took 0.085097 s
22/11/18 23:35:27 INFO SparkContext: Starting job: collect at
HoodieSparkEngineContext.java:100
22/11/18 23:35:27 INFO DAGScheduler: Got job 177 (collect at
HoodieSparkEngineContext.java:100) with 1 output partitions
22/11/18 23:35:27 INFO DAGScheduler: Final stage: ResultStage 232 (collect
at HoodieSparkEngineContext.java:100)
22/11/18 23:35:27 INFO DAGScheduler: Parents of final stage: List()
22/11/18 23:35:27 INFO DAGScheduler: Missing parents: List()
22/11/18 23:35:27 INFO DAGScheduler: Submitting ResultStage 232
(MapPartitionsRDD[602] at map at HoodieSparkEngineContext.java:100), which has
no missing parents
22/11/18 23:35:27 INFO MemoryStore: Block broadcast_207 stored as values in
memory (estimated size 96.2 KiB, free 1048.4 MiB)
22/11/18 23:35:27 INFO MemoryStore: Block broadcast_207_piece0 stored as
bytes in memory (estimated size 35.5 KiB, free 1048.4 MiB)
22/11/18 23:35:27 INFO BlockManagerInfo: Added broadcast_207_piece0 in
memory on ip-172-31-22-180.us-west-1.compute.internal:44351 (size: 35.5 KiB,
free: 1048.7 MiB)
22/11/18 23:35:27 INFO SparkContext: Created broadcast 207 from broadcast at
DAGScheduler.scala:1467
22/11/18 23:35:27 INFO DAGScheduler: Submitting 1 missing tasks from
ResultStage 232 (MapPartitionsRDD[602] at map at
HoodieSparkEngineContext.java:100) (first 15 tasks are for partitions Vector(0))
22/11/18 23:35:27 INFO YarnScheduler: Adding task set 232.0 with 1 tasks
resource profile 0
22/11/18 23:35:27 INFO TaskSetManager: Starting task 0.0 in stage 232.0 (TID
356) (ip-172-31-22-163.us-west-1.compute.internal, executor 22, partition 0,
PROCESS_LOCAL, 4417 bytes) taskResourceAssignments Map()
22/11/18 23:35:27 INFO BlockManagerInfo: Added broadcast_207_piece0 in
memory on ip-172-31-22-163.us-west-1.compute.internal:35253 (size: 35.5 KiB,
free: 10.9 GiB)
22/11/18 23:35:27 INFO TaskSetManager: Finished task 0.0 in stage 232.0 (TID
356) in 75 ms on ip-172-31-22-163.us-west-1.compute.internal (executor 22) (1/1)
22/11/18 23:35:27 INFO YarnScheduler: Removed TaskSet 232.0, whose tasks
have all completed, from pool
22/11/18 23:35:27 INFO DAGScheduler: ResultStage 232 (collect at
HoodieSparkEngineContext.java:100) finished in 0.083 s
22/11/18 23:35:27 INFO DAGScheduler: Job 177 is finished. Cancelling
potential speculative or zombie tasks for this job
22/11/18 23:35:27 INFO YarnScheduler: Killing all running tasks in stage
232: Stage finished
22/11/18 23:35:27 INFO DAGScheduler: Job 177 finished: collect at
HoodieSparkEngineContext.java:100, took 0.084346 s
```
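The log above shows the second write completing without errors, yet the table ends up with six rows. For reference, here is a minimal sketch (plain Python, hypothetical data mirroring the model's two runs) of the upsert semantics a `merge` on `unique_key = 'id'` is expected to have: a second run with the same keys should replace the three rows, not append three more.

```python
def upsert(table, rows, key="id"):
    """Merge rows into table keyed on `key`; matching keys are overwritten."""
    indexed = {r[key]: r for r in table}
    for r in rows:
        indexed[r[key]] = r  # update-or-insert by primary key, never duplicate
    return sorted(indexed.values(), key=lambda r: r[key])

# Rows produced by the model's first (non-incremental) and second (incremental) runs
first_run = [{"id": 1, "msg": "yo"}, {"id": 2, "msg": "anyway"}, {"id": 3, "msg": "yo"}]
second_run = [{"id": 1, "msg": "yo2"}, {"id": 2, "msg": "anyway1"}, {"id": 3, "msg": "yo3"}]

table = upsert([], first_run)
table = upsert(table, second_run)
# Expected: exactly 3 rows (ids 1..3) carrying the second run's msg values --
# not the 6 rows observed in the issue.
```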