[ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821773#comment-15821773
 ] 

Steve Loughran commented on SPARK-19111:
----------------------------------------

Just realised one more thing

If the allocated threads for writing data are all used up in PUT calls, the 
thread calling write() will be blocked. And that tuning I mentioned was "keep 
#of publisher threads down to minimise heap use".

This means: if the log creation rate is > upload bandwidth for some period of 
time, whichever thread is converting history events to write calls will block. 
Hadoop 2.8 s3a reduces risk here as it will buffer to HDD by default, but even 
there, if there's a mismatch between data generation and upload rates, you'll 
be in trouble

> S3 Mesos history upload fails silently if too large
> ---------------------------------------------------
>
>                 Key: SPARK-19111
>                 URL: https://issues.apache.org/jira/browse/SPARK-19111
>             Project: Spark
>          Issue Type: Bug
>          Components: EC2, Mesos, Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Charles Allen
>
> {code}
> 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped 
> Spark web UI at http://REDACTED:4041
> 2017-01-06T21:32:32,938 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.jvmGCTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBlocksFetched
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSerializationTime
> 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(
> 364,WrappedArray())
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSize
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.peakExecutionMemory
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.fetchWaitTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.memoryBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.diskBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.recordsRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorDeserializeTime
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorRunTime
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBlocksFetched
> 2017-01-06T21:32:32,943 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray())
> 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray())
> 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray())
> {code}
> Running spark on mesos, some large jobs fail to upload to the history server 
> storage!
> A successful sequence of events in the log that yield an upload are as 
> follows:
> {code}
> 2017-01-06T19:14:32,925 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp'
> 2017-01-06T21:59:14,789 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:59:44,679 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> upload complete
> {code}
> But large jobs do not ever get to the {{upload complete}} log message, and 
> instead exit before completion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to