Hi TD,

That little experiment helped a bit. This time we did not see any
exceptions for about 16 hours, but eventually the same exceptions as
before were thrown. The cleaning of the shuffle files also stopped well
before those exceptions appeared - about 7-1/2 hours after startup.

I am not quite sure how to proceed from here. Any suggestions on how to
avoid these errors?

Thanks
NB


On Fri, Apr 24, 2015 at 12:57 AM, N B <nb.nos...@gmail.com> wrote:

> Hi TD,
>
> That may very well have been the case. There may be some delay on our
> output side. I have made a change just for testing that sends the output
> nowhere. I will see if that helps get rid of these errors. Then we can try
> to find out how we can optimize so that we do not lag.
>
> Questions: How can we ever be sure that a lag, even a temporary one, will
> never occur in the future? Also, shouldn't Spark avoid cleaning up any temp
> files that it knows it might still need in the (near?) future?
>
> Thanks
> Nikunj
>
>
> On Thu, Apr 23, 2015 at 12:29 PM, Tathagata Das <t...@databricks.com>
> wrote:
>
>> What was the state of your streaming application? Was it falling behind
>> with a large increasing scheduling delay?
>>
>> TD
>>
>> On Thu, Apr 23, 2015 at 11:31 AM, N B <nb.nos...@gmail.com> wrote:
>>
>>> Thanks for the response, Conor. I tried with those settings and for a
>>> while it seemed like it was cleaning up the shuffle files after itself.
>>> However, exactly 5 hours later it started throwing exceptions and
>>> eventually stopped working again. A sample stack trace is below. What is
>>> curious about the 5 hours is that I had set the cleaner ttl to 5 hours
>>> after changing the max window size to 1 hour (down from 6 hours, in order
>>> to test). It also stopped cleaning the shuffle files once this started
>>> happening.
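>>>
>>> For context, here is roughly how the window and the ttl relate in our
>>> setup. This is only a simplified sketch; the app name, source, and batch
>>> interval below are placeholders, not our actual code:
>>>
>>> import org.apache.spark.SparkConf
>>> import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
>>>
>>> val conf = new SparkConf()
>>>   .setAppName("windowed-stream-test")               // placeholder
>>>   .set("spark.cleaner.ttl", (5 * 60 * 60).toString) // 5 hours, i.e. the 1-hour window plus slack
>>>
>>> val ssc = new StreamingContext(conf, Seconds(10))     // placeholder batch interval
>>> val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
>>> val windowed = lines.window(Minutes(60), Seconds(10)) // 1-hour window, sliding every batch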
>>>
>>> Any idea why this could be happening?
>>>
>>> 2015-04-22 17:39:52,040 ERROR Executor task launch worker-989
>>> Executor.logError - Exception in task 0.0 in stage 215425.0 (TID 425147)
>>> java.lang.Exception: Could not compute split, block
>>> input-0-1429706099000 not found
>>>         at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
>>>         at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>>         at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>>         at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>>         at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>>>         at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>>         at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>>         at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>         at org.apache.spark.scheduler.Task.run(Task.scala:56)
>>>         at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:198)
>>>         at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>         at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>         at java.lang.Thread.run(Thread.java:745)
>>>
>>> Thanks
>>> NB
>>>
>>>
>>> On Tue, Apr 21, 2015 at 5:14 AM, Conor Fennell <
>>> conor.fenn...@altocloud.com> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> We set the spark.cleaner.ttl to some reasonable time and also
>>>> set spark.streaming.unpersist=true.
>>>>
>>>>
>>>> Those together cleaned up the shuffle files for us.
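>>>>
>>>> For reference, a minimal sketch of how the two settings can be applied
>>>> when building the streaming context (the app name, ttl value, and batch
>>>> interval here are just placeholders):
>>>>
>>>> import org.apache.spark.SparkConf
>>>> import org.apache.spark.streaming.{Seconds, StreamingContext}
>>>>
>>>> // Both properties have to be on the SparkConf before the
>>>> // StreamingContext is created.
>>>> val conf = new SparkConf()
>>>>   .setAppName("shuffle-cleanup-example")    // placeholder
>>>>   .set("spark.cleaner.ttl", "21600")        // seconds; keep it larger than your longest window
>>>>   .set("spark.streaming.unpersist", "true") // unpersist generated RDDs once they are no longer needed
>>>>
>>>> val ssc = new StreamingContext(conf, Seconds(10)) // placeholder batch interval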
>>>>
>>>>
>>>> -Conor
>>>>
>>>> On Tue, Apr 21, 2015 at 8:18 AM, N B <nb.nos...@gmail.com> wrote:
>>>>
>>>>> We already have a cron job in place to clean up just the shuffle
>>>>> files. However, what I would really like to know is whether there is a
>>>>> "proper" way of telling Spark to clean up these files once it's done
>>>>> with them?
>>>>>
>>>>> Thanks
>>>>> NB
>>>>>
>>>>>
>>>>> On Mon, Apr 20, 2015 at 10:47 AM, Jeetendra Gangele <
>>>>> gangele...@gmail.com> wrote:
>>>>>
>>>>>> Write a cron job for this, like the one below:
>>>>>>
>>>>>> # hourly: remove Spark worker application dirs older than 24 hours (1440 minutes)
>>>>>> 12 * * * *  find $SPARK_HOME/work -cmin +1440 -prune -exec rm -rf {} \+
>>>>>> # hourly: remove stale spark-* scratch dirs under /tmp
>>>>>> 32 * * * *  find /tmp -type d -cmin +1440 -name "spark-*-*-*" -prune -exec rm -rf {} \+
>>>>>> # hourly: remove stale spark-* dirs under the configured local dir
>>>>>> 52 * * * *  find $SPARK_LOCAL_DIR -mindepth 1 -maxdepth 1 -type d -cmin +1440 -name "spark-*-*-*" -prune -exec rm -rf {} \+
>>>>>>
>>>>>>
>>>>>> On 20 April 2015 at 23:12, N B <nb.nos...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I had posed this query as part of a different thread but did not get
>>>>>>> a response there, so I am creating a new thread hoping to catch
>>>>>>> someone's attention.
>>>>>>>
>>>>>>> We are experiencing an issue of shuffle files being left behind and
>>>>>>> not being cleaned up by Spark. Since this is a Spark streaming
>>>>>>> application, it is expected to stay up indefinitely, so shuffle files
>>>>>>> not being cleaned up is a big problem right now. Our max window size is
>>>>>>> 6 hours, so we have set up a cron job to clean up shuffle files older
>>>>>>> than 12 hours; otherwise they would eat up all our disk space.
>>>>>>>
>>>>>>> Please see the following. It seems the non-cleaning of shuffle files
>>>>>>> is now being documented in 1.3.1.
>>>>>>>
>>>>>>> https://github.com/apache/spark/pull/5074/files
>>>>>>> https://issues.apache.org/jira/browse/SPARK-5836
>>>>>>>
>>>>>>>
>>>>>>> Also, for some reason, the following JIRAs that were reported as
>>>>>>> functional issues were closed as duplicates of the above documentation
>>>>>>> bug. Does this mean that this issue won't be tackled at all?
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/SPARK-3563
>>>>>>> https://issues.apache.org/jira/browse/SPARK-4796
>>>>>>> https://issues.apache.org/jira/browse/SPARK-6011
>>>>>>>
>>>>>>> Any further insight into whether this is being looked into, and into
>>>>>>> how to handle the shuffle files in the meantime, would be greatly
>>>>>>> appreciated.
>>>>>>>
>>>>>>> Thanks
>>>>>>> NB
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
