Hi Guillaume,
Interesting that you brought up Shuffle. In fact, we are experiencing this
issue of shuffle files being left behind and not being cleaned up. Since
this is a Spark streaming application, it is expected to stay up
indefinitely, so the accumulating shuffle files are a big problem for us
right now.
There are a couple of options. One is to increase the relevant timeout
(see the Spark configuration docs); there are also past threads on this
mailing list about the same symptom.
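For the timeout route, a minimal sketch of what I mean (note: which
property actually governs this depends on your Spark version;
spark.network.timeout is the umbrella network timeout in 1.3+, and the
name and value below are assumptions to adapt, not a tested fix):

    import org.apache.spark.SparkConf

    // Sketch: raise the timeout that the cleanup RPCs run under.
    // Value is in seconds; newer releases also accept units like "300s".
    val conf = new SparkConf()
      .setAppName("StreamingApp")              // placeholder app name
      .set("spark.network.timeout", "300")     // default is 120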
Another option you may try (I have a gut feeling it may work, but I am not
sure) is calling GC on the driver periodically. The cleanup of shuffle
files is tied to the GC of the corresponding RDD objects.
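If you want to experiment with that, here is a minimal sketch of the
periodic-GC idea (the 10-minute interval is arbitrary, and System.gc() is
only a hint to the JVM):

    import java.util.concurrent.{Executors, TimeUnit}

    // Periodically request a full GC on the driver. Once the unreferenced
    // RDD objects are collected, Spark's ContextCleaner can schedule the
    // cleanup of the associated shuffle files.
    val gcScheduler = Executors.newSingleThreadScheduledExecutor()
    gcScheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = System.gc()
    }, 10, 10, TimeUnit.MINUTES)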
Since we are running in local mode, won't all the executors be in the same
JVM as the driver?
Thanks
NB
On Wed, Apr 8, 2015 at 1:29 PM, Tathagata Das t...@databricks.com wrote:
It does take effect on the executors, not on the driver. Which is okay
because executors have all the data and
Thanks TD. I believe that might have been the issue. Will try for a few
days after passing in the GC option on the java command line when we start
the process.
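For the archives, the invocation we are going to try looks roughly like
this (the CMS flag is just what we picked to experiment with, not a
recommendation, and the classpath and main class are placeholders):

    java -XX:+UseConcMarkSweepGC \
         -cp <our-app-classpath> \
         com.example.StreamingApp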
Thanks for your timely help.
NB
On Wed, Apr 8, 2015 at 6:08 PM, Tathagata Das t...@databricks.com wrote:
Yes, in local mode the driver and executors will be in the same process.
And in that case the Java options in the SparkConf configuration will not
work.
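If you launch through spark-submit, the same idea would be to put the
options on the launching JVM itself, e.g. (a sketch; the GC flag and
class name are illustrative):

    spark-submit --master local[4] \
      --driver-java-options "-XX:+UseConcMarkSweepGC" \
      --class com.example.StreamingApp app.jar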
On Wed, Apr 8, 2015 at 1:44 PM, N B nb.nos...@gmail.com wrote:
Since we are running in local mode, won't all the executors be in the same
JVM as the driver?
I have a standalone Spark streaming process running in local mode, where
we read input using FlumeUtils. Our longest window size is 6 hours. After
about a day and a half of running without any issues, we start seeing
Timeout errors while cleaning up input blocks. This seems to cause reading
from Flume
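For context, the relevant part of our setup looks roughly like this (a
sketch; the host, port, durations, and names are placeholders rather than
our exact values):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    val conf = new SparkConf().setMaster("local[4]").setAppName("FlumeApp")
    val ssc = new StreamingContext(conf, Seconds(10))   // batch interval
    val events = FlumeUtils.createStream(ssc, "localhost", 41414)
    // Longest window in the job: 6 hours, sliding every 10 minutes.
    val windowed = events.window(Minutes(360), Minutes(10))
    ssc.start()
    ssc.awaitTermination()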