Re: running pyspark on kubernetes - no space left on device

2022-09-01 Thread Qian SUN
Hi
Spark provides the spark.local.dir configuration to specify the work folder on
the pod. You can set spark.local.dir to your mount path.
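
For example, a minimal PySpark sketch of that idea (assuming Spark 3.x on
Kubernetes; the claim name "spark-scratch-pvc" and the mount path
"/spark-scratch" below are placeholders, not taken from your setup):

from pyspark.sql import SparkSession

# Mount a PVC into the executors and point Spark's work folder at it.
# Claim name and mount path are placeholders.
vol = "spark.kubernetes.executor.volumes.persistentVolumeClaim.scratch"

spark = (
    SparkSession.builder
    .config(vol + ".options.claimName", "spark-scratch-pvc")
    .config(vol + ".mount.path", "/spark-scratch")
    .config("spark.local.dir", "/spark-scratch")  # scratch/shuffle space
    .getOrCreate()
)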

Best regards

Manoj GEORGE  wrote on Thursday, 1 September 2022 at 21:16:

> CONFIDENTIAL & RESTRICTED
>
> Hi Team,
>
>
>
> I am new to spark, so please excuse my ignorance.
>
>
>
> Currently we are trying to run PySpark on a Kubernetes cluster. The setup
> works fine for some jobs, but when we process a large file (~36 GB), we run
> out of space.
>
>
>
> Based on what we found on the internet, we have mapped the local dir to a
> persistent volume, but this still doesn’t solve the issue.
>
>
>
> I am not sure if it is still writing to the /tmp folder on the pod. Is there
> some other setting that needs to be changed for this to work?
>
>
>
> Thanks in advance.
>
>
>
>
>
>
>
> Thanks,
>
> Manoj George
>
> *Manager Database Architecture*​
> M: +1 3522786801
>
> manoj.geo...@amadeus.com
>
> www.amadeus.com
> 
> ​
>
>
> 
>
>
>


-- 
Best!
Qian SUN


Re: running pyspark on kubernetes - no space left on device

2022-09-01 Thread Matt Proetsch
Hi George,

You can try mounting a larger PersistentVolume to the work directory as
described here, instead of using the default local dir, which might have
site-specific size constraints:

https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes
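
As far as I can tell from that page (Spark 3.x), a volume whose name starts
with "spark-local-dir-" is picked up by Spark as local scratch storage
automatically. A rough PySpark sketch, where the claim name "shuffle-pvc" and
the mount path are placeholders:

from pyspark.sql import SparkSession

# Sketch only: the "spark-local-dir-1" volume name tells Spark on Kubernetes
# to use this mounted PVC for local/scratch storage.
vol = "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1"

spark = (
    SparkSession.builder
    .config(vol + ".options.claimName", "shuffle-pvc")
    .config(vol + ".mount.path", "/data/spark-local-dir")
    .getOrCreate()
)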

-Matt

> On Sep 1, 2022, at 09:16, Manoj GEORGE  
> wrote:
> 
> 
> CONFIDENTIAL & RESTRICTED
> 
> Hi Team,
>  
> I am new to spark, so please excuse my ignorance.
>  
> Currently we are trying to run PySpark on a Kubernetes cluster. The setup
> works fine for some jobs, but when we process a large file (~36 GB), we run
> out of space.
>  
> Based on what we found on the internet, we have mapped the local dir to a
> persistent volume, but this still doesn’t solve the issue.
>  
> I am not sure if it is still writing to the /tmp folder on the pod. Is there
> some other setting that needs to be changed for this to work?
>  
> Thanks in advance.
>  
>  
>  
> Thanks,
> Manoj George
> Manager Database Architecture​
> M: +1 3522786801
> manoj.geo...@amadeus.com
> www.amadeus.com​
> 
>  


running pyspark on kubernetes - no space left on device

2022-09-01 Thread Manoj GEORGE
CONFIDENTIAL & RESTRICTED

Hi Team,

I am new to spark, so please excuse my ignorance.

Currently we are trying to run PySpark on a Kubernetes cluster. The setup works
fine for some jobs, but when we process a large file (~36 GB), we run out of
space.

Based on what we found on the internet, we have mapped the local dir to a
persistent volume, but this still doesn't solve the issue.

I am not sure if it is still writing to the /tmp folder on the pod. Is there
some other setting that needs to be changed for this to work?

Thanks in advance.



Thanks,
Manoj George
Manager Database Architecture​
M: +1 3522786801
manoj.geo...@amadeus.com
www.amadeus.com​



Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device\n\t

2021-03-08 Thread Sachit Murarka
Thanks Sean.

Kind Regards,
Sachit Murarka


On Mon, Mar 8, 2021 at 6:23 PM Sean Owen  wrote:

> It's there in the error: No space left on device
> You ran out of disk space (local disk) on one of your machines.
>
> On Mon, Mar 8, 2021 at 2:02 AM Sachit Murarka 
> wrote:
>
>> Hi All,
>>
>> I am getting the following error in my spark job.
>>
>> Can someone please have a look ?
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
>> in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage
>> 41.0 (TID 80817, executor 193): com.esotericsoftware.kryo.KryoException:
>> java.io.IOException: No space left on device\n\tat
>> com.esotericsoftware.kryo.io.Output.flush(Output.java:188)\n\tat
>> com.esotericsoftware.kryo.io.Output.require(Output.java:164)\n\tat
>> com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:251)\n\tat
>> com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:237)\n\tat
>> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:49)\n\tat
>> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:38)\n\tat
>> com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)\n\tat
>> org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:245)\n\tat
>> org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:134)\n\tat
>> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:241)\n\tat
>> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)\n\tat
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)\n\tat
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)\n\tat
>> org.apache.spark.scheduler.Task.run(Task.scala:123)\n\tat
>> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)\n\tat
>> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)\n\tat
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)\n\tat
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat
>> java.lang.Thread.run(Thread.java:748)\nCaused by: java.io.IOException: No
>> space left on device\n\tat java.io.FileOutputStream.writeBytes(Native
>> Method)\n\tat
>> java.io.FileOutputStream.write(FileOutputStream.java:326)\n\tat
>> org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)\n\tat
>> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)\n\tat
>> java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)\n\tat
>> net.jpountz.lz4.LZ4BlockOutputStream.flush(LZ4BlockOutputStream.java:240)\n\tat
>> com.esotericsoftware.kryo.io.Output.flush(Output.java:186)\n\t... 19
>> more\n\nDriver stacktrace:\n\tat
>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)\n\tat
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)\n\tat
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)\n\tat
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)\n\tat
>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)\n\tat
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)\n\tat
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)\n\tat
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)\n\tat
>> scala.Option.foreach(Option.scala:257)\n\tat
>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)\n\tat
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)\n\tat
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)\n\tat
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)\n\tat
>> org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)\n\tat
>> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)\n\tat
>> org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)\n\tat
>> org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)\n\tat
>> org.apache.spark.SparkContex

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device\n\t

2021-03-08 Thread Sean Owen
It's there in the error: No space left on device
You ran out of disk space (local disk) on one of your machines.

On Mon, Mar 8, 2021 at 2:02 AM Sachit Murarka 
wrote:

> Hi All,
>
> I am getting the following error in my spark job.
>
> Can someone please have a look ?
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 41.0 (TID 80817, executor 193): com.esotericsoftware.kryo.KryoException:
> java.io.IOException: No space left on device\n\tat
> com.esotericsoftware.kryo.io.Output.flush(Output.java:188)\n\tat
> com.esotericsoftware.kryo.io.Output.require(Output.java:164)\n\tat
> com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:251)\n\tat
> com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:237)\n\tat
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:49)\n\tat
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:38)\n\tat
> com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)\n\tat
> org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:245)\n\tat
> org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:134)\n\tat
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:241)\n\tat
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)\n\tat
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)\n\tat
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)\n\tat
> org.apache.spark.scheduler.Task.run(Task.scala:123)\n\tat
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)\n\tat
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)\n\tat
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)\n\tat
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat
> java.lang.Thread.run(Thread.java:748)\nCaused by: java.io.IOException: No
> space left on device\n\tat java.io.FileOutputStream.writeBytes(Native
> Method)\n\tat
> java.io.FileOutputStream.write(FileOutputStream.java:326)\n\tat
> org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)\n\tat
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)\n\tat
> java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)\n\tat
> net.jpountz.lz4.LZ4BlockOutputStream.flush(LZ4BlockOutputStream.java:240)\n\tat
> com.esotericsoftware.kryo.io.Output.flush(Output.java:186)\n\t... 19
> more\n\nDriver stacktrace:\n\tat
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)\n\tat
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)\n\tat
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)\n\tat
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)\n\tat
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)\n\tat
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)\n\tat
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)\n\tat
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)\n\tat
> scala.Option.foreach(Option.scala:257)\n\tat
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)\n\tat
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)\n\tat
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)\n\tat
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)\n\tat
> org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)\n\tat
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)\n\tat
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)\n\tat
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)\n\tat
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)\n\tat
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)\n\tat
> org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)\n\tat
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n\tat
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n\tat
> org.apach

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device\n\t

2021-03-08 Thread Sachit Murarka
Hi Gourav,

I am using PySpark, Spark version 2.4.4.
I have checked that it's not a space issue. Also, I am using a mount directory
for storing temp files.
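
For reference, a quick way to confirm which scratch directory the job is
actually configured with (illustrative only; "spark" is an existing
SparkSession, and on YARN the NodeManager's local dirs take precedence over
spark.local.dir):

# Print the configured scratch directory on the driver side (sketch only).
conf = spark.sparkContext.getConf()
print(conf.get("spark.local.dir", "/tmp (default)"))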

Thanks
Sachit

On Mon, 8 Mar 2021, 13:53 Gourav Sengupta, 
wrote:

> Hi,
>
> it would help a lot if you could at least format the message before
> asking people to go through it. Also, I am pretty sure that the error is
> mentioned in the first line itself.
>
> Any ideas regarding the Spark version and environment that you are using?
>
>
> Thanks and Regards,
> Gourav Sengupta
>
> On Mon, Mar 8, 2021 at 8:02 AM Sachit Murarka 
> wrote:
>
>> Hi All,
>>
>> I am getting the following error in my spark job.
>>
>> Can someone please have a look ?
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
>> in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage
>> 41.0 (TID 80817, executor 193): com.esotericsoftware.kryo.KryoException:
>> java.io.IOException: No space left on device\n\tat
>> com.esotericsoftware.kryo.io.Output.flush(Output.java:188)\n\tat
>> com.esotericsoftware.kryo.io.Output.require(Output.java:164)\n\tat
>> com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:251)\n\tat
>> com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:237)\n\tat
>> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:49)\n\tat
>> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:38)\n\tat
>> com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)\n\tat
>> org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:245)\n\tat
>> org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:134)\n\tat
>> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:241)\n\tat
>> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)\n\tat
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)\n\tat
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)\n\tat
>> org.apache.spark.scheduler.Task.run(Task.scala:123)\n\tat
>> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)\n\tat
>> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)\n\tat
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)\n\tat
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat
>> java.lang.Thread.run(Thread.java:748)\nCaused by: java.io.IOException: No
>> space left on device\n\tat java.io.FileOutputStream.writeBytes(Native
>> Method)\n\tat
>> java.io.FileOutputStream.write(FileOutputStream.java:326)\n\tat
>> org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)\n\tat
>> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)\n\tat
>> java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)\n\tat
>> net.jpountz.lz4.LZ4BlockOutputStream.flush(LZ4BlockOutputStream.java:240)\n\tat
>> com.esotericsoftware.kryo.io.Output.flush(Output.java:186)\n\t... 19
>> more\n\nDriver stacktrace:\n\tat
>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)\n\tat
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)\n\tat
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)\n\tat
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)\n\tat
>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)\n\tat
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)\n\tat
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)\n\tat
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)\n\tat
>> scala.Option.foreach(Option.scala:257)\n\tat
>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)\n\tat
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)\n\tat
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)\n\tat
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)\n\tat
>>

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device\n\t

2021-03-08 Thread Gourav Sengupta
Hi,

it would help a lot if you could at least format the message before asking
people to go through it. Also, I am pretty sure that the error is mentioned
in the first line itself.

Any ideas regarding the Spark version and environment that you are using?


Thanks and Regards,
Gourav Sengupta

On Mon, Mar 8, 2021 at 8:02 AM Sachit Murarka 
wrote:

> Hi All,
>
> I am getting the following error in my spark job.
>
> Can someone please have a look ?
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 41.0 (TID 80817, executor 193): com.esotericsoftware.kryo.KryoException:
> java.io.IOException: No space left on device\n\tat
> com.esotericsoftware.kryo.io.Output.flush(Output.java:188)\n\tat
> com.esotericsoftware.kryo.io.Output.require(Output.java:164)\n\tat
> com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:251)\n\tat
> com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:237)\n\tat
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:49)\n\tat
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:38)\n\tat
> com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)\n\tat
> org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:245)\n\tat
> org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:134)\n\tat
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:241)\n\tat
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)\n\tat
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)\n\tat
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)\n\tat
> org.apache.spark.scheduler.Task.run(Task.scala:123)\n\tat
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)\n\tat
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)\n\tat
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)\n\tat
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat
> java.lang.Thread.run(Thread.java:748)\nCaused by: java.io.IOException: No
> space left on device\n\tat java.io.FileOutputStream.writeBytes(Native
> Method)\n\tat
> java.io.FileOutputStream.write(FileOutputStream.java:326)\n\tat
> org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)\n\tat
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)\n\tat
> java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)\n\tat
> net.jpountz.lz4.LZ4BlockOutputStream.flush(LZ4BlockOutputStream.java:240)\n\tat
> com.esotericsoftware.kryo.io.Output.flush(Output.java:186)\n\t... 19
> more\n\nDriver stacktrace:\n\tat
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)\n\tat
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)\n\tat
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)\n\tat
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)\n\tat
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)\n\tat
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)\n\tat
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)\n\tat
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)\n\tat
> scala.Option.foreach(Option.scala:257)\n\tat
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)\n\tat
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)\n\tat
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)\n\tat
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)\n\tat
> org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)\n\tat
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)\n\tat
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)\n\tat
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)\n\tat
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)\n\tat
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)\n\tat
> org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)\n\

com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device\n\t

2021-03-08 Thread Sachit Murarka
Hi All,

I am getting the following error in my spark job.

Can someone please have a look?

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage
41.0 (TID 80817, executor 193): com.esotericsoftware.kryo.KryoException:
java.io.IOException: No space left on device\n\tat
com.esotericsoftware.kryo.io.Output.flush(Output.java:188)\n\tat
com.esotericsoftware.kryo.io.Output.require(Output.java:164)\n\tat
com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:251)\n\tat
com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:237)\n\tat
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:49)\n\tat
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:38)\n\tat
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)\n\tat
org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:245)\n\tat
org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:134)\n\tat
org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:241)\n\tat
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)\n\tat
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)\n\tat
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)\n\tat
org.apache.spark.scheduler.Task.run(Task.scala:123)\n\tat
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)\n\tat
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)\n\tat
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)\n\tat
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat
java.lang.Thread.run(Thread.java:748)\nCaused by: java.io.IOException: No
space left on device\n\tat java.io.FileOutputStream.writeBytes(Native
Method)\n\tat
java.io.FileOutputStream.write(FileOutputStream.java:326)\n\tat
org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)\n\tat
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)\n\tat
java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)\n\tat
net.jpountz.lz4.LZ4BlockOutputStream.flush(LZ4BlockOutputStream.java:240)\n\tat
com.esotericsoftware.kryo.io.Output.flush(Output.java:186)\n\t... 19
more\n\nDriver stacktrace:\n\tat
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)\n\tat
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)\n\tat
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)\n\tat
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)\n\tat
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)\n\tat
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)\n\tat
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)\n\tat
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)\n\tat
scala.Option.foreach(Option.scala:257)\n\tat
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)\n\tat
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)\n\tat
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)\n\tat
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)\n\tat
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)\n\tat
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)\n\tat
org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)\n\tat
org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)\n\tat
org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)\n\tat
org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)\n\tat
org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)\n\tat
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n\tat
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n\tat
org.apache.spark.rdd.RDD.withScope(RDD.scala:363)\n\tat
org.apache.spark.rdd.RDD.collect(RDD.scala:944)\n\tat
org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)\n\tat
org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)\n\tat
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat
sun.reflect.DelegatingMethodAccessorImpl.invoke

Re: No space left on device

2018-08-22 Thread Gourav Sengupta
Hi,

that was just one of the options, and not the first one. Is there any
chance of trying out the other options mentioned? For example, pointing the
shuffle storage area to a location with more space?

Regards,
Gourav Sengupta

On Wed, Aug 22, 2018 at 11:15 AM Vitaliy Pisarev <
vitaliy.pisa...@biocatch.com> wrote:

> Documentation says that 'spark.shuffle.memoryFraction' was deprecated,
> but it doesn't say what to use instead. Any idea?
>
> On Wed, Aug 22, 2018 at 9:36 AM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> Hi,
>>
>> The best part about Spark is that it is showing you which configuration
>> to tweak as well. In case you are using EMR, check that the
>> "spark.local.dir" configuration points to the right location in the cluster.
>> If a disk is mounted across all the systems with a common path (you can
>> do that easily in EMR), then you can change the configuration to point to
>> that disk location and thereby overcome the issue.
>>
>> On another note, also try to see why the data is being written to
>> disk: is it too much shuffle? Can you increase the shuffle memory as shown
>> in the error message using "spark.shuffle.memoryFraction"?
>>
>> By any chance have you changed from caching to persisting data frames?
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>>
>>
>> On Tue, Aug 21, 2018 at 12:04 PM Vitaliy Pisarev <
>> vitaliy.pisa...@biocatch.com> wrote:
>>
>>> The other time when I encountered this I solved it by throwing more
>>> resources at it (stronger cluster).
>>> I was not able to understand the root cause though. I'll be happy to
>>> hear deeper insight as well.
>>>
>>> On Mon, Aug 20, 2018 at 7:08 PM, Steve Lewis 
>>> wrote:
>>>
>>>>
>>>> We are trying to run a job that has previously run on Spark 1.3 on a 
>>>> different cluster. The job was converted to 2.3 spark and this is a new 
>>>> cluster.
>>>>
>>>> The job dies after completing about a half dozen stages with
>>>>
>>>> java.io.IOException: No space left on device
>>>>
>>>>
>>>>It appears that the nodes are using local storage as tmp.
>>>>
>>>>
>>>> I could use help diagnosing the issue and how to fix it.
>>>>
>>>>
>>>> Here are the spark conf properties
>>>>
>>>> Spark Conf Properties
>>>> spark.driver.extraJavaOptions=-Djava.io.tmpdir=/scratch/home/int/eva/zorzan/sparktmp/
>>>> spark.master=spark://10.141.0.34:7077
>>>> spark.mesos.executor.memoryOverhead=3128
>>>> spark.shuffle.consolidateFiles=true
>>>> spark.shuffle.spill=false
>>>> spark.app.name=Anonymous
>>>> spark.shuffle.manager=sort
>>>> spark.storage.memoryFraction=0.3
>>>> spark.jars=file:/home/int/eva/zorzan/bin/SparkHydraV2-master/HydraSparkBuilt.jar
>>>> spark.ui.killEnabled=true
>>>> spark.shuffle.spill.compress=true
>>>> spark.shuffle.sort.bypassMergeThreshold=100
>>>> com.lordjoe.distributed.marker_property=spark_property_set
>>>> spark.executor.memory=12g
>>>> spark.mesos.coarse=true
>>>> spark.shuffle.memoryFraction=0.4
>>>> spark.serializer=org.apache.spark.serializer.KryoSerializer
>>>> spark.kryo.registrator=com.lordjoe.distributed.hydra.HydraKryoSerializer
>>>> spark.default.parallelism=360
>>>> spark.io.compression.codec=lz4
>>>> spark.reducer.maxMbInFlight=128
>>>> spark.hadoop.validateOutputSpecs=false
>>>> spark.submit.deployMode=client
>>>> spark.local.dir=/scratch/home/int/eva/zorzan/sparktmp
>>>> spark.shuffle.file.buffer.kb=1024
>>>>
>>>>
>>>>
>>>> --
>>>> Steven M. Lewis PhD
>>>> 4221 105th Ave NE
>>>> Kirkland, WA 98033
>>>> 206-384-1340 (cell)
>>>> Skype lordjoe_com
>>>>
>>>>
>>>
>


Re: No space left on device

2018-08-22 Thread Vitaliy Pisarev
Documentation says that 'spark.shuffle.memoryFraction' was deprecated, but
it doesn't say what to use instead. Any idea?

On Wed, Aug 22, 2018 at 9:36 AM, Gourav Sengupta 
wrote:

> Hi,
>
> The best part about Spark is that it is showing you which configuration to
> tweak as well. In case you are using EMR, check that the "spark.local.dir"
> configuration points to the right location in the cluster. If a disk
> is mounted across all the systems with a common path (you can do that
> easily in EMR), then you can change the configuration to point to that disk
> location and thereby overcome the issue.
>
> On another note, also try to see why the data is being written to disk:
> is it too much shuffle? Can you increase the shuffle memory as shown in the
> error message using "spark.shuffle.memoryFraction"?
>
> By any chance have you changed from caching to persisting data frames?
>
>
> Regards,
> Gourav Sengupta
>
>
>
> On Tue, Aug 21, 2018 at 12:04 PM Vitaliy Pisarev <
> vitaliy.pisa...@biocatch.com> wrote:
>
>> The other time when I encountered this I solved it by throwing more
>> resources at it (stronger cluster).
>> I was not able to understand the root cause though. I'll be happy to hear
>> deeper insight as well.
>>
>> On Mon, Aug 20, 2018 at 7:08 PM, Steve Lewis 
>> wrote:
>>
>>>
>>> We are trying to run a job that has previously run on Spark 1.3 on a 
>>> different cluster. The job was converted to 2.3 spark and this is a new 
>>> cluster.
>>>
>>> The job dies after completing about a half dozen stages with
>>>
>>> java.io.IOException: No space left on device
>>>
>>>
>>>It appears that the nodes are using local storage as tmp.
>>>
>>>
>>> I could use help diagnosing the issue and how to fix it.
>>>
>>>
>>> Here are the spark conf properties
>>>
>>> Spark Conf Properties
>>> spark.driver.extraJavaOptions=-Djava.io.tmpdir=/scratch/home/int/eva/zorzan/sparktmp/
>>> spark.master=spark://10.141.0.34:7077
>>> spark.mesos.executor.memoryOverhead=3128
>>> spark.shuffle.consolidateFiles=true
>>> spark.shuffle.spill=false
>>> spark.app.name=Anonymous
>>> spark.shuffle.manager=sort
>>> spark.storage.memoryFraction=0.3
>>> spark.jars=file:/home/int/eva/zorzan/bin/SparkHydraV2-master/HydraSparkBuilt.jar
>>> spark.ui.killEnabled=true
>>> spark.shuffle.spill.compress=true
>>> spark.shuffle.sort.bypassMergeThreshold=100
>>> com.lordjoe.distributed.marker_property=spark_property_set
>>> spark.executor.memory=12g
>>> spark.mesos.coarse=true
>>> spark.shuffle.memoryFraction=0.4
>>> spark.serializer=org.apache.spark.serializer.KryoSerializer
>>> spark.kryo.registrator=com.lordjoe.distributed.hydra.HydraKryoSerializer
>>> spark.default.parallelism=360
>>> spark.io.compression.codec=lz4
>>> spark.reducer.maxMbInFlight=128
>>> spark.hadoop.validateOutputSpecs=false
>>> spark.submit.deployMode=client
>>> spark.local.dir=/scratch/home/int/eva/zorzan/sparktmp
>>> spark.shuffle.file.buffer.kb=1024
>>>
>>>
>>>
>>> --
>>> Steven M. Lewis PhD
>>> 4221 105th Ave NE
>>> Kirkland, WA 98033
>>> 206-384-1340 (cell)
>>> Skype lordjoe_com
>>>
>>>
>>


Re: No space left on device

2018-08-22 Thread Gourav Sengupta
Hi,

The best part about Spark is that it is showing you which configuration to
tweak as well. In case you are using EMR, check that the "spark.local.dir"
configuration points to the right location in the cluster. If a disk is
mounted across all the systems with a common path (you can do that easily
in EMR), then you can change the configuration to point to that disk
location and thereby overcome the issue.

On another note, also try to see why the data is being written to disk:
is it too much shuffle? Can you increase the shuffle memory as shown in the
error message using "spark.shuffle.memoryFraction"?

By any chance have you changed from caching to persisting data frames?
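
As a side note, on Spark 1.6+ the legacy "spark.shuffle.memoryFraction" knob
only takes effect when spark.memory.useLegacyMode=true; the unified settings
are "spark.memory.fraction" and "spark.memory.storageFraction". A rough
PySpark sketch combining both suggestions (the path and the fraction values
are placeholders, not recommendations):

from pyspark.sql import SparkSession

# Sketch only: point scratch space at a larger mounted disk and tune the
# unified memory settings (Spark 1.6+).
spark = (
    SparkSession.builder
    .config("spark.local.dir", "/mnt/bigdisk/spark-tmp")
    .config("spark.memory.fraction", "0.7")
    .config("spark.memory.storageFraction", "0.4")
    .getOrCreate()
)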


Regards,
Gourav Sengupta



On Tue, Aug 21, 2018 at 12:04 PM Vitaliy Pisarev <
vitaliy.pisa...@biocatch.com> wrote:

> The other time when I encountered this I solved it by throwing more
> resources at it (stronger cluster).
> I was not able to understand the root cause though. I'll be happy to hear
> deeper insight as well.
>
> On Mon, Aug 20, 2018 at 7:08 PM, Steve Lewis 
> wrote:
>
>>
>> We are trying to run a job that has previously run on Spark 1.3 on a 
>> different cluster. The job was converted to 2.3 spark and this is a new 
>> cluster.
>>
>> The job dies after completing about a half dozen stages with
>>
>> java.io.IOException: No space left on device
>>
>>
>>It appears that the nodes are using local storage as tmp.
>>
>>
>> I could use help diagnosing the issue and how to fix it.
>>
>>
>> Here are the spark conf properties
>>
>> Spark Conf Properties
>> spark.driver.extraJavaOptions=-Djava.io.tmpdir=/scratch/home/int/eva/zorzan/sparktmp/
>> spark.master=spark://10.141.0.34:7077
>> spark.mesos.executor.memoryOverhead=3128
>> spark.shuffle.consolidateFiles=true
>> spark.shuffle.spill=false
>> spark.app.name=Anonymous
>> spark.shuffle.manager=sort
>> spark.storage.memoryFraction=0.3
>> spark.jars=file:/home/int/eva/zorzan/bin/SparkHydraV2-master/HydraSparkBuilt.jar
>> spark.ui.killEnabled=true
>> spark.shuffle.spill.compress=true
>> spark.shuffle.sort.bypassMergeThreshold=100
>> com.lordjoe.distributed.marker_property=spark_property_set
>> spark.executor.memory=12g
>> spark.mesos.coarse=true
>> spark.shuffle.memoryFraction=0.4
>> spark.serializer=org.apache.spark.serializer.KryoSerializer
>> spark.kryo.registrator=com.lordjoe.distributed.hydra.HydraKryoSerializer
>> spark.default.parallelism=360
>> spark.io.compression.codec=lz4
>> spark.reducer.maxMbInFlight=128
>> spark.hadoop.validateOutputSpecs=false
>> spark.submit.deployMode=client
>> spark.local.dir=/scratch/home/int/eva/zorzan/sparktmp
>> spark.shuffle.file.buffer.kb=1024
>>
>>
>>
>> --
>> Steven M. Lewis PhD
>> 4221 105th Ave NE
>> Kirkland, WA 98033
>> 206-384-1340 (cell)
>> Skype lordjoe_com
>>
>>
>


Re: No space left on device

2018-08-21 Thread Vitaliy Pisarev
The other time when I encountered this I solved it by throwing more
resources at it (stronger cluster).
I was not able to understand the root cause though. I'll be happy to hear
deeper insight as well.

On Mon, Aug 20, 2018 at 7:08 PM, Steve Lewis  wrote:

>
> We are trying to run a job that has previously run on Spark 1.3 on a 
> different cluster. The job was converted to 2.3 spark and this is a new 
> cluster.
>
> The job dies after completing about a half dozen stages with
>
> java.io.IOException: No space left on device
>
>
>It appears that the nodes are using local storage as tmp.
>
>
> I could use help diagnosing the issue and how to fix it.
>
>
> Here are the spark conf properties
>
> Spark Conf Properties
> spark.driver.extraJavaOptions=-Djava.io.tmpdir=/scratch/home/int/eva/zorzan/sparktmp/
> spark.master=spark://10.141.0.34:7077
> spark.mesos.executor.memoryOverhead=3128
> spark.shuffle.consolidateFiles=true
> spark.shuffle.spill=false
> spark.app.name=Anonymous
> spark.shuffle.manager=sort
> spark.storage.memoryFraction=0.3
> spark.jars=file:/home/int/eva/zorzan/bin/SparkHydraV2-master/HydraSparkBuilt.jar
> spark.ui.killEnabled=true
> spark.shuffle.spill.compress=true
> spark.shuffle.sort.bypassMergeThreshold=100
> com.lordjoe.distributed.marker_property=spark_property_set
> spark.executor.memory=12g
> spark.mesos.coarse=true
> spark.shuffle.memoryFraction=0.4
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> spark.kryo.registrator=com.lordjoe.distributed.hydra.HydraKryoSerializer
> spark.default.parallelism=360
> spark.io.compression.codec=lz4
> spark.reducer.maxMbInFlight=128
> spark.hadoop.validateOutputSpecs=false
> spark.submit.deployMode=client
> spark.local.dir=/scratch/home/int/eva/zorzan/sparktmp
> spark.shuffle.file.buffer.kb=1024
>
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
>
>


No space left on device

2018-08-20 Thread Steve Lewis
We are trying to run a job that previously ran on Spark 1.3 on a
different cluster. The job was converted to Spark 2.3 and this is a
new cluster.

The job dies after completing about a half dozen stages with

java.io.IOException: No space left on device


   It appears that the nodes are using local storage as tmp.


I could use help diagnosing the issue and how to fix it.


Here are the spark conf properties

Spark Conf Properties
spark.driver.extraJavaOptions=-Djava.io.tmpdir=/scratch/home/int/eva/zorzan/sparktmp/
spark.master=spark://10.141.0.34:7077
spark.mesos.executor.memoryOverhead=3128
spark.shuffle.consolidateFiles=true
spark.shuffle.spill=false
spark.app.name=Anonymous
spark.shuffle.manager=sort
spark.storage.memoryFraction=0.3
spark.jars=file:/home/int/eva/zorzan/bin/SparkHydraV2-master/HydraSparkBuilt.jar
spark.ui.killEnabled=true
spark.shuffle.spill.compress=true
spark.shuffle.sort.bypassMergeThreshold=100
com.lordjoe.distributed.marker_property=spark_property_set
spark.executor.memory=12g
spark.mesos.coarse=true
spark.shuffle.memoryFraction=0.4
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=com.lordjoe.distributed.hydra.HydraKryoSerializer
spark.default.parallelism=360
spark.io.compression.codec=lz4
spark.reducer.maxMbInFlight=128
spark.hadoop.validateOutputSpecs=false
spark.submit.deployMode=client
spark.local.dir=/scratch/home/int/eva/zorzan/sparktmp
spark.shuffle.file.buffer.kb=1024



-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com


Re: java.nio.file.FileSystemException: /tmp/spark- .._cache : No space left on device

2018-08-19 Thread naresh Goud
Also check that there is enough space available in the /tmp directory.
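
For example, a quick check from Python (illustrative only):

import shutil

# Report free space on /tmp in GiB.
total, used, free = shutil.disk_usage("/tmp")
print(f"/tmp free: {free / 1024**3:.1f} GiB")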

On Fri, Aug 17, 2018 at 10:14 AM Jeevan K. Srivatsa <
jeevansriva...@gmail.com> wrote:

> Hi Venkata,
>
> On a quick glance, it looks like a file-related issue more so than an
> executor issue. If the logs are not that important, I would clear
> /tmp/spark-events/ directory and assign a suitable permission (e.g., chmod
> 755) to that and rerun the application.
>
> chmod 755 /tmp/spark-events/
>
> Thanks and regards,
> Jeevan K. Srivatsa
>
>
> On Fri, 17 Aug 2018 at 15:20, Polisetti, Venkata Siva Rama Gopala Krishna <
> vpolise...@spglobal.com> wrote:
>
>> Hi
>>
>> I am getting the below exception when I run spark-submit on a Linux machine.
>> Can someone give a quick solution with commands?
>>
>> Driver stacktrace:
>>
>> - Job 0 failed: count at DailyGainersAndLosersPublisher.scala:145, took
>> 5.749450 s
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4
>> in stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage
>> 0.0 (TID 6, 172.29.62.145, executor 0): java.nio.file.FileSystemException:
>> /tmp/spark-523d5331-3884-440c-ac0d-f46838c2029f/executor-390c9cd7-217e-42f3-97cb-fa2734405585/spark-206d92c0-f0d3-443c-97b2-39494e2c5fdd/-4230744641534510169119_cache
>> -> ./PublishGainersandLosers-1.0-SNAPSHOT-shaded-Gopal.jar: No space left
>> on device
>>
>> at
>> sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
>>
>> at
>> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>>
>> at sun.nio.fs.UnixCopyFile.copyFile(UnixCopyFile.java:253)
>>
>> at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:581)
>>
>> at
>> sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
>>
>> at java.nio.file.Files.copy(Files.java:1274)
>>
>> at
>> org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:625)
>>
>> at org.apache.spark.util.Utils$.copyFile(Utils.scala:596)
>>
>> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:473)
>>
>> at
>> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:696)
>>
>> at
>> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:688)
>>
>> at
>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>>
>> at
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>>
>> at
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>>
>> at
>> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>>
>> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>>
>> at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>>
>> at
>> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>>
>> at org.apache.spark.executor.Executor.org
>> $apache$spark$executor$Executor$$updateDependencies(Executor.scala:688)
>>
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:308)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>
>> at java.lang.Thread.run(Thread.java:745)
>>
>>
>>
>> --
>>
>>
> --
Thanks,
Naresh
www.linkedin.com/in/naresh-dulam
http://hadoopandspark.blogspot.com/


Re: java.nio.file.FileSystemException: /tmp/spark- .._cache : No space left on device

2018-08-17 Thread Jeevan K. Srivatsa
Hi Venkata,

At a quick glance, it looks like a file-related issue more than an
executor issue. If the logs are not that important, I would clear the
/tmp/spark-events/ directory, assign suitable permissions (e.g., chmod
755) to it, and rerun the application.

chmod 755 /tmp/spark-events/

Thanks and regards,
Jeevan K. Srivatsa


On Fri, 17 Aug 2018 at 15:20, Polisetti, Venkata Siva Rama Gopala Krishna <
vpolise...@spglobal.com> wrote:

> Hi
>
> I am getting the below exception when I run spark-submit on a Linux machine.
> Can someone give a quick solution with commands?
>
> Driver stacktrace:
>
> - Job 0 failed: count at DailyGainersAndLosersPublisher.scala:145, took
> 5.749450 s
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4
> in stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage
> 0.0 (TID 6, 172.29.62.145, executor 0): java.nio.file.FileSystemException:
> /tmp/spark-523d5331-3884-440c-ac0d-f46838c2029f/executor-390c9cd7-217e-42f3-97cb-fa2734405585/spark-206d92c0-f0d3-443c-97b2-39494e2c5fdd/-4230744641534510169119_cache
> -> ./PublishGainersandLosers-1.0-SNAPSHOT-shaded-Gopal.jar: No space left
> on device
>
> at
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
>
> at
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>
> at sun.nio.fs.UnixCopyFile.copyFile(UnixCopyFile.java:253)
>
> at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:581)
>
> at
> sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
>
> at java.nio.file.Files.copy(Files.java:1274)
>
> at
> org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:625)
>
> at org.apache.spark.util.Utils$.copyFile(Utils.scala:596)
>
> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:473)
>
> at
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:696)
>
> at
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:688)
>
> at
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>
> at
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>
> at
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>
> at org.apache.spark.executor.Executor.org
> $apache$spark$executor$Executor$$updateDependencies(Executor.scala:688)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:308)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> --
>
>


java.nio.file.FileSystemException: /tmp/spark- .._cache : No space left on device

2018-08-17 Thread Polisetti, Venkata Siva Rama Gopala Krishna
Hi
I am getting the below exception when I run spark-submit on a Linux machine.
Can someone give a quick solution with commands?
Driver stacktrace:
- Job 0 failed: count at DailyGainersAndLosersPublisher.scala:145, took 
5.749450 s
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in 
stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage 0.0 (TID 
6, 172.29.62.145, executor 0): java.nio.file.FileSystemException: 
/tmp/spark-523d5331-3884-440c-ac0d-f46838c2029f/executor-390c9cd7-217e-42f3-97cb-fa2734405585/spark-206d92c0-f0d3-443c-97b2-39494e2c5fdd/-4230744641534510169119_cache
 -> ./PublishGainersandLosers-1.0-SNAPSHOT-shaded-Gopal.jar: No space left on 
device
at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixCopyFile.copyFile(UnixCopyFile.java:253)
at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:581)
at 
sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
at java.nio.file.Files.copy(Files.java:1274)
at 
org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:625)
at org.apache.spark.util.Utils$.copyFile(Utils.scala:596)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:473)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:696)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:688)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:688)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:308)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)






Re: Iterative rdd union + reduceByKey operations on small dataset leads to "No space left on device" error on account of lot of shuffle spill.

2018-07-27 Thread Dinesh Dharme
Yeah, you are right. I ran the experiments locally not on YARN.

On Fri, Jul 27, 2018 at 11:54 PM, Vadim Semenov  wrote:

> `spark.worker.cleanup.enabled=true` doesn't work for YARN.
> On Fri, Jul 27, 2018 at 8:52 AM dineshdharme 
> wrote:
> >
> > I am trying to do a few (union + reduceByKey) operations on a hierarchical
> > dataset in an iterative fashion in RDDs. The first few loops run fine but on
> > the subsequent loops, the operations end up using the whole scratch space
> > provided to it.
> >
> > I have set the spark scratch directory, i.e. SPARK_LOCAL_DIRS , to be one
> > having 100 GB space.
> > The hierarchical dataset, whose size is (< 400 kB), remains constant
> > throughout the iterations.
> > I have tried the worker cleanup flag but it has no effect i.e.
> > "spark.worker.cleanup.enabled=true"
> >
> >
> >
> > Error :
> > Caused by: java.io.IOException: No space left on device
> > at java.io.FileOutputStream.writeBytes(Native Method)
> > at java.io.FileOutputStream.write(FileOutputStream.java:326)
> > at java.io.BufferedOutputStream.flushBuffer(
> BufferedOutputStream.java:82)
> > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
> > at java.io.DataOutputStream.writeLong(DataOutputStream.java:224)
> > at
> > org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$
> writeIndexFileAndCommit$1$$anonfun$apply$mcV$sp$1.apply$mcVJ$sp(
> IndexShuffleBlockResolver.scala:151)
> > at
> > org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$
> writeIndexFileAndCommit$1$$anonfun$apply$mcV$sp$1.apply(
> IndexShuffleBlockResolver.scala:149)
> > at
> > org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$
> writeIndexFileAndCommit$1$$anonfun$apply$mcV$sp$1.apply(
> IndexShuffleBlockResolver.scala:149)
> > at
> > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.
> scala:33)
> > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246)
> > at
> > org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$
> writeIndexFileAndCommit$1.apply$mcV$sp(IndexShuffleBlockResolver.
> scala:149)
> > at
> > org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$
> writeIndexFileAndCommit$1.apply(IndexShuffleBlockResolver.scala:145)
> > at
> > org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$
> writeIndexFileAndCommit$1.apply(IndexShuffleBlockResolver.scala:145)
> > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
> > at
> > org.apache.spark.shuffle.IndexShuffleBlockResolver.
> writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:153)
> > at
> > org.apache.spark.shuffle.sort.SortShuffleWriter.write(
> SortShuffleWriter.scala:73)
> > at
> > org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:96)
> > at
> > org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:53)
> > at org.apache.spark.scheduler.Task.run(Task.scala:109)
> > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> > at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1149)
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:624)
> > at java.lang.Thread.run(Thread.java:748)
> >
> >
> > What I am trying to do (High Level):
> >
> > I have a dataset of 5 different csv ( Parent, Child1, Child2, Child21,
> > Child22 ) which are related in a hierarchical fashion as shown below.
> >
> > Parent-> Child1 -> Child2  -> Child21
> >
> > Parent-> Child1 -> Child2  -> Child22
> >
> > Each element in the tree has 14 columns (elementid, parentelement_id,
> cat1,
> > cat2, num1, num2,., num10)
> >
> > I am trying to aggregate the values of one column of Child21 into Child1
> > (i.e. 2 levels up). I am doing the same for another column value of
> Child22
> > into Child1. Then I am merging these aggregated values at the same Child1
> > level.
> >
> > This is present in the code at location :
> >
> > spark.rddexample.dummyrdd.tree.child1.events.Function1
> >
> >
> > Code which replicates the issue:
> >
> > 1] https://github.com/dineshdharme/SparkRddShuffleIssue
> >
> >
> >
> > Steps to reproduce the issue :
> >
> > 1] Clone the above repository.
> >
> > 2] Put the csvs in the "issue-data" folder in the above repository at a
> > hadoop location "hdfs:///tr

Re: Iterative rdd union + reduceByKey operations on small dataset leads to "No space left on device" error on account of lot of shuffle spill.

2018-07-27 Thread Vadim Semenov
`spark.worker.cleanup.enabled=true` doesn't work for YARN.
On Fri, Jul 27, 2018 at 8:52 AM dineshdharme  wrote:
>
> I am trying to do few (union + reduceByKey) operations on a hiearchical
> dataset in a iterative fashion in rdd. The first few loops run fine but on
> the subsequent loops, the operations ends up using the whole scratch space
> provided to it.
>
> I have set the spark scratch directory, i.e. SPARK_LOCAL_DIRS , to be one
> having 100 GB space.
> The heirarchical dataset, whose size is (< 400kB), remains constant
> throughout the iterations.
> I have tried the worker cleanup flag but it has no effect i.e.
> "spark.worker.cleanup.enabled=true"
>
>
>
> Error :
> Caused by: java.io.IOException: No space left on device
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
> at java.io.DataOutputStream.writeLong(DataOutputStream.java:224)
> at
> org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFileAndCommit$1$$anonfun$apply$mcV$sp$1.apply$mcVJ$sp(IndexShuffleBlockResolver.scala:151)
> at
> org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFileAndCommit$1$$anonfun$apply$mcV$sp$1.apply(IndexShuffleBlockResolver.scala:149)
> at
> org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFileAndCommit$1$$anonfun$apply$mcV$sp$1.apply(IndexShuffleBlockResolver.scala:149)
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246)
> at
> org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFileAndCommit$1.apply$mcV$sp(IndexShuffleBlockResolver.scala:149)
> at
> org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFileAndCommit$1.apply(IndexShuffleBlockResolver.scala:145)
> at
> org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFileAndCommit$1.apply(IndexShuffleBlockResolver.scala:145)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
> at
> org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:153)
> at
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
>
> What I am trying to do (High Level):
>
> I have a dataset of 5 different csv ( Parent, Child1, Child2, Child21,
> Child22 ) which are related in a hierarchical fashion as shown below.
>
> Parent-> Child1 -> Child2  -> Child21
>
> Parent-> Child1 -> Child2  -> Child22
>
> Each element in the tree has 14 columns (elementid, parentelement_id, cat1,
> cat2, num1, num2,., num10)
>
> I am trying to aggregate the values of one column of Child21 into Child1
> (i.e. 2 levels up). I am doing the same for another column value of Child22
> into Child1. Then I am merging these aggregated values at the same Child1
> level.
>
> This is present in the code at location :
>
> spark.rddexample.dummyrdd.tree.child1.events.Function1
>
>
> Code which replicates the issue:
>
> 1] https://github.com/dineshdharme/SparkRddShuffleIssue
>
>
>
> Steps to reproduce the issue :
>
> 1] Clone the above repository.
>
> 2] Put the csvs in the "issue-data" folder in the above repository at a
> hadoop location "hdfs:///tree/dummy/data/"
>
> 3] Set the spark scratch directory (SPARK_LOCAL_DIRS) to a folder which has
> large space. (> 100 GB)
>
> 4] Run "sbt assembly"
>
> 5] Run the following command at the project location
>
> /path/to/spark-2.3.0-bin-hadoop2.7/bin/spark-submit \
> --class spark.rddexample.dummyrdd.FunctionExecutor \
> --master local[2] \
> --deploy-mode client \
> --executor-memory 2G \
> --driver-memory 2G \
> target/scala-2.11/rdd-shuffle-assembly-0.1.0.jar \
> 20 \
> hdfs:///tree/dummy/data/ \
> hdfs:///tree/dummy/results/
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>


-- 
Sent from my iPhone

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Iterative rdd union + reduceByKey operations on small dataset leads to "No space left on device" error on account of lot of shuffle spill.

2018-07-27 Thread dineshdharme
I am trying to do a few (union + reduceByKey) operations on a hierarchical
dataset in an iterative fashion with RDDs. The first few loops run fine, but on
the subsequent loops the operations end up using the whole scratch space
provided to them.

I have set the Spark scratch directory, i.e. SPARK_LOCAL_DIRS, to one
having 100 GB of space.
The hierarchical dataset, whose size is small (< 400 kB), remains constant
throughout the iterations.
I have tried the worker cleanup flag, i.e.
"spark.worker.cleanup.enabled=true", but it has no effect.

 

Error : 
Caused by: java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.writeLong(DataOutputStream.java:224)
at
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFileAndCommit$1$$anonfun$apply$mcV$sp$1.apply$mcVJ$sp(IndexShuffleBlockResolver.scala:151)
at
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFileAndCommit$1$$anonfun$apply$mcV$sp$1.apply(IndexShuffleBlockResolver.scala:149)
at
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFileAndCommit$1$$anonfun$apply$mcV$sp$1.apply(IndexShuffleBlockResolver.scala:149)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246)
at
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFileAndCommit$1.apply$mcV$sp(IndexShuffleBlockResolver.scala:149)
at
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFileAndCommit$1.apply(IndexShuffleBlockResolver.scala:145)
at
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFileAndCommit$1.apply(IndexShuffleBlockResolver.scala:145)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at
org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:153)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
 

What I am trying to do (High Level):

I have a dataset of 5 different CSVs (Parent, Child1, Child2, Child21,
Child22) which are related in a hierarchical fashion as shown below.

Parent -> Child1 -> Child2 -> Child21

Parent -> Child1 -> Child2 -> Child22

Each element in the tree has 14 columns (elementid, parentelement_id, cat1,
cat2, num1, num2, ..., num10).

I am trying to aggregate the values of one column of Child21 into Child1
(i.e. 2 levels up). I am doing the same for another column value of Child22
into Child1. Then I am merging these aggregated values at the same Child1
level.

This is present in the code at location : 

spark.rddexample.dummyrdd.tree.child1.events.Function1
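
In spirit, the loop in that class looks roughly like the minimal sketch below. This is not the code from the repository linked underneath; the key type, the values, and the iteration count are made-up stand-ins.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object UnionReduceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("union-reduce-sketch"))

    // Stand-ins for the per-Child1 aggregates rolled up from Child21 and Child22.
    val child21ByChild1: RDD[(String, Double)] = sc.parallelize(Seq(("c1-a", 1.0), ("c1-b", 2.0)))
    val child22ByChild1: RDD[(String, Double)] = sc.parallelize(Seq(("c1-a", 3.0), ("c1-b", 4.0)))

    // Iterative union + reduceByKey: every pass adds another shuffle stage, so the
    // shuffle files written under SPARK_LOCAL_DIRS keep accumulating across loops
    // until Spark's cleaner (or application exit) removes them.
    var merged: RDD[(String, Double)] = child21ByChild1
    for (_ <- 1 to 20) {
      merged = merged.union(child22ByChild1).reduceByKey(_ + _)
    }
    println(merged.collect().mkString(", "))

    sc.stop()
  }
}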
 

Code which replicates the issue: 

1] https://github.com/dineshdharme/SparkRddShuffleIssue

 

Steps to reproduce the issue : 

1] Clone the above repository.

2] Put the csvs in the "issue-data" folder in the above repository at a
hadoop location "hdfs:///tree/dummy/data/"

3] Set the spark scratch directory (SPARK_LOCAL_DIRS) to a folder which has
large space. (> 100 GB)

4] Run "sbt assembly"

5] Run the following command at the project location 

/path/to/spark-2.3.0-bin-hadoop2.7/bin/spark-submit \
--class spark.rddexample.dummyrdd.FunctionExecutor \
--master local[2] \
--deploy-mode client \
--executor-memory 2G \
--driver-memory 2G \
target/scala-2.11/rdd-shuffle-assembly-0.1.0.jar \
20 \
hdfs:///tree/dummy/data/ \
hdfs:///tree/dummy/results/   



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: No space left on device

2017-10-17 Thread Imran Rajjad
I don't think so. Check out the documentation for this method.

On Wed, Oct 18, 2017 at 10:11 AM, Mina Aslani <aslanim...@gmail.com> wrote:

> I have not tried rdd.unpersist(), I thought using rdd = null is the same,
> is it not?
>
> On Wed, Oct 18, 2017 at 1:07 AM, Imran Rajjad <raj...@gmail.com> wrote:
>
>> did you try calling rdd.unpersist()
>>
>> On Wed, Oct 18, 2017 at 10:04 AM, Mina Aslani <aslanim...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I get "No space left on device" error in my spark worker:
>>>
>>> Error writing stream to file /usr/spark-2.2.0/work/app-.../0/stderr
>>> java.io.IOException: No space left on device
>>>
>>> In my spark cluster, I have one worker and one master.
>>> My program consumes stream of data from kafka and publishes the result
>>> into kafka. I set my RDD = null after I finish working, so that
>>> intermediate shuffle files are removed quickly.
>>>
>>> How can I avoid "No space left on device"?
>>>
>>> Best regards,
>>> Mina
>>>
>>
>>
>>
>> --
>> I.R
>>
>
>


-- 
I.R


Re: No space left on device

2017-10-17 Thread Mina Aslani
I have not tried rdd.unpersist(); I thought using rdd = null was the same.
Is it not?

On Wed, Oct 18, 2017 at 1:07 AM, Imran Rajjad <raj...@gmail.com> wrote:

> did you try calling rdd.unpersist()
>
> On Wed, Oct 18, 2017 at 10:04 AM, Mina Aslani <aslanim...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I get "No space left on device" error in my spark worker:
>>
>> Error writing stream to file /usr/spark-2.2.0/work/app-.../0/stderr
>> java.io.IOException: No space left on device
>>
>> In my spark cluster, I have one worker and one master.
>> My program consumes stream of data from kafka and publishes the result
>> into kafka. I set my RDD = null after I finish working, so that
>> intermediate shuffle files are removed quickly.
>>
>> How can I avoid "No space left on device"?
>>
>> Best regards,
>> Mina
>>
>
>
>
> --
> I.R
>


Re: No space left on device

2017-10-17 Thread Imran Rajjad
did you try calling rdd.unpersist()
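
For reference, a minimal sketch of the difference being discussed, with a made-up RDD: unpersist() removes the cached blocks right away, while dropping the reference only leaves them for Spark's reference-tracking cleaner to reclaim later.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object UnpersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("unpersist-sketch"))

    var batch: RDD[Int] = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_AND_DISK)
    println(batch.count())

    // Ask the block manager to drop the cached blocks (memory and disk) now.
    batch.unpersist(blocking = true)

    // Merely nulling the reference does not free anything immediately; the blocks
    // are only cleaned up once the driver-side RDD object is garbage collected
    // and the ContextCleaner notices it.
    batch = null

    sc.stop()
  }
}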

On Wed, Oct 18, 2017 at 10:04 AM, Mina Aslani <aslanim...@gmail.com> wrote:

> Hi,
>
> I get "No space left on device" error in my spark worker:
>
> Error writing stream to file /usr/spark-2.2.0/work/app-.../0/stderr
> java.io.IOException: No space left on device
>
> In my spark cluster, I have one worker and one master.
> My program consumes stream of data from kafka and publishes the result
> into kafka. I set my RDD = null after I finish working, so that
> intermediate shuffle files are removed quickly.
>
> How can I avoid "No space left on device"?
>
> Best regards,
> Mina
>



-- 
I.R


Re: No space left on device

2017-10-17 Thread Chetan Khatri
Process the data in micro-batches.
On 18-Oct-2017 10:36 AM, "Chetan Khatri" <chetan.opensou...@gmail.com>
wrote:

> Your hard drive don't have much space
> On 18-Oct-2017 10:35 AM, "Mina Aslani" <aslanim...@gmail.com> wrote:
>
>> Hi,
>>
>> I get "No space left on device" error in my spark worker:
>>
>> Error writing stream to file /usr/spark-2.2.0/work/app-.../0/stderr
>> java.io.IOException: No space left on device
>>
>> In my spark cluster, I have one worker and one master.
>> My program consumes stream of data from kafka and publishes the result
>> into kafka. I set my RDD = null after I finish working, so that
>> intermediate shuffle files are removed quickly.
>>
>> How can I avoid "No space left on device"?
>>
>> Best regards,
>> Mina
>>
>


Re: No space left on device

2017-10-17 Thread Chetan Khatri
Your hard drive doesn't have much space.
On 18-Oct-2017 10:35 AM, "Mina Aslani" <aslanim...@gmail.com> wrote:

> Hi,
>
> I get "No space left on device" error in my spark worker:
>
> Error writing stream to file /usr/spark-2.2.0/work/app-.../0/stderr
> java.io.IOException: No space left on device
>
> In my spark cluster, I have one worker and one master.
> My program consumes stream of data from kafka and publishes the result
> into kafka. I set my RDD = null after I finish working, so that
> intermediate shuffle files are removed quickly.
>
> How can I avoid "No space left on device"?
>
> Best regards,
> Mina
>


No space left on device

2017-10-17 Thread Mina Aslani
Hi,

I get "No space left on device" error in my spark worker:

Error writing stream to file /usr/spark-2.2.0/work/app-.../0/stderr
java.io.IOException: No space left on device

In my spark cluster, I have one worker and one master.
My program consumes stream of data from kafka and publishes the result into
kafka. I set my RDD = null after I finish working, so that intermediate
shuffle files are removed quickly.

How can I avoid "No space left on device"?

Best regards,
Mina


RE: No space left on device when running graphx job

2015-10-05 Thread Jack Yang
Just the usual checks, as below:

1. Check the physical disk volume (particularly the /tmp folder)

2. Check spark.local.dir and the size of the temp files under it

3. Add more workers

4. Decrease partitions (in code) (see the sketch below)
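
For item 4, a minimal, hypothetical sketch of decreasing partitions in code; the input path and the partition counts are made-up values, not taken from the job above.

import org.apache.spark.{SparkConf, SparkContext}

object FewerPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fewer-partitions-sketch"))

    // Illustrative input split into many partitions.
    val records = sc.textFile("hdfs:///input/records.txt", minPartitions = 512)

    // Coalescing before the shuffle-heavy work typically means fewer map tasks,
    // and hence fewer (larger) shuffle files under the scratch directory.
    val fewer = records.coalesce(128)
    println(fewer.map(_.length.toLong).reduce(_ + _))

    sc.stop()
  }
}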

From: Robin East [mailto:robin.e...@xense.co.uk]
Sent: Saturday, 26 September 2015 12:27 AM
To: Jack Yang
Cc: Ted Yu; Andy Huang; user@spark.apache.org
Subject: Re: No space left on device when running graphx job

Would you mind sharing what your solution was? It would help those on the forum 
who might run into the same problem. Even if it’s a silly ‘gotcha’, it would 
help to know what it was and how you spotted the source of the issue.

Robin



On 25 Sep 2015, at 05:34, Jack Yang <j...@uow.edu.au<mailto:j...@uow.edu.au>> 
wrote:

Hi all,
I resolved the problems.
Thanks folk.
Jack

From: Jack Yang [mailto:j...@uow.edu.au]
Sent: Friday, 25 September 2015 9:57 AM
To: Ted Yu; Andy Huang
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: RE: No space left on device when running graphx job

Also, please see the screenshot below from spark web ui:
This is the snapshot just 5 seconds (I guess) before the job crashed.



From: Jack Yang [mailto:j...@uow.edu.au]
Sent: Friday, 25 September 2015 9:55 AM
To: Ted Yu; Andy Huang
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: RE: No space left on device when running graphx job

Hi, here is the full stack trace:

15/09/25 09:50:14 WARN scheduler.TaskSetManager: Lost task 21088.0 in stage 6.0 
(TID 62230, 192.168.70.129): java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:345)
at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.writeLong(DataOutputStream.java:224)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFile$1$$anonfun$apply$mcV$sp$1.apply$mcVJ$sp(IndexShuffleBlockResolver.scala:86)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFile$1$$anonfun$apply$mcV$sp$1.apply(IndexShuffleBlockResolver.scala:84)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFile$1$$anonfun$apply$mcV$sp$1.apply(IndexShuffleBlockResolver.scala:84)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:168)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFile$1.apply$mcV$sp(IndexShuffleBlockResolver.scala:84)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFile$1.apply(IndexShuffleBlockResolver.scala:80)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFile$1.apply(IndexShuffleBlockResolver.scala:80)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFile(IndexShuffleBlockResolver.scala:88)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:71)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)


I am using the df -i command to monitor the inode usage, which shows the below all 
the time:

Filesystem   Inodes   IUsed   IFree   IUse%  Mounted on
/dev/sda1    1245184  275424  969760  23%    /
udev         382148   484     381664  1%     /dev
tmpfs        384505   366     384139  1%     /run
none         384505   3       384502  1%     /run/lock
none         384505   1       384504  1%     /run/shm



From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Thursday, 24 September 2015 9:12 PM
To: Andy Huang
Cc: Jack Yang; user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: No space left on device when running graphx job

Andy:
Can you show complete stack trace ?

Have you checked there are enough free inode on the .129 machine ?

Cheers

On Sep 23, 2015, at 11:43 PM, Andy Huang 
<andy.hu...@servian.com.au<mailto:andy.hu...@servian.com.au>> wrote:
Hi Jack,

Are you writing out to disk? Or it sounds like Spark is spilling to disk (RAM 
filled up) and it's running out of disk space.

Cheers
Andy

On Thu, Sep 24, 2015 at 4:29 PM, Jack Yang 
<j...@uow.edu.au<mailto:j...@uow.edu.au>> wrote:
Hi folk,

I ha

Re: No space left on device when running graphx job

2015-09-24 Thread Ted Yu
Andy:
Can you show complete stack trace ?

Have you checked there are enough free inode on the .129 machine ?

Cheers

> On Sep 23, 2015, at 11:43 PM, Andy Huang <andy.hu...@servian.com.au> wrote:
> 
> Hi Jack,
> 
> Are you writing out to disk? Or it sounds like Spark is spilling to disk (RAM 
> filled up) and it's running out of disk space.
> 
> Cheers
> Andy
> 
>> On Thu, Sep 24, 2015 at 4:29 PM, Jack Yang <j...@uow.edu.au> wrote:
>> Hi folk,
>> 
>>  
>> 
>> I have an issue of graphx. (spark: 1.4.0 + 4 machines + 4G memory + 4 CPU 
>> cores)
>> 
>> Basically, I load data using GraphLoader.edgeListFile mthod and then count 
>> number of nodes using: graph.vertices.count() method.
>> 
>> The problem is :
>> 
>>  
>> 
>> Lost task 11972.0 in stage 6.0 (TID 54585, 192.168.70.129): 
>> java.io.IOException: No space left on device
>> 
>> at java.io.FileOutputStream.writeBytes(Native Method)
>> 
>> at java.io.FileOutputStream.write(FileOutputStream.java:345)
>> 
>>  
>> 
>> when I try a small amount of data, the code is working. So I guess the error 
>> comes from the amount of data.
>> 
>> This is how I submit the job:
>> 
>>  
>> 
>> spark-submit --class "myclass"
>> 
>> --master spark://hadoopmaster:7077  (I am using standalone)
>> 
>> --executor-memory 2048M
>> 
>> --driver-java-options "-XX:MaxPermSize=2G" 
>> 
>> --total-executor-cores 4  my.jar
>> 
>>  
>> 
>>  
>> 
>> Any thoughts?
>> 
>> Best regards,
>> 
>> Jack
>> 
> 
> 
> 
> -- 
> Andy Huang | Managing Consultant | Servian Pty Ltd | t: 02 9376 0700 | f: 02 
> 9376 0730| m: 0433221979


No space left on device when running graphx job

2015-09-24 Thread Jack Yang
Hi folk,

I have an issue with GraphX (Spark 1.4.0 + 4 machines + 4 GB memory + 4 CPU cores).
Basically, I load data using the GraphLoader.edgeListFile method and then count 
the number of nodes using the graph.vertices.count() method.
The problem is :

Lost task 11972.0 in stage 6.0 (TID 54585, 192.168.70.129): 
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:345)

when I try a small amount of data, the code is working. So I guess the error 
comes from the amount of data.
This is how I submit the job:

spark-submit --class "myclass"
--master spark://hadoopmaster:7077  (I am using standalone)
--executor-memory 2048M
--driver-java-options "-XX:MaxPermSize=2G"
--total-executor-cores 4  my.jar


Any thoughts?
Best regards,
Jack
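
For reference, the load-and-count pattern described above looks roughly like the sketch below; the path and the partition count are placeholders, not the actual values from this job.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

object VertexCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("vertex-count-sketch"))

    // Load an edge list; the path and partition count are placeholders.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt", numEdgePartitions = 64)

    // Counting the vertices materialises the vertex RDD, which is roughly the
    // shuffle-heavy step behind the failing stage reported above.
    println(s"vertices: ${graph.vertices.count()}")

    sc.stop()
  }
}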



Re: No space left on device when running graphx job

2015-09-24 Thread Andy Huang
Hi Jack,

Are you writing out to disk? Or it sounds like Spark is spilling to disk
(RAM filled up) and it's running out of disk space.

Cheers
Andy

On Thu, Sep 24, 2015 at 4:29 PM, Jack Yang <j...@uow.edu.au> wrote:

> Hi folk,
>
>
>
> I have an issue of graphx. (spark: 1.4.0 + 4 machines + 4G memory + 4 CPU
> cores)
>
> Basically, I load data using GraphLoader.edgeListFile mthod and then count
> number of nodes using: graph.vertices.count() method.
>
> The problem is :
>
>
>
> *Lost task 11972.0 in stage 6.0 (TID 54585, 192.168.70.129):
> java.io.IOException: No space left on device*
>
> *at java.io.FileOutputStream.writeBytes(Native Method)*
>
> *at java.io.FileOutputStream.write(FileOutputStream.java:345)*
>
>
>
> when I try a small amount of data, the code is working. So I guess the
> error comes from the amount of data.
>
> This is how I submit the job:
>
>
>
> spark-submit --class "myclass"
>
> --master spark://hadoopmaster:7077  (I am using standalone)
>
> --executor-memory 2048M
>
> --driver-java-options "-XX:MaxPermSize=2G"
>
> --total-executor-cores 4  my.jar
>
>
>
>
>
> Any thoughts?
>
> Best regards,
>
> Jack
>
>
>



-- 
Andy Huang | Managing Consultant | Servian Pty Ltd | t: 02 9376 0700 |
f: 02 9376 0730| m: 0433221979


RE: No space left on device when running graphx job

2015-09-24 Thread Jack Yang
Hi all,
I resolved the problems.
Thanks, folks.
Jack

From: Jack Yang [mailto:j...@uow.edu.au]
Sent: Friday, 25 September 2015 9:57 AM
To: Ted Yu; Andy Huang
Cc: user@spark.apache.org
Subject: RE: No space left on device when running graphx job

Also, please see the screenshot below from spark web ui:
This is the snapshot just 5 seconds (I guess) before the job crashed.

[web UI screenshot omitted]

From: Jack Yang [mailto:j...@uow.edu.au]
Sent: Friday, 25 September 2015 9:55 AM
To: Ted Yu; Andy Huang
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: RE: No space left on device when running graphx job

Hi, here is the full stack trace:

15/09/25 09:50:14 WARN scheduler.TaskSetManager: Lost task 21088.0 in stage 6.0 
(TID 62230, 192.168.70.129): java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:345)
at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.writeLong(DataOutputStream.java:224)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFile$1$$anonfun$apply$mcV$sp$1.apply$mcVJ$sp(IndexShuffleBlockResolver.scala:86)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFile$1$$anonfun$apply$mcV$sp$1.apply(IndexShuffleBlockResolver.scala:84)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFile$1$$anonfun$apply$mcV$sp$1.apply(IndexShuffleBlockResolver.scala:84)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:168)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFile$1.apply$mcV$sp(IndexShuffleBlockResolver.scala:84)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFile$1.apply(IndexShuffleBlockResolver.scala:80)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver$$anonfun$writeIndexFile$1.apply(IndexShuffleBlockResolver.scala:80)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
at 
org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFile(IndexShuffleBlockResolver.scala:88)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:71)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)


I am using the df -i command to monitor the inode usage, which shows the below all 
the time:

Filesystem   Inodes   IUsed   IFree   IUse%  Mounted on
/dev/sda1    1245184  275424  969760  23%    /
udev         382148   484     381664  1%     /dev
tmpfs        384505   366     384139  1%     /run
none         384505   3       384502  1%     /run/lock
none         384505   1       384504  1%     /run/shm



From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Thursday, 24 September 2015 9:12 PM
To: Andy Huang
Cc: Jack Yang; user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: No space left on device when running graphx job

Andy:
Can you show complete stack trace ?

Have you checked there are enough free inode on the .129 machine ?

Cheers

On Sep 23, 2015, at 11:43 PM, Andy Huang 
<andy.hu...@servian.com.au<mailto:andy.hu...@servian.com.au>> wrote:
Hi Jack,

Are you writing out to disk? Or it sounds like Spark is spilling to disk (RAM 
filled up) and it's running out of disk space.

Cheers
Andy

On Thu, Sep 24, 2015 at 4:29 PM, Jack Yang 
<j...@uow.edu.au<mailto:j...@uow.edu.au>> wrote:
Hi folk,

I have an issue of graphx. (spark: 1.4.0 + 4 machines + 4G memory + 4 CPU cores)
Basically, I load data using GraphLoader.edgeListFile mthod and then count 
number of nodes using: graph.vertices.count() method.
The problem is :

Lost task 11972.0 in stage 6.0 (TID 54585, 192.168.70.129): 
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:345)

when I try a small amount of data, the code is working. So I guess the error 
comes from the amount of data.
This is how I submit the job:

spark-submit --class "myclass"
--master spark://hadoopmaster:7077  (I am using standalone)
--executor-memory 2048M
--driver-java-options "-XX:MaxPermSize=2G"

Re: SparkContext initialization error- java.io.IOException: No space left on device

2015-09-06 Thread shenyan zhen
Thank you both - yup: the /tmp disk space was filled up:)

On Sun, Sep 6, 2015 at 11:51 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Use the following command if needed:
> df -i /tmp
>
> See
> https://wiki.gentoo.org/wiki/Knowledge_Base:No_space_left_on_device_while_there_is_plenty_of_space_available
>
> On Sun, Sep 6, 2015 at 6:15 AM, Shixiong Zhu <zsxw...@gmail.com> wrote:
>
>> The folder is in "/tmp" by default. Could you use "df -h" to check the
>> free space of /tmp?
>>
>> Best Regards,
>> Shixiong Zhu
>>
>> 2015-09-05 9:50 GMT+08:00 shenyan zhen <shenya...@gmail.com>:
>>
>>> Has anyone seen this error? Not sure which dir the program was trying to
>>> write to.
>>>
>>> I am running Spark 1.4.1, submitting Spark job to Yarn, in yarn-client
>>> mode.
>>>
>>> 15/09/04 21:36:06 ERROR SparkContext: Error adding jar
>>> (java.io.IOException: No space left on device), was the --addJars option
>>> used?
>>>
>>> 15/09/04 21:36:08 ERROR SparkContext: Error initializing SparkContext.
>>>
>>> java.io.IOException: No space left on device
>>>
>>> at java.io.FileOutputStream.writeBytes(Native Method)
>>>
>>> at java.io.FileOutputStream.write(FileOutputStream.java:300)
>>>
>>> at
>>> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:178)
>>>
>>> at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:213)
>>>
>>> at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:318)
>>>
>>> at
>>> java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:163)
>>>
>>> at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:338)
>>>
>>> at
>>> org.apache.spark.deploy.yarn.Client.createConfArchive(Client.scala:432)
>>>
>>> at
>>> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:338)
>>>
>>> at
>>> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561)
>>>
>>> at
>>> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
>>>
>>> at
>>> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
>>>
>>> at
>>> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
>>>
>>> at org.apache.spark.SparkContext.(SparkContext.scala:497)
>>>
>>> Thanks,
>>> Shenyan
>>>
>>
>>
>


Re: SparkContext initialization error- java.io.IOException: No space left on device

2015-09-06 Thread Shixiong Zhu
The folder is in "/tmp" by default. Could you use "df -h" to check the free
space of /tmp?

Best Regards,
Shixiong Zhu

2015-09-05 9:50 GMT+08:00 shenyan zhen <shenya...@gmail.com>:

> Has anyone seen this error? Not sure which dir the program was trying to
> write to.
>
> I am running Spark 1.4.1, submitting Spark job to Yarn, in yarn-client
> mode.
>
> 15/09/04 21:36:06 ERROR SparkContext: Error adding jar
> (java.io.IOException: No space left on device), was the --addJars option
> used?
>
> 15/09/04 21:36:08 ERROR SparkContext: Error initializing SparkContext.
>
> java.io.IOException: No space left on device
>
> at java.io.FileOutputStream.writeBytes(Native Method)
>
> at java.io.FileOutputStream.write(FileOutputStream.java:300)
>
> at
> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:178)
>
> at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:213)
>
> at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:318)
>
> at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:163)
>
> at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:338)
>
> at org.apache.spark.deploy.yarn.Client.createConfArchive(Client.scala:432)
>
> at
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:338)
>
> at
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561)
>
> at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
>
> at
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
>
> at
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
>
> at org.apache.spark.SparkContext.(SparkContext.scala:497)
>
> Thanks,
> Shenyan
>


Re: SparkContext initialization error- java.io.IOException: No space left on device

2015-09-06 Thread Ted Yu
Use the following command if needed:
df -i /tmp

See
https://wiki.gentoo.org/wiki/Knowledge_Base:No_space_left_on_device_while_there_is_plenty_of_space_available

On Sun, Sep 6, 2015 at 6:15 AM, Shixiong Zhu <zsxw...@gmail.com> wrote:

> The folder is in "/tmp" by default. Could you use "df -h" to check the
> free space of /tmp?
>
> Best Regards,
> Shixiong Zhu
>
> 2015-09-05 9:50 GMT+08:00 shenyan zhen <shenya...@gmail.com>:
>
>> Has anyone seen this error? Not sure which dir the program was trying to
>> write to.
>>
>> I am running Spark 1.4.1, submitting Spark job to Yarn, in yarn-client
>> mode.
>>
>> 15/09/04 21:36:06 ERROR SparkContext: Error adding jar
>> (java.io.IOException: No space left on device), was the --addJars option
>> used?
>>
>> 15/09/04 21:36:08 ERROR SparkContext: Error initializing SparkContext.
>>
>> java.io.IOException: No space left on device
>>
>> at java.io.FileOutputStream.writeBytes(Native Method)
>>
>> at java.io.FileOutputStream.write(FileOutputStream.java:300)
>>
>> at
>> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:178)
>>
>> at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:213)
>>
>> at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:318)
>>
>> at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:163)
>>
>> at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:338)
>>
>> at org.apache.spark.deploy.yarn.Client.createConfArchive(Client.scala:432)
>>
>> at
>> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:338)
>>
>> at
>> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561)
>>
>> at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
>>
>> at
>> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
>>
>> at
>> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
>>
>> at org.apache.spark.SparkContext.(SparkContext.scala:497)
>>
>> Thanks,
>> Shenyan
>>
>
>


SparkContext initialization error- java.io.IOException: No space left on device

2015-09-04 Thread shenyan zhen
Has anyone seen this error? Not sure which dir the program was trying to
write to.

I am running Spark 1.4.1, submitting Spark job to Yarn, in yarn-client mode.

15/09/04 21:36:06 ERROR SparkContext: Error adding jar
(java.io.IOException: No space left on device), was the --addJars option
used?

15/09/04 21:36:08 ERROR SparkContext: Error initializing SparkContext.

java.io.IOException: No space left on device

at java.io.FileOutputStream.writeBytes(Native Method)

at java.io.FileOutputStream.write(FileOutputStream.java:300)

at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:178)

at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:213)

at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:318)

at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:163)

at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:338)

at org.apache.spark.deploy.yarn.Client.createConfArchive(Client.scala:432)

at
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:338)

at
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561)

at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)

at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)

at
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)

at org.apache.spark.SparkContext.(SparkContext.scala:497)

Thanks,
Shenyan


Re: java.io.IOException: No space left on device--regd.

2015-07-06 Thread Akhil Das
While the job is running, just look in the directory and see what the root
cause of it is (is it the logs? is it the shuffle? etc.). Here are a few
configuration options which you can try:

- Disable shuffle spill: spark.shuffle.spill=false (it might end up in OOM)
- Enable log rotation:

sparkConf.set("spark.executor.logs.rolling.strategy", "size")
.set("spark.executor.logs.rolling.size.maxBytes", "1024")
.set("spark.executor.logs.rolling.maxRetainedFiles", "3")


Thanks
Best Regards

On Mon, Jul 6, 2015 at 10:44 AM, Devarajan Srinivasan 
devathecool1...@gmail.com wrote:

 Hi ,

  I am trying to run an ETL on spark which involves expensive shuffle
 operation. Basically I require a self-join to be performed on a
 sparkDataFrame RDD . The job runs fine for around 15 hours and when the
 stage(which performs the sef-join) is about to complete, I get a 
 *java.io.IOException:
 No space left on device*. I initially thought this could be due  to
 *spark.local.dir* pointing to */tmp* directory which was configured with
 *2GB* of space, since this job requires expensive shuffles,spark
 requires  more space to write the  shuffle files. Hence I configured
 *spark.local.dir* to point to a different directory which has *1TB* of
 space. But still I get the same *no space left exception*. What could be
 the root cause of this issue?


 Thanks in advance.

 *Exception stacktrace:*

 *java.io.IOException: No space left on device
   at java.io.FileOutputStream.writeBytes(Native Method)
   at java.io.FileOutputStream.write(FileOutputStream.java:345)
   at 
 org.apache.spark.storage.DiskBlockObjectWriter$TimeTrackingOutputStream$$anonfun$write$3.apply$mcV$sp(BlockObjectWriter.scala:87)
   at org.apache.spark.storage.DiskBlockObjectWriter.org 
 http://org.apache.spark.storage.DiskBlockObjectWriter.org$apache$spark$storage$DiskBlockObjectWriter$$callWithTiming(BlockObjectWriter.scala:229)
   at 
 org.apache.spark.storage.DiskBlockObjectWriter$TimeTrackingOutputStream.write(BlockObjectWriter.scala:87)
   at 
 java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
   at 
 org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:297)
   at 
 org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:244)
   at 
 org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:99)
   at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
   at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
   at 
 java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1285)
   at 
 java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
   at 
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
   at 
 java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1576)
   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:350)
   at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
   at 
 org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:204)
   at 
 org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:370)
   at 
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
   at 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)*





Re: java.io.IOException: No space left on device--regd.

2015-07-06 Thread Akhil Das
You can also set these in the spark-env.sh file :

export SPARK_WORKER_DIR=/mnt/spark/
export SPARK_LOCAL_DIR=/mnt/spark/



Thanks
Best Regards

On Mon, Jul 6, 2015 at 12:29 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 While the job is running, just look in the directory and see whats the
 root cause of it (is it the logs? is it the shuffle? etc). Here's a few
 configuration options which you can try:

 - Disable shuffle : spark.shuffle.spill=false (It might end up in OOM)
 - Enable log rotation:

 sparkConf.set(spark.executor.logs.rolling.strategy, size)
 .set(spark.executor.logs.rolling.size.maxBytes, 1024)
 .set(spark.executor.logs.rolling.maxRetainedFiles, 3)


 Thanks
 Best Regards

 On Mon, Jul 6, 2015 at 10:44 AM, Devarajan Srinivasan 
 devathecool1...@gmail.com wrote:

 Hi ,

  I am trying to run an ETL on spark which involves expensive shuffle
 operation. Basically I require a self-join to be performed on a
 sparkDataFrame RDD . The job runs fine for around 15 hours and when the
 stage(which performs the sef-join) is about to complete, I get a 
 *java.io.IOException:
 No space left on device*. I initially thought this could be due  to
 *spark.local.dir* pointing to */tmp* directory which was configured with
 *2GB* of space, since this job requires expensive shuffles,spark
 requires  more space to write the  shuffle files. Hence I configured
 *spark.local.dir* to point to a different directory which has *1TB* of
 space. But still I get the same *no space left exception*. What could be
 the root cause of this issue?


 Thanks in advance.

 *Exception stacktrace:*

 *java.io.IOException: No space left on device
  at java.io.FileOutputStream.writeBytes(Native Method)
  at java.io.FileOutputStream.write(FileOutputStream.java:345)
  at 
 org.apache.spark.storage.DiskBlockObjectWriter$TimeTrackingOutputStream$$anonfun$write$3.apply$mcV$sp(BlockObjectWriter.scala:87)
  at org.apache.spark.storage.DiskBlockObjectWriter.org 
 http://org.apache.spark.storage.DiskBlockObjectWriter.org$apache$spark$storage$DiskBlockObjectWriter$$callWithTiming(BlockObjectWriter.scala:229)
  at 
 org.apache.spark.storage.DiskBlockObjectWriter$TimeTrackingOutputStream.write(BlockObjectWriter.scala:87)
  at 
 java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
  at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
  at 
 org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:297)
  at 
 org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:244)
  at 
 org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:99)
  at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
  at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
  at 
 java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1285)
  at 
 java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
  at 
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
  at 
 java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1576)
  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:350)
  at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
  at 
 org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:204)
  at 
 org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:370)
  at 
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
  at 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
  at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
  at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:64)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
  at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)*






java.io.IOException: No space left on device--regd.

2015-07-05 Thread Devarajan Srinivasan
Hi,

I am trying to run an ETL on Spark which involves an expensive shuffle
operation. Basically, I require a self-join to be performed on a Spark
DataFrame. The job runs fine for around 15 hours, and when the
stage (which performs the self-join) is about to complete, I get a
*java.io.IOException:
No space left on device*. I initially thought this could be due to
*spark.local.dir* pointing to the */tmp* directory, which was configured with
*2GB* of space; since this job requires expensive shuffles, Spark requires
more space to write the shuffle files. Hence I configured *spark.local.dir*
to point to a different directory which has *1TB* of space. But I still get
the same *no space left* exception. What could be the root cause of this
issue?


Thanks in advance.
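
A rough sketch of the setup described above (a DataFrame self-join with spark.local.dir relocated to a larger volume), written against the current SparkSession API for brevity; the paths and the join column are assumptions. Note the caveat elsewhere in these threads that a cluster manager's SPARK_LOCAL_DIRS/LOCAL_DIRS settings override spark.local.dir.

import org.apache.spark.sql.SparkSession

object SelfJoinSketch {
  def main(args: Array[String]): Unit = {
    // spark.local.dir has to be set before the SparkContext starts, e.g. here
    // or via spark-submit --conf; the path is an assumption.
    val spark = SparkSession.builder()
      .appName("self-join-sketch")
      .config("spark.local.dir", "/data1/spark-scratch")
      .getOrCreate()

    val df = spark.read.option("header", "true").csv("hdfs:///etl/input")

    // The self-join is the shuffle-heavy step; its spill and shuffle files land
    // under spark.local.dir on each executor.
    val joined = df.alias("l").join(df.alias("r"), Seq("id"))

    joined.write.parquet("hdfs:///etl/output")
    spark.stop()
  }
}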

*Exception stacktrace:*

*java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:345)
at 
org.apache.spark.storage.DiskBlockObjectWriter$TimeTrackingOutputStream$$anonfun$write$3.apply$mcV$sp(BlockObjectWriter.scala:87)
at org.apache.spark.storage.DiskBlockObjectWriter.org
http://org.apache.spark.storage.DiskBlockObjectWriter.org$apache$spark$storage$DiskBlockObjectWriter$$callWithTiming(BlockObjectWriter.scala:229)
at 
org.apache.spark.storage.DiskBlockObjectWriter$TimeTrackingOutputStream.write(BlockObjectWriter.scala:87)
at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at 
org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:297)
at 
org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:244)
at 
org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:99)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at 
java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1285)
at 
java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1576)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:350)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at 
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:204)
at 
org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:370)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)*


No space left on device??

2015-05-06 Thread Yifan LI
Hi,

I am running my GraphX application on Spark, but it failed with a “no space 
left on device” error on one executor node (on which the available HDFS space 
is small).

I can understand why it happened, because my vertex(-attribute) RDD was 
becoming bigger and bigger during the computation, so at some point the space 
requested on that node may have been bigger than the available space.

But is there any way to avoid this kind of error? I am sure that the overall 
disk space of all the nodes is enough for my application.

Thanks in advance!



Best,
Yifan LI







Re: No space left on device??

2015-05-06 Thread Saisai Shao
As for executor distribution: in YARN mode the RM normally tries its best
to distribute containers evenly if you don't explicitly specify a preferred
host. In standalone mode one node normally has only one executor, so
executor distribution is usually not a big problem.

Data skew, on the other hand, will lead to unbalanced task execution times and
intermediate data spill. If some of your nodes process large parts of the
data, these nodes will have more spilled data and will easily run out of
disk space. I'm not sure whether you are actually hitting such a problem; it
is hard to solve and needs to be fixed at the data and application
implementation level, IMO.
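
As a minimal, hypothetical way to check for the kind of data skew described above, one can count records per partition; the input path and key extraction below are illustrative only.

import org.apache.spark.{SparkConf, SparkContext}

object SkewCheckSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("skew-check-sketch"))

    // Illustrative keyed input; in practice this would be the RDD feeding the big shuffle.
    val data = sc.textFile("hdfs:///input/records").map(line => (line.split(",")(0), line))

    // Record count per partition: if a handful of partitions hold most of the data,
    // the executors (and disks) that process them will spill far more than the rest.
    val perPartition = data
      .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
      .collect()

    perPartition.sortBy(-_._2).take(10).foreach { case (idx, n) =>
      println(s"partition $idx -> $n records")
    }

    sc.stop()
  }
}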

2015-05-06 21:21 GMT+08:00 Yifan LI iamyifa...@gmail.com:

 Yes, you are right. For now I have to say the workload/executor is
 distributed evenly…so, like you said, it is difficult to improve the
 situation.

 However, have you any idea of how to make a *skew* data/executor
 distribution?



 Best,
 Yifan LI





 On 06 May 2015, at 15:13, Saisai Shao sai.sai.s...@gmail.com wrote:

 I think it depends on your workload and executor distribution, if your
 workload is evenly distributed without any big data skew, and executors are
 evenly distributed on each nodes, the storage usage of each node is nearly
 the same. Spark itself cannot rebalance the storage overhead as you
 mentioned.

 2015-05-06 21:09 GMT+08:00 Yifan LI iamyifa...@gmail.com:

 Thanks, Shao. :-)

 I am wondering if the spark will rebalance the storage overhead in
 runtime…since still there is some available space on other nodes.


 Best,
 Yifan LI





 On 06 May 2015, at 14:57, Saisai Shao sai.sai.s...@gmail.com wrote:

 I think you could configure multiple disks through spark.local.dir,
 default is /tmp. Anyway if your intermediate data is larger than available
 disk space, still will meet this issue.

 spark.local.dir/tmpDirectory to use for scratch space in Spark,
 including map output files and RDDs that get stored on disk. This should be
 on a fast, local disk in your system. It can also be a comma-separated list
 of multiple directories on different disks. NOTE: In Spark 1.0 and later
 this will be overriden by SPARK_LOCAL_DIRS (Standalone, Mesos) or
 LOCAL_DIRS (YARN) environment variables set by the cluster manager.

 2015-05-06 20:35 GMT+08:00 Yifan LI iamyifa...@gmail.com:

 Hi,

 I am running my graphx application on Spark, but it failed since there
 is an error on one executor node(on which available hdfs space is small)
 that “no space left on device”.

 I can understand why it happened, because my vertex(-attribute) rdd was
 becoming bigger and bigger during computation…, so maybe sometime the
 request on that node was too bigger than available space.

 But, is there any way to avoid this kind of error? I am sure that the
 overall disk space of all nodes is enough for my application.

 Thanks in advance!



 Best,
 Yifan LI












Re: No space left on device??

2015-05-06 Thread Yifan LI
Thanks, Shao. :-)

I am wondering if Spark will rebalance the storage overhead at 
runtime, since there is still some available space on the other nodes.


Best,
Yifan LI





 On 06 May 2015, at 14:57, Saisai Shao sai.sai.s...@gmail.com wrote:
 
 I think you could configure multiple disks through spark.local.dir, default 
 is /tmp. Anyway if your intermediate data is larger than available disk 
 space, still will meet this issue.
 
 spark.local.dir   /tmpDirectory to use for scratch space in Spark, 
 including map output files and RDDs that get stored on disk. This should be 
 on a fast, local disk in your system. It can also be a comma-separated list 
 of multiple directories on different disks. NOTE: In Spark 1.0 and later this 
 will be overriden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS 
 (YARN) environment variables set by the cluster manager.
 
 2015-05-06 20:35 GMT+08:00 Yifan LI iamyifa...@gmail.com 
 mailto:iamyifa...@gmail.com:
 Hi,
 
 I am running my graphx application on Spark, but it failed since there is an 
 error on one executor node(on which available hdfs space is small) that “no 
 space left on device”.
 
 I can understand why it happened, because my vertex(-attribute) rdd was 
 becoming bigger and bigger during computation…, so maybe sometime the request 
 on that node was too bigger than available space.
 
 But, is there any way to avoid this kind of error? I am sure that the overall 
 disk space of all nodes is enough for my application.
 
 Thanks in advance!
 
 
 
 Best,
 Yifan LI
 
 
 
 
 
 



Re: No space left on device??

2015-05-06 Thread Saisai Shao
I think you could configure multiple disks through spark.local.dir; the default
is /tmp. Anyway, if your intermediate data is larger than the available disk
space, you will still meet this issue.

spark.local.dir (default: /tmp): Directory to use for scratch space in Spark, including
map output files and RDDs that get stored on disk. This should be on a
fast, local disk in your system. It can also be a comma-separated list of
multiple directories on different disks. NOTE: In Spark 1.0 and later this
will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS
(YARN) environment variables set by the cluster manager.
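
A minimal sketch of the comma-separated form mentioned above; the mount points are assumptions, and (per the NOTE) a cluster manager's SPARK_LOCAL_DIRS/LOCAL_DIRS settings would take precedence.

import org.apache.spark.{SparkConf, SparkContext}

object MultiDiskScratchSketch {
  def main(args: Array[String]): Unit = {
    // Spread Spark's scratch space (map output files, spilled RDD blocks)
    // over several physical disks.
    val conf = new SparkConf()
      .setAppName("multi-disk-scratch-sketch")
      .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark")

    val sc = new SparkContext(conf)

    // ...shuffle-heavy GraphX or RDD work would run here...

    sc.stop()
  }
}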

2015-05-06 20:35 GMT+08:00 Yifan LI iamyifa...@gmail.com:

 Hi,

 I am running my graphx application on Spark, but it failed since there is
 an error on one executor node(on which available hdfs space is small) that
 “no space left on device”.

 I can understand why it happened, because my vertex(-attribute) rdd was
 becoming bigger and bigger during computation…, so maybe sometime the
 request on that node was too bigger than available space.

 But, is there any way to avoid this kind of error? I am sure that the
 overall disk space of all nodes is enough for my application.

 Thanks in advance!



 Best,
 Yifan LI








Re: No space left on device??

2015-05-06 Thread Yifan LI
Yes, you are right. For now I have to say the workload/executors are distributed
evenly… so, as you said, it is difficult to improve the situation.

However, do you have any idea of how to produce a *skewed* data/executor distribution?



Best,
Yifan LI





 On 06 May 2015, at 15:13, Saisai Shao sai.sai.s...@gmail.com wrote:
 
 I think it depends on your workload and executor distribution, if your 
 workload is evenly distributed without any big data skew, and executors are 
 evenly distributed on each nodes, the storage usage of each node is nearly 
 the same. Spark itself cannot rebalance the storage overhead as you mentioned.
 
 2015-05-06 21:09 GMT+08:00 Yifan LI iamyifa...@gmail.com 
 mailto:iamyifa...@gmail.com:
 Thanks, Shao. :-)
 
 I am wondering if the spark will rebalance the storage overhead in 
 runtime…since still there is some available space on other nodes.
 
 
 Best,
 Yifan LI
 
 
 
 
 
 On 06 May 2015, at 14:57, Saisai Shao sai.sai.s...@gmail.com 
 mailto:sai.sai.s...@gmail.com wrote:
 
 I think you could configure multiple disks through spark.local.dir, default 
 is /tmp. Anyway if your intermediate data is larger than available disk 
 space, still will meet this issue.
 
 spark.local.dir (default: /tmp): Directory to use for scratch space in Spark,
 including map output files and RDDs that get stored on disk. This should be
 on a fast, local disk in your system. It can also be a comma-separated list
 of multiple directories on different disks. NOTE: In Spark 1.0 and later
 this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS
 (YARN) environment variables set by the cluster manager.
 
 2015-05-06 20:35 GMT+08:00 Yifan LI iamyifa...@gmail.com 
 mailto:iamyifa...@gmail.com:
 Hi,
 
 I am running my graphx application on Spark, but it failed since there is an 
 error on one executor node(on which available hdfs space is small) that “no 
 space left on device”.
 
 I can understand why it happened, because my vertex(-attribute) rdd was 
 becoming bigger and bigger during computation…, so maybe sometime the 
 request on that node was too bigger than available space.
 
 But, is there any way to avoid this kind of error? I am sure that the 
 overall disk space of all nodes is enough for my application.
 
 Thanks in advance!
 
 
 
 Best,
 Yifan LI
 
 
 
 
 
 
 
 



Re: No space left on device??

2015-05-06 Thread Saisai Shao
I think it depends on your workload and executor distribution. If your
workload is evenly distributed without any big data skew, and executors are
evenly distributed across the nodes, the storage usage of each node is nearly
the same. Spark itself cannot rebalance the storage overhead as you
mentioned.

2015-05-06 21:09 GMT+08:00 Yifan LI iamyifa...@gmail.com:

 Thanks, Shao. :-)

 I am wondering if the spark will rebalance the storage overhead in
 runtime…since still there is some available space on other nodes.


 Best,
 Yifan LI





 On 06 May 2015, at 14:57, Saisai Shao sai.sai.s...@gmail.com wrote:

 I think you could configure multiple disks through spark.local.dir,
 default is /tmp. Anyway if your intermediate data is larger than available
 disk space, still will meet this issue.

 spark.local.dir (default: /tmp): Directory to use for scratch space in Spark,
 including map output files and RDDs that get stored on disk. This should be
 on a fast, local disk in your system. It can also be a comma-separated list
 of multiple directories on different disks. NOTE: In Spark 1.0 and later
 this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or
 LOCAL_DIRS (YARN) environment variables set by the cluster manager.

 2015-05-06 20:35 GMT+08:00 Yifan LI iamyifa...@gmail.com:

 Hi,

 I am running my graphx application on Spark, but it failed since there is
 an error on one executor node(on which available hdfs space is small) that
 “no space left on device”.

 I can understand why it happened, because my vertex(-attribute) rdd was
 becoming bigger and bigger during computation…, so maybe sometime the
 request on that node was too bigger than available space.

 But, is there any way to avoid this kind of error? I am sure that the
 overall disk space of all nodes is enough for my application.

 Thanks in advance!



 Best,
 Yifan LI










Re: java.io.IOException: No space left on device while doing repartitioning in Spark

2015-05-05 Thread Akhil Das
It could be filling up your /tmp directory. You need to point
spark.local.dir (or, alternatively, SPARK_WORKER_DIR) to another
location that has sufficient space.
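
As a rough sketch of where that space goes (paths and partition counts below are
assumptions): repartition(n) always performs a full shuffle, so its intermediate
files land under spark.local.dir before the new partitions are built, whereas
coalesce(n) without shuffle only merges existing partitions and writes no shuffle
files:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("RepartitionSketch")
        .set("spark.local.dir", "/mnt/spark"))   # scratch dir with enough space
sc = SparkContext(conf=conf)

rdd = sc.textFile("/data/input")        # hypothetical input path
wide = rdd.repartition(200)             # full shuffle: spills to spark.local.dir
narrow = rdd.coalesce(8)                # no shuffle: no scratch files for this step
narrow.saveAsTextFile("/data/output")   # hypothetical output path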

Thanks
Best Regards

On Mon, May 4, 2015 at 7:27 PM, shahab shahab.mok...@gmail.com wrote:

 Hi,

 I am getting No space left on device exception when doing repartitioning
  of approx. 285 MB of data  while these is still 2 GB space left ??

 does it mean that repartitioning needs more space (more than 2 GB) for
 repartitioning of 285 MB of data ??

 best,
 /Shahab

 java.io.IOException: No space left on device
   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
   at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
   at sun.nio.ch.IOUtil.write(IOUtil.java:51)
   at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:205)
   at 
 sun.nio.ch.FileChannelImpl.transferToTrustedChannel(FileChannelImpl.java:473)
   at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:569)
   at org.apache.spark.util.Utils$.copyStream(Utils.scala:331)
   at 
 org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:730)
   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
   at 
 org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:728)
   at 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)




java.io.IOException: No space left on device while doing repartitioning in Spark

2015-05-04 Thread shahab
Hi,

I am getting a No space left on device exception when doing repartitioning
of approx. 285 MB of data, while there is still 2 GB of space left.

Does it mean that repartitioning 285 MB of data needs more space (more than
2 GB)?

best,
/Shahab

java.io.IOException: No space left on device
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:51)
at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:205)
at 
sun.nio.ch.FileChannelImpl.transferToTrustedChannel(FileChannelImpl.java:473)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:569)
at org.apache.spark.util.Utils$.copyStream(Utils.scala:331)
at 
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:730)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at 
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:728)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


Re: java.io.IOException: No space left on device while doing repartitioning in Spark

2015-05-04 Thread Ted Yu
See
https://wiki.gentoo.org/wiki/Knowledge_Base:No_space_left_on_device_while_there_is_plenty_of_space_available

What's the value for spark.local.dir property ?

Cheers

On Mon, May 4, 2015 at 6:57 AM, shahab shahab.mok...@gmail.com wrote:

 Hi,

 I am getting No space left on device exception when doing repartitioning
  of approx. 285 MB of data  while these is still 2 GB space left ??

 does it mean that repartitioning needs more space (more than 2 GB) for
 repartitioning of 285 MB of data ??

 best,
 /Shahab

 java.io.IOException: No space left on device
   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
   at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
   at sun.nio.ch.IOUtil.write(IOUtil.java:51)
   at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:205)
   at 
 sun.nio.ch.FileChannelImpl.transferToTrustedChannel(FileChannelImpl.java:473)
   at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:569)
   at org.apache.spark.util.Utils$.copyStream(Utils.scala:331)
   at 
 org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:730)
   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
   at 
 org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:728)
   at 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)




Re: java.io.IOException: No space left on device

2015-04-29 Thread Dean Wampler
Or multiple volumes. The LOCAL_DIRS (YARN) and SPARK_LOCAL_DIRS (Mesos,
Standalone) environment variables and the spark.local.dir property control
where temporary data is written. The default is /tmp.

See
http://spark.apache.org/docs/latest/configuration.html#runtime-environment
for more details.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Wed, Apr 29, 2015 at 6:19 AM, Anshul Singhle ans...@betaglide.com
wrote:

 Do you have multiple disks? Maybe your work directory is not in the right
 disk?

 On Wed, Apr 29, 2015 at 4:43 PM, Selim Namsi selim.na...@gmail.com
 wrote:

 Hi,

 I'm using spark (1.3.1) MLlib to run random forest algorithm on tfidf
 output,the training data is a file containing 156060 (size 8.1M).

 The problem is that when trying to presist a partition into memory and
 there
 is not enought memory, the partition is persisted on disk and despite
 Having
 229G of free disk space, I got  No space left on device..

 This is how I'm running the program :

 ./spark-submit --class com.custom.sentimentAnalysis.MainPipeline --master
 local[2] --driver-memory 5g ml_pipeline.jar labeledTrainData.tsv
 testData.tsv

 And this is a part of the log:



 If you need more informations, please let me know.
 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/java-io-IOException-No-space-left-on-device-tp22702.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: java.io.IOException: No space left on device

2015-04-29 Thread Dean Wampler
Makes sense. / is where /tmp would be. However, 230G should be plenty of
space. If you have INFO logging turned on (set in
$SPARK_HOME/conf/log4j.properties), you'll see messages about saving data
to disk that will list sizes. The web console also has some summary
information about this.

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Wed, Apr 29, 2015 at 6:25 AM, selim namsi selim.na...@gmail.com wrote:

 This is the output of df -h so as you can see I'm using only one disk
 mounted on /

 df -h
 Filesystem  Size  Used Avail Use% Mounted on
 /dev/sda8       276G   34G  229G  13% /
 none            4.0K     0  4.0K   0% /sys/fs/cgroup
 udev            7.8G  4.0K  7.8G   1% /dev
 tmpfs           1.6G  1.4M  1.6G   1% /run
 none            5.0M     0  5.0M   0% /run/lock
 none            7.8G   37M  7.8G   1% /run/shm
 none            100M   40K  100M   1% /run/user
 /dev/sda1       496M   55M  442M  11% /boot/efi

 Also when running the program, I noticed that the Used% disk space related
 to the partition mounted on / was growing very fast

 On Wed, Apr 29, 2015 at 12:19 PM Anshul Singhle ans...@betaglide.com
 wrote:

 Do you have multiple disks? Maybe your work directory is not in the right
 disk?

 On Wed, Apr 29, 2015 at 4:43 PM, Selim Namsi selim.na...@gmail.com
 wrote:

 Hi,

 I'm using spark (1.3.1) MLlib to run random forest algorithm on tfidf
 output,the training data is a file containing 156060 (size 8.1M).

 The problem is that when trying to presist a partition into memory and
 there
 is not enought memory, the partition is persisted on disk and despite
 Having
 229G of free disk space, I got  No space left on device..

 This is how I'm running the program :

 ./spark-submit --class com.custom.sentimentAnalysis.MainPipeline --master
 local[2] --driver-memory 5g ml_pipeline.jar labeledTrainData.tsv
 testData.tsv

 And this is a part of the log:



 If you need more informations, please let me know.
 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/java-io-IOException-No-space-left-on-device-tp22702.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




java.io.IOException: No space left on device

2015-04-29 Thread Selim Namsi
Hi,

I'm using Spark (1.3.1) MLlib to run the random forest algorithm on TF-IDF
output; the training data is a file containing 156060 (size 8.1M).

The problem is that when trying to persist a partition into memory and there
is not enough memory, the partition is persisted on disk, and despite having
229G of free disk space, I got a No space left on device error.

This is how I'm running the program : 

./spark-submit --class com.custom.sentimentAnalysis.MainPipeline --master
local[2] --driver-memory 5g ml_pipeline.jar labeledTrainData.tsv
testData.tsv

And this is a part of the log:



If you need more information, please let me know.
Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-io-IOException-No-space-left-on-device-tp22702.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: java.io.IOException: No space left on device

2015-04-29 Thread Anshul Singhle
Do you have multiple disks? Maybe your work directory is not on the right
disk?

On Wed, Apr 29, 2015 at 4:43 PM, Selim Namsi selim.na...@gmail.com wrote:

 Hi,

 I'm using spark (1.3.1) MLlib to run random forest algorithm on tfidf
 output,the training data is a file containing 156060 (size 8.1M).

 The problem is that when trying to presist a partition into memory and
 there
 is not enought memory, the partition is persisted on disk and despite
 Having
 229G of free disk space, I got  No space left on device..

 This is how I'm running the program :

 ./spark-submit --class com.custom.sentimentAnalysis.MainPipeline --master
 local[2] --driver-memory 5g ml_pipeline.jar labeledTrainData.tsv
 testData.tsv

 And this is a part of the log:



 If you need more informations, please let me know.
 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/java-io-IOException-No-space-left-on-device-tp22702.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: java.io.IOException: No space left on device

2015-04-29 Thread selim namsi
This is the output of df -h so as you can see I'm using only one disk
mounted on /

df -h
Filesystem  Size  Used Avail Use% Mounted on
/dev/sda8       276G   34G  229G  13% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            7.8G  4.0K  7.8G   1% /dev
tmpfs           1.6G  1.4M  1.6G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            7.8G   37M  7.8G   1% /run/shm
none            100M   40K  100M   1% /run/user
/dev/sda1       496M   55M  442M  11% /boot/efi

Also when running the program, I noticed that the Used% disk space related
to the partition mounted on / was growing very fast

On Wed, Apr 29, 2015 at 12:19 PM Anshul Singhle ans...@betaglide.com
wrote:

 Do you have multiple disks? Maybe your work directory is not in the right
 disk?

 On Wed, Apr 29, 2015 at 4:43 PM, Selim Namsi selim.na...@gmail.com
 wrote:

 Hi,

 I'm using spark (1.3.1) MLlib to run random forest algorithm on tfidf
 output,the training data is a file containing 156060 (size 8.1M).

 The problem is that when trying to presist a partition into memory and
 there
 is not enought memory, the partition is persisted on disk and despite
 Having
 229G of free disk space, I got  No space left on device..

 This is how I'm running the program :

 ./spark-submit --class com.custom.sentimentAnalysis.MainPipeline --master
 local[2] --driver-memory 5g ml_pipeline.jar labeledTrainData.tsv
 testData.tsv

 And this is a part of the log:



 If you need more informations, please let me know.
 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/java-io-IOException-No-space-left-on-device-tp22702.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: java.io.IOException: No space left on device

2015-04-29 Thread selim namsi
Sorry, I included the log messages when creating the thread at
http://apache-spark-user-list.1001560.n3.nabble.com/java-io-IOException-No-space-left-on-device-td22702.html
but I forgot that raw messages will not be sent along in the emails.

So this is the log related to the error :

15/04/29 02:48:50 INFO CacheManager: Partition rdd_19_0 not found, computing it
15/04/29 02:48:50 INFO BlockManager: Found block rdd_15_0 locally
15/04/29 02:48:50 INFO CacheManager: Partition rdd_19_1 not found, computing it
15/04/29 02:48:50 INFO BlockManager: Found block rdd_15_1 locally
15/04/29 02:49:13 WARN MemoryStore: Not enough space to cache rdd_19_1
in memory! (computed 1106.0 MB so far)
15/04/29 02:49:13 INFO MemoryStore: Memory use = 234.0 MB (blocks) +
2.6 GB (scratch space shared across 2 thread(s)) = 2.9 GB. Storage
limit = 3.1 GB.
15/04/29 02:49:13 WARN CacheManager: Persisting partition rdd_19_1 to
disk instead.
15/04/29 02:49:28 WARN MemoryStore: Not enough space to cache rdd_19_0
in memory! (computed 1745.7 MB so far)
15/04/29 02:49:28 INFO MemoryStore: Memory use = 234.0 MB (blocks) +
2.6 GB (scratch space shared across 2 thread(s)) = 2.9 GB. Storage
limit = 3.1 GB.
15/04/29 02:49:28 WARN CacheManager: Persisting partition rdd_19_0 to
disk instead.
15/04/29 03:56:12 WARN BlockManager: Putting block rdd_19_0 failed
15/04/29 03:56:12 WARN BlockManager: Putting block rdd_19_1 failed
15/04/29 03:56:12 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 7)
java.io.IOException: No space left on device

It seems that the partitions rdd_19_0 and rdd_19_1 together need more than
2.9 GB.
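
For reference, a minimal sketch of the caching pattern behind the log lines above
(the input path and storage level are assumptions): when a partition cached with a
disk-capable storage level does not fit in memory, it is written under
spark.local.dir, which is why the default /tmp location can fill up even though the
job only "caches to memory":

from pyspark import SparkConf, SparkContext, StorageLevel

conf = (SparkConf()
        .setAppName("PersistSketch")
        .set("spark.local.dir", "/mnt/spark"))   # hypothetical scratch dir
sc = SparkContext(conf=conf)

features = sc.textFile("/data/tfidf").map(lambda line: line.split())
features.persist(StorageLevel.MEMORY_AND_DISK)
features.count()   # materializes the cache; overflow partitions go to disk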

Thanks


On Wed, Apr 29, 2015 at 12:34 PM Dean Wampler deanwamp...@gmail.com wrote:

 Makes sense. / is where /tmp would be. However, 230G should be plenty of
 space. If you have INFO logging turned on (set in
 $SPARK_HOME/conf/log4j.properties), you'll see messages about saving data
 to disk that will list sizes. The web console also has some summary
 information about this.

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Wed, Apr 29, 2015 at 6:25 AM, selim namsi selim.na...@gmail.com
 wrote:

 This is the output of df -h so as you can see I'm using only one disk
 mounted on /

 df -h
 Filesystem  Size  Used Avail Use% Mounted on
  /dev/sda8       276G   34G  229G  13% /
  none            4.0K     0  4.0K   0% /sys/fs/cgroup
  udev            7.8G  4.0K  7.8G   1% /dev
  tmpfs           1.6G  1.4M  1.6G   1% /run
  none            5.0M     0  5.0M   0% /run/lock
  none            7.8G   37M  7.8G   1% /run/shm
  none            100M   40K  100M   1% /run/user
  /dev/sda1       496M   55M  442M  11% /boot/efi

 Also when running the program, I noticed that the Used% disk space
 related to the partition mounted on / was growing very fast

 On Wed, Apr 29, 2015 at 12:19 PM Anshul Singhle ans...@betaglide.com
 wrote:

 Do you have multiple disks? Maybe your work directory is not in the
 right disk?

 On Wed, Apr 29, 2015 at 4:43 PM, Selim Namsi selim.na...@gmail.com
 wrote:

 Hi,

 I'm using spark (1.3.1) MLlib to run random forest algorithm on tfidf
 output,the training data is a file containing 156060 (size 8.1M).

 The problem is that when trying to presist a partition into memory and
 there
 is not enought memory, the partition is persisted on disk and despite
 Having
 229G of free disk space, I got  No space left on device..

 This is how I'm running the program :

 ./spark-submit --class com.custom.sentimentAnalysis.MainPipeline
 --master
 local[2] --driver-memory 5g ml_pipeline.jar labeledTrainData.tsv
 testData.tsv

 And this is a part of the log:



 If you need more informations, please let me know.
 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/java-io-IOException-No-space-left-on-device-tp22702.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





FileNotFoundException (No space left on device) writing to S3

2014-08-27 Thread Daniil Osipov
Hello,

I've been seeing the following errors when trying to save to S3:

Exception in thread main org.apache.spark.SparkException: Job aborted due
to stage failure: Task 4058 in stage 2.1 failed 4 times, most recent failure:
Lost task 4058.3 in stage 2.1 (TID 12572, ip-10-81-151-40.ec2.internal):
java.io.FileNotFoundException:
/mnt/spark/spark-local-20140827191008-05ae/0c/shuffle_1_7570_5768 (No space left on
device)
java.io.FileOutputStream.open(Native Method)
java.io.FileOutputStream.init(FileOutputStream.java:221)
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:107)
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:175)
org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

df tells me there is plenty of space left on the worker node:
[root@ip-10-81-151-40 ~]$ df -h
FilesystemSize  Used Avail Use% Mounted on
/dev/xvda17.9G  4.6G  3.3G  59% /
tmpfs 7.4G 0  7.4G   0% /dev/shm
/dev/xvdb  37G   11G   25G  30% /mnt
/dev/xvdf  37G  9.5G   26G  27% /mnt2

Any suggestions?
Dan


No space left on device

2014-08-09 Thread kmatzen
I need some configuration / debugging recommendations to work around no
space left on device.  I am completely new to Spark, but I have some
experience with Hadoop.

I have a task where I read images stored in sequence files from s3://,
process them with a map in scala, and write the result back to s3://.  I
have about 15 r3.8xlarge instances allocated with the included ec2 script. 
The input data is about 1.5 TB and I expect the output to be similarly
sized.  15 r3.8xlarge instances give me about 3 TB of RAM and 9 TB of
storage, so hopefully more than enough for this task.

What happens is that it takes about an hour to read in the input from S3. 
Once that is complete, then it begins to process the images and several
succeed.  However, quickly, the job fails with no space left on device. 
By the time I can ssh into one of the machines that reported the error, temp
files have already been cleaned up.  I don't see any more detailed messages
in the slave logs.  I have not yet changed the logging configuration from
the default.

The S3 input and output are cached in /mnt/ephemeral-hdfs/s3 and
/mnt2/ephemeral-hdfs/s3 (I see mostly input files at the time of failure,
but maybe 1 output file per slave).  Shuffle files are generated in
/mnt/spark/something and /mnt2/spark/something (they were cleaned up
once the job failed and I don't remember the directory that I saw while it
was still running).  I checked the disk utilization for a few slaves while
running the pipeline and they were pretty far away from being full.  But the
failure probably came from a slave that was overloaded from a shard
imbalance (but why would that happen on read - map - write?).

What other things might I need to configure to prevent this error?  What
logging options do people recommend?  Is there an easy way to diagnose spark
failures from the web interface like with Hadoop?

I need to do some more testing to make sure I'm not emitting a giant image
for a malformed input image, but I figured I'd post this question early in
case anyone had any recommendations.

BTW, why does a map-only job need to shuffle?  I was expecting it to
pipeline the transfer in from S3 operation, the actual computation
operation, and the transfer back out to S3 operation rather than doing
everything serially with a giant disk footprint.  Actually, I was thinking
it would fuse all three operations into a single stage.  Is that not what
Spark does?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/No-space-left-on-device-tp11829.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: No space left on device

2014-08-09 Thread Matei Zaharia
Your map-only job should not be shuffling, but if you want to see what's 
running, look at the web UI at http://driver:4040. In fact the job should not 
even write stuff to disk except inasmuch as the Hadoop S3 library might build 
up blocks locally before sending them on.

My guess is that it's not /mnt or /mnt2 that get filled, but the root volume, 
/, either with logs or with temp files created by the Hadoop S3 library. You 
can check this by running df while the job is executing. (Tools like Ganglia 
can probably also log this.) If it is the logs, you can symlink the spark/logs 
directory to someplace on /mnt instead. If it's /tmp, you can set 
java.io.tmpdir to another directory in Spark's JVM options.
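
A hedged sketch of that last suggestion (the target directory is an assumption and
must exist on every worker); note that scratch space for shuffles and cached blocks
is still governed by spark.local.dir, so both are set here:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("TmpDirSketch")
        .set("spark.local.dir", "/mnt/spark,/mnt2/spark")
        # Point the executors' java.io.tmpdir away from the small root volume.
        # For the driver JVM, pass the same flag via spark-submit --driver-java-options.
        .set("spark.executor.extraJavaOptions", "-Djava.io.tmpdir=/mnt/tmp"))
sc = SparkContext(conf=conf)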

Matei

On August 8, 2014 at 11:02:48 PM, kmatzen (kmat...@gmail.com) wrote:

I need some configuration / debugging recommendations to work around no 
space left on device. I am completely new to Spark, but I have some 
experience with Hadoop. 

I have a task where I read images stored in sequence files from s3://, 
process them with a map in scala, and write the result back to s3://. I 
have about 15 r3.8xlarge instances allocated with the included ec2 script. 
The input data is about 1.5 TB and I expect the output to be similarly 
sized. 15 r3.8xlarge instances give me about 3 TB of RAM and 9 TB of 
storage, so hopefully more than enough for this task. 

What happens is that it takes about an hour to read in the input from S3. 
Once that is complete, then it begins to process the images and several 
succeed. However, quickly, the job fails with no space left on device. 
By time I can ssh into one of the machines that reported the error, temp 
files have already been cleaned up. I don't see any more detailed messages 
in the slave logs. I have not yet changed the logging configuration from 
the default. 

The S3 input and output are cached in /mnt/ephemeral-hdfs/s3 and 
/mnt2/ephemeral-hdfs/s3 (I see mostly input files at the time of failure, 
but maybe 1 output file per slave). Shuffle files are generated in 
/mnt/spark/something and /mnt2/spark/something (they were cleaned up 
once the job failed and I don't remember the directory that I saw while it 
was still running). I checked the disk utilization for a few slaves while 
running the pipeline and they were pretty far away from being full. But the 
failure probably came from a slave that was overloaded from a shard 
imbalance (but why would that happen on read - map - write?). 

What other things might I need to configure to prevent this error? What 
logging options do people recommend? Is there an easy way to diagnose spark 
failures from the web interface like with Hadoop? 

I need to do some more testing to make sure I'm not emitting a giant image 
for a malformed input image, but I figured I'd post this question early in 
case anyone had any recommendations. 

BTW, why does a map-only job need to shuffle? I was expecting it to 
pipeline the transfer in from S3 operation, the actual computation 
operation, and the transfer back out to S3 operation rather than doing 
everything serially with a giant disk footprint. Actually, I was thinking 
it would fuse all three operations into a single stage. Is that not what 
Spark does? 





-- 
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/No-space-left-on-device-tp11829.html
 
Sent from the Apache Spark User List mailing list archive at Nabble.com. 

- 
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
For additional commands, e-mail: user-h...@spark.apache.org 



Re: No space left on device

2014-08-09 Thread Jim Donahue
Root partitions on AWS instances tend to be small (for example, an m1.large 
instance has 2 420 GB drives, but only a 10 GB root partition).  Matei's 
probably right on about this - just need to be careful where things like the 
logs get stored.

From: Matei Zaharia matei.zaha...@gmail.com
Date: Saturday, August 9, 2014 at 1:48 PM
To: u...@spark.incubator.apache.org, kmatzen kmat...@gmail.com
Subject: Re: No space left on device

Your map-only job should not be shuffling, but if you want to see what's 
running, look at the web UI at http://driver:4040. In fact the job should not 
even write stuff to disk except inasmuch as the Hadoop S3 library might build 
up blocks locally before sending them on.

My guess is that it's not /mnt or /mnt2 that get filled, but the root volume, 
/, either with logs or with temp files created by the Hadoop S3 library. You 
can check this by running df while the job is executing. (Tools like Ganglia 
can probably also log this.) If it is the logs, you can symlink the spark/logs 
directory to someplace on /mnt instead. If it's /tmp, you can set 
java.io.tmpdir to another directory in Spark's JVM options.

Matei


On August 8, 2014 at 11:02:48 PM, kmatzen (kmat...@gmail.com) wrote:

I need some configuration / debugging recommendations to work around no
space left on device. I am completely new to Spark, but I have some
experience with Hadoop.

I have a task where I read images stored in sequence files from s3://,
process them with a map in scala, and write the result back to s3://. I
have about 15 r3.8xlarge instances allocated with the included ec2 script.
The input data is about 1.5 TB and I expect the output to be similarly
sized. 15 r3.8xlarge instances give me about 3 TB of RAM and 9 TB of
storage, so hopefully more than enough for this task.

What happens is that it takes about an hour to read in the input from S3.
Once that is complete, then it begins to process the images and several
succeed. However, quickly, the job fails with no space left on device.
By time I can ssh into one of the machines that reported the error, temp
files have already been cleaned up. I don't see any more detailed messages
in the slave logs. I have not yet changed the logging configuration from
the default.

The S3 input and output are cached in /mnt/ephemeral-hdfs/s3 and
/mnt2/ephemeral-hdfs/s3 (I see mostly input files at the time of failure,
but maybe 1 output file per slave). Shuffle files are generated in
/mnt/spark/something and /mnt2/spark/something (they were cleaned up
once the job failed and I don't remember the directory that I saw while it
was still running). I checked the disk utilization for a few slaves while
running the pipeline and they were pretty far away from being full. But the
failure probably came from a slave that was overloaded from a shard
imbalance (but why would that happen on read - map - write?).

What other things might I need to configure to prevent this error? What
logging options do people recommend? Is there an easy way to diagnose spark
failures from the web interface like with Hadoop?

I need to do some more testing to make sure I'm not emitting a giant image
for a malformed input image, but I figured I'd post this question early in
case anyone had any recommendations.

BTW, why does a map-only job need to shuffle? I was expecting it to
pipeline the transfer in from S3 operation, the actual computation
operation, and the transfer back out to S3 operation rather than doing
everything serially with a giant disk footprint. Actually, I was thinking
it would fuse all three operations into a single stage. Is that not what
Spark does?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/No-space-left-on-device-tp11829.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Error: No space left on device

2014-07-17 Thread Chris DuBois
/spark2 as the local.dir by default,
 I
   would recommend leaving this setting as the default value.
  
   Best,
   Xiangrui
  
   On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois
   chris.dub...@gmail.com
   wrote:
Thanks for the quick responses!
   
I used your final -Dspark.local.dir suggestion, but I see this
during
the
initialization of the application:
   
14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local
directory at
/vol/spark-local-20140716065608-7b2a
   
I would have expected something in /mnt/spark/.
   
Thanks,
Chris
   
   
   
On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore cdg...@cdgore.com
wrote:
   
Hi Chris,
   
I've encountered this error when running Spark’s ALS methods too.
In
my
case, it was because I set spark.local.dir improperly, and every
time
there
was a shuffle, it would spill many GB of data onto the local
 drive.
What
fixed it was setting it to use the /mnt directory, where a
 network
drive is
mounted.  For example, setting an environmental variable:
   
export SPACE=$(mount | grep mnt | awk '{print $3/spark/}' |
 xargs
|
sed
's/ /,/g’)
   
Then adding -Dspark.local.dir=$SPACE or simply
-Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your
 driver
application
   
Chris
   
On Jul 15, 2014, at 11:39 PM, Xiangrui Meng men...@gmail.com
wrote:
   
 Check the number of inodes (df -i). The assembly build may
 create
 many
 small files. -Xiangrui

 On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois
 chris.dub...@gmail.com
 wrote:
 Hi all,

 I am encountering the following error:

 INFO scheduler.TaskSetManager: Loss was due to
 java.io.IOException:
 No
 space
 left on device [duplicate 4]

 For each slave, df -h looks roughtly like this, which makes
 the
 above
 error
 surprising.

 FilesystemSize  Used Avail Use% Mounted on
 /dev/xvda17.9G  4.4G  3.5G  57% /
 tmpfs 7.4G  4.0K  7.4G   1% /dev/shm
 /dev/xvdb  37G  3.3G   32G  10% /mnt
 /dev/xvdf  37G  2.0G   34G   6% /mnt2
 /dev/xvdv 500G   33M  500G   1% /vol

 I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched
 using
 the
 spark-ec2 scripts and a clone of spark from today. The job I
 am
 running
 closely resembles the collaborative filtering example. This
 issue
 happens
 with the 1M version as well as the 10 million rating version
 of
 the
 MovieLens dataset.

 I have seen previous questions, but they haven't helped yet.
 For
 example, I
 tried setting the Spark tmp directory to the EBS volume at
 /vol/,
 both
 by
 editing the spark conf file (and copy-dir'ing it to the
 slaves)
 as
 well
 as
 through the SparkConf. Yet I still get the above error. Here
 is
 my
 current
 Spark config below. Note that I'm launching via
 ~/spark/bin/spark-submit.

 conf = SparkConf()
 conf.setAppName(RecommendALS).set(spark.local.dir,
 /vol/).set(spark.executor.memory,
 7g).set(spark.akka.frameSize,
 100).setExecutorEnv(SPARK_JAVA_OPTS, 
 -Dspark.akka.frameSize=100)
 sc = SparkContext(conf=conf)

 Thanks for any advice,
 Chris

   
   
  
  
  
 
 



Re: Error: No space left on device

2014-07-17 Thread Bill Jay
 by default,
 I
   would recommend leaving this setting as the default value.
  
   Best,
   Xiangrui
  
   On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois
   chris.dub...@gmail.com
   wrote:
Thanks for the quick responses!
   
I used your final -Dspark.local.dir suggestion, but I see this
during
the
initialization of the application:
   
14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local
directory at
/vol/spark-local-20140716065608-7b2a
   
I would have expected something in /mnt/spark/.
   
Thanks,
Chris
   
   
   
On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore cdg...@cdgore.com
wrote:
   
Hi Chris,
   
I've encountered this error when running Spark’s ALS methods too.
In
my
case, it was because I set spark.local.dir improperly, and every
time
there
was a shuffle, it would spill many GB of data onto the local
 drive.
What
fixed it was setting it to use the /mnt directory, where a
 network
drive is
mounted.  For example, setting an environmental variable:
   
export SPACE=$(mount | grep mnt | awk '{print $3/spark/}' |
 xargs
|
sed
's/ /,/g’)
   
Then adding -Dspark.local.dir=$SPACE or simply
-Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your
 driver
application
   
Chris
   
On Jul 15, 2014, at 11:39 PM, Xiangrui Meng men...@gmail.com
wrote:
   
 Check the number of inodes (df -i). The assembly build may
 create
 many
 small files. -Xiangrui

 On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois
 chris.dub...@gmail.com
 wrote:
 Hi all,

 I am encountering the following error:

 INFO scheduler.TaskSetManager: Loss was due to
 java.io.IOException:
 No
 space
 left on device [duplicate 4]

 For each slave, df -h looks roughtly like this, which makes
 the
 above
 error
 surprising.

 FilesystemSize  Used Avail Use% Mounted on
 /dev/xvda17.9G  4.4G  3.5G  57% /
 tmpfs 7.4G  4.0K  7.4G   1% /dev/shm
 /dev/xvdb  37G  3.3G   32G  10% /mnt
 /dev/xvdf  37G  2.0G   34G   6% /mnt2
 /dev/xvdv 500G   33M  500G   1% /vol

 I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched
 using
 the
 spark-ec2 scripts and a clone of spark from today. The job I
 am
 running
 closely resembles the collaborative filtering example. This
 issue
 happens
 with the 1M version as well as the 10 million rating version
 of
 the
 MovieLens dataset.

 I have seen previous questions, but they haven't helped yet.
 For
 example, I
 tried setting the Spark tmp directory to the EBS volume at
 /vol/,
 both
 by
 editing the spark conf file (and copy-dir'ing it to the
 slaves)
 as
 well
 as
 through the SparkConf. Yet I still get the above error. Here
 is
 my
 current
 Spark config below. Note that I'm launching via
 ~/spark/bin/spark-submit.

 conf = SparkConf()
 conf.setAppName(RecommendALS).set(spark.local.dir,
 /vol/).set(spark.executor.memory,
 7g).set(spark.akka.frameSize,
 100).setExecutorEnv(SPARK_JAVA_OPTS, 
 -Dspark.akka.frameSize=100)
 sc = SparkContext(conf=conf)

 Thanks for any advice,
 Chris

   
   
  
  
  
 
 



Error: No space left on device

2014-07-16 Thread Chris DuBois
Hi all,

I am encountering the following error:

INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No
space left on device [duplicate 4]

For each slave, df -h looks roughly like this, which makes the above error
surprising.

FilesystemSize  Used Avail Use% Mounted on
/dev/xvda17.9G  4.4G  3.5G  57% /
tmpfs 7.4G  4.0K  7.4G   1% /dev/shm
/dev/xvdb  37G  3.3G   32G  10% /mnt
/dev/xvdf  37G  2.0G   34G   6% /mnt2
/dev/xvdv 500G   33M  500G   1% /vol

I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the
spark-ec2 scripts and a clone of spark from today. The job I am running
closely resembles the collaborative filtering example
https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html.
This issue happens with the 1M version as well as the 10 million rating
version of the MovieLens dataset.

I have seen previous questions
(http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3c532f5aec.8060...@nengoiksvelzud.com%3E,
https://groups.google.com/forum/#!msg/spark-users/Axx4optAj-E/q5lWMv-ZqnwJ),
but they haven't helped yet. For example, I tried setting the Spark tmp
directory to the EBS volume at /vol/, both by editing the spark conf file
(and copy-dir'ing it to the slaves) as well as through the SparkConf. Yet I
still get the above error. Here is my current Spark config below. Note that
I'm launching via ~/spark/bin/spark-submit.

conf = SparkConf()
conf.setAppName("RecommendALS").set("spark.local.dir",
"/vol/").set("spark.executor.memory", "7g").set("spark.akka.frameSize",
"100").setExecutorEnv("SPARK_JAVA_OPTS", " -Dspark.akka.frameSize=100")
sc = SparkContext(conf=conf)

Thanks for any advice,
Chris


Re: Error: No space left on device

2014-07-16 Thread Xiangrui Meng
Check the number of inodes (df -i). The assembly build may create many
small files. -Xiangrui
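
A small helper for checking both at once, since "No space left on device" can mean
either free blocks or free inodes are exhausted (the path below is an assumption;
point it at whatever spark.local.dir resolves to on the node):

import os

def scratch_space_report(path="/mnt/spark"):
    # Equivalent of df -h plus df -i for a single directory.
    st = os.statvfs(path)
    free_gb = st.f_bavail * st.f_frsize / 1e9
    print("%s: %.1f GB free, %d inodes free" % (path, free_gb, st.f_favail))

scratch_space_report()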

On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois chris.dub...@gmail.com wrote:
 Hi all,

 I am encountering the following error:

 INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No space
 left on device [duplicate 4]

 For each slave, df -h looks roughtly like this, which makes the above error
 surprising.

 FilesystemSize  Used Avail Use% Mounted on
 /dev/xvda17.9G  4.4G  3.5G  57% /
 tmpfs 7.4G  4.0K  7.4G   1% /dev/shm
 /dev/xvdb  37G  3.3G   32G  10% /mnt
 /dev/xvdf  37G  2.0G   34G   6% /mnt2
 /dev/xvdv 500G   33M  500G   1% /vol

 I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the
 spark-ec2 scripts and a clone of spark from today. The job I am running
 closely resembles the collaborative filtering example. This issue happens
 with the 1M version as well as the 10 million rating version of the
 MovieLens dataset.

 I have seen previous questions, but they haven't helped yet. For example, I
 tried setting the Spark tmp directory to the EBS volume at /vol/, both by
 editing the spark conf file (and copy-dir'ing it to the slaves) as well as
 through the SparkConf. Yet I still get the above error. Here is my current
 Spark config below. Note that I'm launching via ~/spark/bin/spark-submit.

 conf = SparkConf()
 conf.setAppName(RecommendALS).set(spark.local.dir,
 /vol/).set(spark.executor.memory, 7g).set(spark.akka.frameSize,
 100).setExecutorEnv(SPARK_JAVA_OPTS,  -Dspark.akka.frameSize=100)
 sc = SparkContext(conf=conf)

 Thanks for any advice,
 Chris



Re: Error: No space left on device

2014-07-16 Thread Chris DuBois
df -i  # on a slave

FilesystemInodes   IUsed   IFree IUse% Mounted on
/dev/xvda1524288  277701  246587   53% /
tmpfs1917974   1 19179731% /dev/shm


On Tue, Jul 15, 2014 at 11:39 PM, Xiangrui Meng men...@gmail.com wrote:

 Check the number of inodes (df -i). The assembly build may create many
 small files. -Xiangrui

 On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois chris.dub...@gmail.com
 wrote:
  Hi all,
 
  I am encountering the following error:
 
  INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No
 space
  left on device [duplicate 4]
 
  For each slave, df -h looks roughtly like this, which makes the above
 error
  surprising.
 
  FilesystemSize  Used Avail Use% Mounted on
  /dev/xvda17.9G  4.4G  3.5G  57% /
  tmpfs 7.4G  4.0K  7.4G   1% /dev/shm
  /dev/xvdb  37G  3.3G   32G  10% /mnt
  /dev/xvdf  37G  2.0G   34G   6% /mnt2
  /dev/xvdv 500G   33M  500G   1% /vol
 
  I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the
  spark-ec2 scripts and a clone of spark from today. The job I am running
  closely resembles the collaborative filtering example. This issue happens
  with the 1M version as well as the 10 million rating version of the
  MovieLens dataset.
 
  I have seen previous questions, but they haven't helped yet. For
 example, I
  tried setting the Spark tmp directory to the EBS volume at /vol/, both by
  editing the spark conf file (and copy-dir'ing it to the slaves) as well
 as
  through the SparkConf. Yet I still get the above error. Here is my
 current
  Spark config below. Note that I'm launching via ~/spark/bin/spark-submit.
 
  conf = SparkConf()
  conf.setAppName(RecommendALS).set(spark.local.dir,
  /vol/).set(spark.executor.memory, 7g).set(spark.akka.frameSize,
  100).setExecutorEnv(SPARK_JAVA_OPTS,  -Dspark.akka.frameSize=100)
  sc = SparkContext(conf=conf)
 
  Thanks for any advice,
  Chris
 



Re: Error: No space left on device

2014-07-16 Thread Chris Gore
Hi Chris,

I've encountered this error when running Spark’s ALS methods too.  In my case, 
it was because I set spark.local.dir improperly, and every time there was a 
shuffle, it would spill many GB of data onto the local drive.  What fixed it 
was setting it to use the /mnt directory, where a network drive is mounted.  
For example, setting an environmental variable:

export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' | xargs | sed 's/ /,/g')

Then adding -Dspark.local.dir=$SPACE or simply 
-Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your driver application

Chris

On Jul 15, 2014, at 11:39 PM, Xiangrui Meng men...@gmail.com wrote:

 Check the number of inodes (df -i). The assembly build may create many
 small files. -Xiangrui
 
 On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois chris.dub...@gmail.com wrote:
 Hi all,
 
 I am encountering the following error:
 
 INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No space
 left on device [duplicate 4]
 
 For each slave, df -h looks roughtly like this, which makes the above error
 surprising.
 
 FilesystemSize  Used Avail Use% Mounted on
 /dev/xvda17.9G  4.4G  3.5G  57% /
 tmpfs 7.4G  4.0K  7.4G   1% /dev/shm
 /dev/xvdb  37G  3.3G   32G  10% /mnt
 /dev/xvdf  37G  2.0G   34G   6% /mnt2
 /dev/xvdv 500G   33M  500G   1% /vol
 
 I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the
 spark-ec2 scripts and a clone of spark from today. The job I am running
 closely resembles the collaborative filtering example. This issue happens
 with the 1M version as well as the 10 million rating version of the
 MovieLens dataset.
 
 I have seen previous questions, but they haven't helped yet. For example, I
 tried setting the Spark tmp directory to the EBS volume at /vol/, both by
 editing the spark conf file (and copy-dir'ing it to the slaves) as well as
 through the SparkConf. Yet I still get the above error. Here is my current
 Spark config below. Note that I'm launching via ~/spark/bin/spark-submit.
 
 conf = SparkConf()
 conf.setAppName(RecommendALS).set(spark.local.dir,
 /vol/).set(spark.executor.memory, 7g).set(spark.akka.frameSize,
 100).setExecutorEnv(SPARK_JAVA_OPTS,  -Dspark.akka.frameSize=100)
 sc = SparkContext(conf=conf)
 
 Thanks for any advice,
 Chris
 



Re: Error: No space left on device

2014-07-16 Thread Chris DuBois
Thanks for the quick responses!

I used your final -Dspark.local.dir suggestion, but I see this during the
initialization of the application:

14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local directory at
/vol/spark-local-20140716065608-7b2a

I would have expected something in /mnt/spark/.

Thanks,
Chris



On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore cdg...@cdgore.com wrote:

 Hi Chris,

 I've encountered this error when running Spark’s ALS methods too.  In my
 case, it was because I set spark.local.dir improperly, and every time there
 was a shuffle, it would spill many GB of data onto the local drive.  What
 fixed it was setting it to use the /mnt directory, where a network drive is
 mounted.  For example, setting an environmental variable:

 export SPACE=$(mount | grep mnt | awk '{print $3/spark/}' | xargs | sed
 's/ /,/g’)

 Then adding -Dspark.local.dir=$SPACE or simply
 -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your driver
 application

 Chris

 On Jul 15, 2014, at 11:39 PM, Xiangrui Meng men...@gmail.com wrote:

  Check the number of inodes (df -i). The assembly build may create many
  small files. -Xiangrui
 
  On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois chris.dub...@gmail.com
 wrote:
  Hi all,
 
  I am encountering the following error:
 
  INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No
 space
  left on device [duplicate 4]
 
  For each slave, df -h looks roughtly like this, which makes the above
 error
  surprising.
 
  FilesystemSize  Used Avail Use% Mounted on
  /dev/xvda17.9G  4.4G  3.5G  57% /
  tmpfs 7.4G  4.0K  7.4G   1% /dev/shm
  /dev/xvdb  37G  3.3G   32G  10% /mnt
  /dev/xvdf  37G  2.0G   34G   6% /mnt2
  /dev/xvdv 500G   33M  500G   1% /vol
 
  I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the
  spark-ec2 scripts and a clone of spark from today. The job I am running
  closely resembles the collaborative filtering example. This issue
 happens
  with the 1M version as well as the 10 million rating version of the
  MovieLens dataset.
 
  I have seen previous questions, but they haven't helped yet. For
 example, I
  tried setting the Spark tmp directory to the EBS volume at /vol/, both
 by
  editing the spark conf file (and copy-dir'ing it to the slaves) as well
 as
  through the SparkConf. Yet I still get the above error. Here is my
 current
  Spark config below. Note that I'm launching via
 ~/spark/bin/spark-submit.
 
  conf = SparkConf()
  conf.setAppName(RecommendALS).set(spark.local.dir,
  /vol/).set(spark.executor.memory, 7g).set(spark.akka.frameSize,
  100).setExecutorEnv(SPARK_JAVA_OPTS,  -Dspark.akka.frameSize=100)
  sc = SparkContext(conf=conf)
 
  Thanks for any advice,
  Chris
 




Re: Error: No space left on device

2014-07-16 Thread Xiangrui Meng
Hi Chris,

Could you also try `df -i` on the master node? How many
blocks/partitions did you set?

In the current implementation, ALS doesn't clean the shuffle data
because the operations are chained together. But it shouldn't run out
of disk space on the MovieLens dataset, which is small. spark-ec2
script sets /mnt/spark and /mnt/spark2 as the local.dir by default, I
would recommend leaving this setting as the default value.
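
One hedged mitigation for the chained-lineage issue in iterative jobs like ALS
(paths and parameters are assumptions, and whether ALS actually checkpoints
intermediate RDDs depends on the Spark version): setting a checkpoint directory
lets Spark cut the lineage periodically, after which the context cleaner can drop
shuffle files from earlier iterations instead of keeping them all on local disk:

from pyspark import SparkConf, SparkContext
from pyspark.mllib.recommendation import ALS, Rating

conf = SparkConf().setAppName("ALSCheckpointSketch")
sc = SparkContext(conf=conf)
sc.setCheckpointDir("/vol/checkpoints")   # hypothetical path with plenty of space

# MovieLens-style input: userID::movieID::rating::timestamp (path is hypothetical).
ratings = (sc.textFile("/data/ml-10M/ratings.dat")
             .map(lambda line: line.split("::"))
             .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2]))))

model = ALS.train(ratings, rank=10, iterations=10, lambda_=0.01)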

Best,
Xiangrui

On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois chris.dub...@gmail.com wrote:
 Thanks for the quick responses!

 I used your final -Dspark.local.dir suggestion, but I see this during the
 initialization of the application:

 14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local directory at
 /vol/spark-local-20140716065608-7b2a

 I would have expected something in /mnt/spark/.

 Thanks,
 Chris



 On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore cdg...@cdgore.com wrote:

 Hi Chris,

 I've encountered this error when running Spark’s ALS methods too.  In my
 case, it was because I set spark.local.dir improperly, and every time there
 was a shuffle, it would spill many GB of data onto the local drive.  What
 fixed it was setting it to use the /mnt directory, where a network drive is
 mounted.  For example, setting an environmental variable:

 export SPACE=$(mount | grep mnt | awk '{print $3/spark/}' | xargs | sed
 's/ /,/g’)

 Then adding -Dspark.local.dir=$SPACE or simply
 -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your driver
 application

 Chris

 On Jul 15, 2014, at 11:39 PM, Xiangrui Meng men...@gmail.com wrote:

  Check the number of inodes (df -i). The assembly build may create many
  small files. -Xiangrui
 
  On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois chris.dub...@gmail.com
  wrote:
  Hi all,
 
  I am encountering the following error:
 
  INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No
  space
  left on device [duplicate 4]
 
  For each slave, df -h looks roughtly like this, which makes the above
  error
  surprising.
 
  FilesystemSize  Used Avail Use% Mounted on
  /dev/xvda17.9G  4.4G  3.5G  57% /
  tmpfs 7.4G  4.0K  7.4G   1% /dev/shm
  /dev/xvdb  37G  3.3G   32G  10% /mnt
  /dev/xvdf  37G  2.0G   34G   6% /mnt2
  /dev/xvdv 500G   33M  500G   1% /vol
 
  I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the
  spark-ec2 scripts and a clone of spark from today. The job I am running
  closely resembles the collaborative filtering example. This issue
  happens
  with the 1M version as well as the 10 million rating version of the
  MovieLens dataset.
 
  I have seen previous questions, but they haven't helped yet. For
  example, I
  tried setting the Spark tmp directory to the EBS volume at /vol/, both
  by
  editing the spark conf file (and copy-dir'ing it to the slaves) as well
  as
  through the SparkConf. Yet I still get the above error. Here is my
  current
  Spark config below. Note that I'm launching via
  ~/spark/bin/spark-submit.
 
  conf = SparkConf()
  conf.setAppName(RecommendALS).set(spark.local.dir,
  /vol/).set(spark.executor.memory, 7g).set(spark.akka.frameSize,
  100).setExecutorEnv(SPARK_JAVA_OPTS,  -Dspark.akka.frameSize=100)
  sc = SparkContext(conf=conf)
 
  Thanks for any advice,
  Chris
 




Re: Error: No space left on device

2014-07-16 Thread Chris DuBois
Hi Xiangrui,

Here is the result on the master node:
$ df -i
FilesystemInodes   IUsed   IFree IUse% Mounted on
/dev/xvda1524288  273997  250291   53% /
tmpfs1917974   1 19179731% /dev/shm
/dev/xvdv524288000  30 5242879701% /vol

I have reproduced the error while using the MovieLens 10M data set on a
newly created cluster.

Thanks for the help.
Chris


On Wed, Jul 16, 2014 at 12:22 AM, Xiangrui Meng men...@gmail.com wrote:

 Hi Chris,

 Could you also try `df -i` on the master node? How many
 blocks/partitions did you set?

 In the current implementation, ALS doesn't clean the shuffle data
 because the operations are chained together. But it shouldn't run out
 of disk space on the MovieLens dataset, which is small. spark-ec2
 script sets /mnt/spark and /mnt/spark2 as the local.dir by default, I
 would recommend leaving this setting as the default value.

 Best,
 Xiangrui

 On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois chris.dub...@gmail.com
 wrote:
  Thanks for the quick responses!
 
  I used your final -Dspark.local.dir suggestion, but I see this during the
  initialization of the application:
 
  14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local directory
 at
  /vol/spark-local-20140716065608-7b2a
 
  I would have expected something in /mnt/spark/.
 
  Thanks,
  Chris
 
 
 
  On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore cdg...@cdgore.com wrote:
 
  Hi Chris,
 
  I've encountered this error when running Spark’s ALS methods too.  In my
  case, it was because I set spark.local.dir improperly, and every time
 there
  was a shuffle, it would spill many GB of data onto the local drive.
  What
  fixed it was setting it to use the /mnt directory, where a network
 drive is
  mounted.  For example, setting an environmental variable:
 
  export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' | xargs | sed
  's/ /,/g')
 
  Then adding -Dspark.local.dir=$SPACE or simply
  -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your driver
  application
 
  Chris
 
  On Jul 15, 2014, at 11:39 PM, Xiangrui Meng men...@gmail.com wrote:
 
   Check the number of inodes (df -i). The assembly build may create many
   small files. -Xiangrui
  
   On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois 
 chris.dub...@gmail.com
   wrote:
   Hi all,
  
   I am encountering the following error:
  
   INFO scheduler.TaskSetManager: Loss was due to java.io.IOException:
 No
   space
   left on device [duplicate 4]
  
    For each slave, df -h looks roughly like this, which makes the above
   error
   surprising.
  
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvda1      7.9G  4.4G  3.5G  57% /
    tmpfs           7.4G  4.0K  7.4G   1% /dev/shm
    /dev/xvdb        37G  3.3G   32G  10% /mnt
    /dev/xvdf        37G  2.0G   34G   6% /mnt2
    /dev/xvdv       500G   33M  500G   1% /vol
  
   I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the
   spark-ec2 scripts and a clone of spark from today. The job I am
 running
   closely resembles the collaborative filtering example. This issue
   happens
   with the 1M version as well as the 10 million rating version of the
   MovieLens dataset.
  
   I have seen previous questions, but they haven't helped yet. For
   example, I
   tried setting the Spark tmp directory to the EBS volume at /vol/,
 both
   by
   editing the spark conf file (and copy-dir'ing it to the slaves) as
 well
   as
   through the SparkConf. Yet I still get the above error. Here is my
   current
   Spark config below. Note that I'm launching via
   ~/spark/bin/spark-submit.
  
   conf = SparkConf()
   conf.setAppName("RecommendALS").set("spark.local.dir",
   "/vol/").set("spark.executor.memory", "7g").set("spark.akka.frameSize",
   "100").setExecutorEnv("SPARK_JAVA_OPTS", "-Dspark.akka.frameSize=100")
   sc = SparkContext(conf=conf)
  
   Thanks for any advice,
   Chris
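
Since df -i keeps coming up in this thread, here is a small sketch (Python, assuming a Linux node where os.statvfs is available; the directory list is illustrative) for checking inode usage of the spark.local.dir locations before kicking off a job:

  import os

  def inode_usage(path):
      # statvfs reports total (f_files) and free (f_ffree) inodes for the
      # filesystem that 'path' lives on.
      st = os.statvfs(path)
      used = st.f_files - st.f_ffree
      pct = 100.0 * used / st.f_files if st.f_files else 0.0
      return used, st.f_files, pct

  for d in "/mnt/spark,/mnt2/spark".split(","):
      if os.path.exists(d):
          used, total, pct = inode_usage(d)
          print("%s: %d of %d inodes used (%.0f%%)" % (d, used, total, pct))

If one of these directories is close to 100% here while df -h still shows free blocks, you are looking at inode exhaustion rather than a full disk.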
  
 
 



Re: Error: No space left on device

2014-07-16 Thread Chris DuBois
Hi Xiangrui,

I accidentally did not send df -i for the master node. Here it is at the
moment of failure:

Filesystem       Inodes   IUsed      IFree IUse% Mounted on
/dev/xvda1       524288  280938     243350   54% /
tmpfs           3845409       1    3845408    1% /dev/shm
/dev/xvdb      10002432    1027   10001405    1% /mnt
/dev/xvdf      10002432      16   10002416    1% /mnt2
/dev/xvdv     524288000      13  524287987    1% /vol

I am using default settings now, but is there a way to make sure that the
proper directories are being used? How many blocks/partitions do you
recommend?

Chris


On Wed, Jul 16, 2014 at 1:09 AM, Chris DuBois chris.dub...@gmail.com
wrote:

 Hi Xiangrui,

 Here is the result on the master node:
 $ df -i
 Filesystem       Inodes   IUsed      IFree IUse% Mounted on
 /dev/xvda1       524288  273997     250291   53% /
 tmpfs           1917974       1    1917973    1% /dev/shm
 /dev/xvdv     524288000      30  524287970    1% /vol

 I have reproduced the error while using the MovieLens 10M data set on a
 newly created cluster.

 Thanks for the help.
 Chris


 On Wed, Jul 16, 2014 at 12:22 AM, Xiangrui Meng men...@gmail.com wrote:

 Hi Chris,

 Could you also try `df -i` on the master node? How many
 blocks/partitions did you set?

 In the current implementation, ALS doesn't clean the shuffle data
 because the operations are chained together. But it shouldn't run out
 of disk space on the MovieLens dataset, which is small. spark-ec2
 script sets /mnt/spark and /mnt/spark2 as the local.dir by default, I
 would recommend leaving this setting as the default value.

 Best,
 Xiangrui

 On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois chris.dub...@gmail.com
 wrote:
  Thanks for the quick responses!
 
  I used your final -Dspark.local.dir suggestion, but I see this during
 the
  initialization of the application:
 
  14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local
 directory at
  /vol/spark-local-20140716065608-7b2a
 
  I would have expected something in /mnt/spark/.
 
  Thanks,
  Chris
 
 
 
  On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore cdg...@cdgore.com wrote:
 
  Hi Chris,
 
  I've encountered this error when running Spark’s ALS methods too.  In
 my
  case, it was because I set spark.local.dir improperly, and every time
 there
  was a shuffle, it would spill many GB of data onto the local drive.
  What
  fixed it was setting it to use the /mnt directory, where a network
 drive is
  mounted.  For example, setting an environmental variable:
 
   export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' | xargs | sed
   's/ /,/g')
 
  Then adding -Dspark.local.dir=$SPACE or simply
  -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your driver
  application
 
  Chris
 
  On Jul 15, 2014, at 11:39 PM, Xiangrui Meng men...@gmail.com wrote:
 
   Check the number of inodes (df -i). The assembly build may create
 many
   small files. -Xiangrui
  
   On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois 
 chris.dub...@gmail.com
   wrote:
   Hi all,
  
   I am encountering the following error:
  
   INFO scheduler.TaskSetManager: Loss was due to java.io.IOException:
 No
   space
   left on device [duplicate 4]
  
    For each slave, df -h looks roughly like this, which makes the
 above
   error
   surprising.
  
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvda1      7.9G  4.4G  3.5G  57% /
    tmpfs           7.4G  4.0K  7.4G   1% /dev/shm
    /dev/xvdb        37G  3.3G   32G  10% /mnt
    /dev/xvdf        37G  2.0G   34G   6% /mnt2
    /dev/xvdv       500G   33M  500G   1% /vol
  
   I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the
   spark-ec2 scripts and a clone of spark from today. The job I am
 running
   closely resembles the collaborative filtering example. This issue
   happens
   with the 1M version as well as the 10 million rating version of the
   MovieLens dataset.
  
   I have seen previous questions, but they haven't helped yet. For
   example, I
   tried setting the Spark tmp directory to the EBS volume at /vol/,
 both
   by
   editing the spark conf file (and copy-dir'ing it to the slaves) as
 well
   as
   through the SparkConf. Yet I still get the above error. Here is my
   current
   Spark config below. Note that I'm launching via
   ~/spark/bin/spark-submit.
  
   conf = SparkConf()
   conf.setAppName("RecommendALS").set("spark.local.dir",
   "/vol/").set("spark.executor.memory", "7g").set("spark.akka.frameSize",
   "100").setExecutorEnv("SPARK_JAVA_OPTS", "-Dspark.akka.frameSize=100")
   sc = SparkContext(conf=conf)
  
   Thanks for any advice,
   Chris
  
 
 







Re: No space left on device error when pulling data from s3

2014-05-15 Thread darkjh
Setting `hadoop.tmp.dir` in `spark-env.sh` solved the problem. The Spark job no
longer writes tmp files to /tmp/hadoop-root/.

  SPARK_JAVA_OPTS+=" -Dspark.local.dir=/mnt/spark,/mnt2/spark -Dhadoop.tmp.dir=/mnt/ephemeral-hdfs"
  export SPARK_JAVA_OPTS

I'm wondering if we need to permanently add this in the spark-ec2 script.
Writing lots of tmp files in the 8GB `/` is not a great idea.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/No-space-left-on-device-error-when-pulling-data-from-s3-tp5450p5518.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
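
If editing spark-env.sh on every node is inconvenient, a rough PySpark alternative is sketched below. It assumes your Spark version supports the spark.hadoop.* passthrough into the Hadoop Configuration, and the bucket name, output path and /mnt locations are all illustrative:

  from pyspark import SparkConf, SparkContext

  conf = (SparkConf()
          .setAppName("s3-to-hdfs-copy")
          # Keep Spark's own scratch space off the small root volume...
          .set("spark.local.dir", "/mnt/spark,/mnt2/spark")
          # ...and do the same for the Hadoop S3 filesystem's buffer files,
          # which otherwise end up under hadoop.tmp.dir in /tmp/hadoop-root.
          .set("spark.hadoop.hadoop.tmp.dir", "/mnt/ephemeral-hdfs"))
  sc = SparkContext(conf=conf)

  sc.textFile("s3://my-bucket/input").saveAsTextFile("hdfs:///user/hadoop/output")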


No space left on device error when pulling data from s3

2014-05-06 Thread Han JU
Hi,

I've a `no space left on device` exception when pulling some 22GB data from
s3 block storage to the ephemeral HDFS. The cluster is on EC2 using
spark-ec2 script with 4 m1.large.

The code is basically:
  val in = sc.textFile("s3://...")
  in.saveAsTextFile("hdfs://...")

Spark creates 750 input partitions based on the input splits. When it
begins throwing this exception, there's no space left on the root file
system on some worker machine:

Filesystem   1K-blocks  Used Available Use% Mounted on
/dev/xvda1 8256952   8256952 0 100% /
tmpfs  3816808 0   3816808   0% /dev/shm
/dev/xvdb    433455904  29840684 381596916   8% /mnt
/dev/xvdf    433455904  29437000 382000600   8% /mnt2

Before the job begins, only 35% is used.

Filesystem   1K-blocks  Used Available Use% Mounted on
/dev/xvda1 8256952   2832256   5340840  35% /
tmpfs  3816808 0   3816808   0% /dev/shm
/dev/xvdb    433455904  29857768 381579832   8% /mnt
/dev/xvdf    433455904  29470104 381967496   8% /mnt2


Any suggestions on this problem? Does Spark cache/store some data before
writing to HDFS?


Full stacktrace:
-
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:345)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at
org.apache.hadoop.fs.s3.Jets3tFileSystemStore.retrieveBlock(Jets3tFileSystemStore.java:210)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at com.sun.proxy.$Proxy8.retrieveBlock(Unknown Source)
at org.apache.hadoop.fs.s3.S3InputStream.blockSeekTo(S3InputStream.java:160)
at org.apache.hadoop.fs.s3.S3InputStream.read(S3InputStream.java:119)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
at
org.apache.hadoop.mapred.LineRecordReader.init(LineRecordReader.java:92)
at
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51)
at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:156)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)


-- 
*JU Han*

Data Engineer @ Botify.com

+33 061960


Re: No space left on device error when pulling data from s3

2014-05-06 Thread Akhil Das
I wonder why your / is full. Try clearing out /tmp and also make sure that in
spark-env.sh you have put SPARK_JAVA_OPTS+=" -Dspark.local.dir=/mnt/spark"

Thanks
Best Regards


On Tue, May 6, 2014 at 9:35 PM, Han JU ju.han.fe...@gmail.com wrote:

 Hi,

 I've a `no space left on device` exception when pulling some 22GB data
 from s3 block storage to the ephemeral HDFS. The cluster is on EC2 using
 spark-ec2 script with 4 m1.large.

 The code is basically:
    val in = sc.textFile("s3://...")
    in.saveAsTextFile("hdfs://...")

 Spark creates 750 input partitions based on the input splits. When it
 begins throwing this exception, there's no space left on the root file
 system on some worker machine:

 Filesystem   1K-blocks  Used Available Use% Mounted on
 /dev/xvda1 8256952   8256952 0 100% /
 tmpfs  3816808 0   3816808   0% /dev/shm
 /dev/xvdb    433455904  29840684 381596916   8% /mnt
 /dev/xvdf    433455904  29437000 382000600   8% /mnt2

 Before the job begins, only 35% is used.

 Filesystem   1K-blocks  Used Available Use% Mounted on
 /dev/xvda1 8256952   2832256   5340840  35% /
 tmpfs  3816808 0   3816808   0% /dev/shm
 /dev/xvdb    433455904  29857768 381579832   8% /mnt
 /dev/xvdf    433455904  29470104 381967496   8% /mnt2


 Any suggestions on this problem? Does Spark cache/store some data
 before writing to HDFS?


 Full stacktrace:
 -
 java.io.IOException: No space left on device
 at java.io.FileOutputStream.writeBytes(Native Method)
  at java.io.FileOutputStream.write(FileOutputStream.java:345)
 at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
  at
 org.apache.hadoop.fs.s3.Jets3tFileSystemStore.retrieveBlock(Jets3tFileSystemStore.java:210)
 at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
  at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
  at
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
 at
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
  at com.sun.proxy.$Proxy8.retrieveBlock(Unknown Source)
 at
 org.apache.hadoop.fs.s3.S3InputStream.blockSeekTo(S3InputStream.java:160)
  at org.apache.hadoop.fs.s3.S3InputStream.read(S3InputStream.java:119)
 at java.io.DataInputStream.read(DataInputStream.java:100)
  at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
 at
 org.apache.hadoop.mapred.LineRecordReader.init(LineRecordReader.java:92)
  at
 org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51)
 at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:156)
  at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
 at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
 at org.apache.spark.scheduler.Task.run(Task.scala:53)
  at
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
 at
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)


 --
 *JU Han*

 Data Engineer @ Botify.com

 +33 061960



Re: No space left on device error when pulling data from s3

2014-05-06 Thread Han JU
After some investigation, I found out that there are lots of temp files under

/tmp/hadoop-root/s3/

But this is strange since in both conf files,
~/ephemeral-hdfs/conf/core-site.xml and ~/spark/conf/core-site.xml, the
setting `hadoop.tmp.dir` is set to `/mnt/ephemeral-hdfs/`. Why do Spark jobs
still write temp files to /tmp/hadoop-root?


2014-05-06 18:05 GMT+02:00 Han JU ju.han.fe...@gmail.com:

 Hi,

 I've a `no space left on device` exception when pulling some 22GB data
 from s3 block storage to the ephemeral HDFS. The cluster is on EC2 using
 spark-ec2 script with 4 m1.large.

 The code is basically:
    val in = sc.textFile("s3://...")
    in.saveAsTextFile("hdfs://...")

 Spark creates 750 input partitions based on the input splits. When it
 begins throwing this exception, there's no space left on the root file
 system on some worker machine:

 Filesystem   1K-blocks  Used Available Use% Mounted on
 /dev/xvda1 8256952   8256952 0 100% /
 tmpfs  3816808 0   3816808   0% /dev/shm
 /dev/xvdb    433455904  29840684 381596916   8% /mnt
 /dev/xvdf    433455904  29437000 382000600   8% /mnt2

 Before the job begins, only 35% is used.

 Filesystem   1K-blocks  Used Available Use% Mounted on
 /dev/xvda1 8256952   2832256   5340840  35% /
 tmpfs  3816808 0   3816808   0% /dev/shm
 /dev/xvdb    433455904  29857768 381579832   8% /mnt
 /dev/xvdf    433455904  29470104 381967496   8% /mnt2


 Any suggestions on this problem? Does Spark cache/store some data
 before writing to HDFS?


 Full stacktrace:
 -
 java.io.IOException: No space left on device
 at java.io.FileOutputStream.writeBytes(Native Method)
  at java.io.FileOutputStream.write(FileOutputStream.java:345)
 at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
  at
 org.apache.hadoop.fs.s3.Jets3tFileSystemStore.retrieveBlock(Jets3tFileSystemStore.java:210)
 at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
  at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
  at
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
 at
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
  at com.sun.proxy.$Proxy8.retrieveBlock(Unknown Source)
 at
 org.apache.hadoop.fs.s3.S3InputStream.blockSeekTo(S3InputStream.java:160)
  at org.apache.hadoop.fs.s3.S3InputStream.read(S3InputStream.java:119)
 at java.io.DataInputStream.read(DataInputStream.java:100)
  at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
 at
 org.apache.hadoop.mapred.LineRecordReader.init(LineRecordReader.java:92)
  at
 org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51)
 at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:156)
  at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
 at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
 at org.apache.spark.scheduler.Task.run(Task.scala:53)
  at
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
 at
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)


 --
 *JU Han*

 Data Engineer @ Botify.com

 +33 061960




-- 
*JU Han*

Data Engineer @ Botify.com

+33 061960


Re: No space left on device exception

2014-03-24 Thread Ognen Duzlevski
Patrick, correct. I have a 16 node cluster. On 14 machines out of 16, 
the inode usage was about 50%. On two of the slaves, one had inode usage 
of 96% and on the other it was 100%. When I went into /tmp on these two 
nodes - there were a bunch of /tmp/spark* subdirectories which I 
deleted. This resulted in the inode consumption falling back down to 50% 
and the job running successfully to completion. The slave with the 100% 
inode usage had the spark/work/app/number/stdout with the message that 
the filesystem is running out of disk space (which I posted in the 
original email that started this thread).


What is interesting is that only two out of the 16 slaves had this 
problem :)


Ognen

On 3/24/14, 12:57 AM, Patrick Wendell wrote:

Ognen - just so I understand. The issue is that there weren't enough
inodes and this was causing a "No space left on device" error? Is that
correct? If so, that's good to know because it's definitely counter
intuitive.

On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
og...@nengoiksvelzud.com wrote:

I would love to work on this (and other) stuff if I can bother someone with
questions offline or on a dev mailing list.
Ognen


On 3/23/14, 10:04 PM, Aaron Davidson wrote:

Thanks for bringing this up, 100% inode utilization is an issue I haven't
seen raised before and this raises another issue which is not on our current
roadmap for state cleanup (cleaning up data which was not fully cleaned up
from a crashed process).


On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski
og...@plainvanillagames.com wrote:

Bleh, strike that, one of my slaves was at 100% inode utilization on the
file system. It was /tmp/spark* leftovers that apparently did not get
cleaned up properly after failed or interrupted jobs.
Mental note - run a cron job on all slaves and master to clean up
/tmp/spark* regularly.

Thanks (and sorry for the noise)!
Ognen


On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:

Aaron, thanks for replying. I am very much puzzled as to what is going on.
A job that used to run on the same cluster is failing with this mysterious
message about not having enough disk space when in fact I can see through
watch df -h that the free space is always hovering around 3+GB on the disk
and the free inodes are at 50% (this is on master). I went through each
slave and the spark/work/app*/stderr and stdout and spark/logs/*out files
and no mention of too many open files failures on any of the slaves nor on
the master :(

Thanks
Ognen

On 3/23/14, 8:38 PM, Aaron Davidson wrote:

By default, with P partitions (for both the pre-shuffle stage and
post-shuffle), there are P^2 files created. With
spark.shuffle.consolidateFiles turned on, we would instead create only P
files. Disk space consumption is largely unaffected, however, by the number
of partitions unless each partition is particularly small.

You might look at the actual executors' logs, as it's possible that this
error was caused by an earlier exception, such as too many open files.


On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
og...@plainvanillagames.com wrote:

On 3/23/14, 5:49 PM, Matei Zaharia wrote:

You can set spark.local.dir to put this data somewhere other than /tmp if
/tmp is full. Actually it's recommended to have multiple local disks and set
it to a comma-separated list of directories, one per disk.

Matei, does the number of tasks/partitions in a transformation influence
something in terms of disk space consumption? Or inode consumption?

Thanks,
Ognen



--
A distributed system is one in which the failure of a computer you didn't
even know existed can render your own computer unusable
-- Leslie Lamport



--
No matter what they ever do to us, we must always act for the love of our
people and the earth. We must not react out of hatred against those who have
no sense.
-- John Trudell


--
“A distributed system is one in which the failure of a computer you didn’t even 
know existed can render your own computer unusable”
-- Leslie Lamport
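
For reference, the shuffle behaviour Aaron describes above (roughly P^2 small files for P partitions with the hash-based shuffle of that era) can be tuned from the driver. A rough sketch, with the option name taken from the quoted advice and the partition count and path purely illustrative:

  from pyspark import SparkConf, SparkContext

  conf = (SparkConf()
          .setAppName("als-with-consolidated-shuffle")
          # Consolidate hash-shuffle output so roughly P files are written
          # instead of P^2, easing inode pressure in /tmp or spark.local.dir.
          .set("spark.shuffle.consolidateFiles", "true"))
  sc = SparkContext(conf=conf)

  # Fewer, larger partitions also mean fewer shuffle files.
  ratings = sc.textFile("hdfs:///movielens/ratings.dat", minPartitions=64)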



Re: No space left on device exception

2014-03-24 Thread Ognen Duzlevski
Another thing I have noticed is that out of my master+15 slaves, two 
slaves always carry a higher inode load. So for example right now I am 
running an intensive job that takes about an hour to finish and two 
slaves have been showing an increase in inode consumption (they are 
about 10% above the rest of the slaves+master) and increasing.


Ognen

On 3/24/14, 7:00 AM, Ognen Duzlevski wrote:
Patrick, correct. I have a 16 node cluster. On 14 machines out of 16, 
the inode usage was about 50%. On two of the slaves, one had inode 
usage of 96% and on the other it was 100%. When I went into /tmp on 
these two nodes - there were a bunch of /tmp/spark* subdirectories 
which I deleted. This resulted in the inode consumption falling back 
down to 50% and the job running successfully to completion. The slave 
with the 100% inode usage had the spark/work/app/number/stdout with 
the message that the filesystem is running out of disk space (which I 
posted in the original email that started this thread).


What is interesting is that only two out of the 16 slaves had this 
problem :)


Ognen

On 3/24/14, 12:57 AM, Patrick Wendell wrote:

Ognen - just so I understand. The issue is that there weren't enough
inodes and this was causing a "No space left on device" error? Is that
correct? If so, that's good to know because it's definitely counter
intuitive.

On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
og...@nengoiksvelzud.com wrote:
I would love to work on this (and other) stuff if I can bother 
someone with

questions offline or on a dev mailing list.
Ognen


On 3/23/14, 10:04 PM, Aaron Davidson wrote:

Thanks for bringing this up, 100% inode utilization is an issue I 
haven't
seen raised before and this raises another issue which is not on our 
current
roadmap for state cleanup (cleaning up data which was not fully 
cleaned up

from a crashed process).


On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski
og...@plainvanillagames.com wrote:
Bleh, strike that, one of my slaves was at 100% inode utilization 
on the

file system. It was /tmp/spark* leftovers that apparently did not get
cleaned up properly after failed or interrupted jobs.
Mental note - run a cron job on all slaves and master to clean up
/tmp/spark* regularly.

Thanks (and sorry for the noise)!
Ognen


On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:

Aaron, thanks for replying. I am very much puzzled as to what is 
going on.
A job that used to run on the same cluster is failing with this 
mysterious
message about not having enough disk space when in fact I can see 
through
watch df -h that the free space is always hovering around 3+GB on 
the disk
and the free inodes are at 50% (this is on master). I went through 
each
slave and the spark/work/app*/stderr and stdout and spark/logs/*out 
files
and no mention of too many open files failures on any of the slaves 
nor on

the master :(

Thanks
Ognen

On 3/23/14, 8:38 PM, Aaron Davidson wrote:

By default, with P partitions (for both the pre-shuffle stage and
post-shuffle), there are P^2 files created. With
spark.shuffle.consolidateFiles turned on, we would instead create 
only P
files. Disk space consumption is largely unaffected, however, by 
the number

of partitions unless each partition is particularly small.

You might look at the actual executors' logs, as it's possible that 
this
error was caused by an earlier exception, such as too many open 
files.



On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
og...@plainvanillagames.com wrote:

On 3/23/14, 5:49 PM, Matei Zaharia wrote:

You can set spark.local.dir to put this data somewhere other than 
/tmp if
/tmp is full. Actually it's recommended to have multiple local 
disks and set

it to a comma-separated list of directories, one per disk.

Matei, does the number of tasks/partitions in a transformation 
influence

something in terms of disk space consumption? Or inode consumption?

Thanks,
Ognen



--
A distributed system is one in which the failure of a computer you 
didn't

even know existed can render your own computer unusable
-- Leslie Lamport



--
No matter what they ever do to us, we must always act for the love 
of our
people and the earth. We must not react out of hatred against those 
who have

no sense.
-- John Trudell




--
“A distributed system is one in which the failure of a computer you didn’t even 
know existed can render your own computer unusable”
-- Leslie Lamport



No space left on device exception

2014-03-23 Thread Ognen Duzlevski

Hello,

I have a weird error showing up when I run a job on my Spark cluster. 
The version of spark is 0.9 and I have 3+ GB free on the disk when this 
error shows up. Any ideas what I should be looking for?


[error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task 
167.0:3 failed 4 times (most recent failure: Exception failure: 
java.io.FileNotFoundException: 
/tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left 
on device))
org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed 4 
times (most recent failure: Exception failure: 
java.io.FileNotFoundException: 
/tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left 
on device))
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)

at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)


Thanks!
Ognen


Re: No space left on device exception

2014-03-23 Thread Aaron Davidson
On some systems, /tmp/ is an in-memory tmpfs file system, with its own size
limit. It's possible that this limit has been exceeded. You might try
running the df command to check the free space of /tmp, or of root if /tmp
isn't listed.

3 GB also seems pretty low for the remaining free space of a disk. If your
disk size is in the TB range, it's possible that the last couple GB have
issues when being allocated due to fragmentation or reclamation policies.


On Sun, Mar 23, 2014 at 3:06 PM, Ognen Duzlevski
og...@nengoiksvelzud.com wrote:

 Hello,

 I have a weird error showing up when I run a job on my Spark cluster. The
 version of spark is 0.9 and I have 3+ GB free on the disk when this error
 shows up. Any ideas what I should be looking for?

 [error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task
 167.0:3 failed 4 times (most recent failure: Exception failure:
 java.io.FileNotFoundException: 
 /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127
 (No space left on device))
 org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed 4 times
 (most recent failure: Exception failure: java.io.FileNotFoundException:
 /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left
 on device))
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$
 apache$spark$scheduler$DAGScheduler$$abortStage$1.
 apply(DAGScheduler.scala:1028)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$
 apache$spark$scheduler$DAGScheduler$$abortStage$1.
 apply(DAGScheduler.scala:1026)
 at scala.collection.mutable.ResizableArray$class.foreach(
 ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$
 scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$
 processEvent$10.apply(DAGScheduler.scala:619)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$
 processEvent$10.apply(DAGScheduler.scala:619)
 at scala.Option.foreach(Option.scala:236)
 at org.apache.spark.scheduler.DAGScheduler.processEvent(
 DAGScheduler.scala:619)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$
 $anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)

 Thanks!
 Ognen
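
To rule out the tmpfs case Aaron mentions, here is a small Linux-only sketch (it reads /proc/mounts, so it assumes a Linux box) that reports what /tmp is actually mounted on and how much space that filesystem has left:

  import os

  def fs_of(path):
      # Return the mount point and filesystem type covering 'path', using the
      # longest matching mount point listed in /proc/mounts (Linux only).
      best, best_type = "", "unknown"
      with open("/proc/mounts") as mounts:
          for line in mounts:
              _, mount_point, fstype = line.split()[:3]
              if path.startswith(mount_point) and len(mount_point) > len(best):
                  best, best_type = mount_point, fstype
      return best, best_type

  mount_point, fstype = fs_of("/tmp")
  st = os.statvfs("/tmp")
  free_gb = st.f_bavail * st.f_frsize / float(1024 ** 3)
  print("/tmp is on %s (%s) with %.1f GB free" % (mount_point, fstype, free_gb))

If the filesystem type comes back as tmpfs, the directory is limited by the tmpfs size rather than the disk.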



Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski

On 3/23/14, 5:49 PM, Matei Zaharia wrote:
You can set spark.local.dir to put this data somewhere other than /tmp 
if /tmp is full. Actually it’s recommended to have multiple local 
disks and set it to a comma-separated list of directories, one per disk.
Matei, does the number of tasks/partitions in a transformation influence 
something in terms of disk space consumption? Or inode consumption?


Thanks,
Ognen


Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
Aaron, thanks for replying. I am very much puzzled as to what is going 
on. A job that used to run on the same cluster is failing with this 
mysterious message about not having enough disk space when in fact I can 
see through watch df -h that the free space is always hovering around 
3+GB on the disk and the free inodes are at 50% (this is on master). I 
went through each slave and the spark/work/app*/stderr and stdout and 
spark/logs/*out files and no mention of too many open files failures on 
any of the slaves nor on the master :(


Thanks
Ognen

On 3/23/14, 8:38 PM, Aaron Davidson wrote:
By default, with P partitions (for both the pre-shuffle stage and 
post-shuffle), there are P^2 files created. 
With spark.shuffle.consolidateFiles turned on, we would instead create 
only P files. Disk space consumption is largely unaffected, however, 
by the number of partitions unless each partition is particularly small.


You might look at the actual executors' logs, as it's possible that 
this error was caused by an earlier exception, such as too many open 
files.



On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski 
og...@plainvanillagames.com wrote:


On 3/23/14, 5:49 PM, Matei Zaharia wrote:

You can set spark.local.dir to put this data somewhere other than
/tmp if /tmp is full. Actually it's recommended to have multiple
local disks and set it to a comma-separated list of directories,
one per disk.

Matei, does the number of tasks/partitions in a transformation
influence something in terms of disk space consumption? Or inode
consumption?

Thanks,
Ognen





Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
Bleh, strike that, one of my slaves was at 100% inode utilization on the 
file system. It was /tmp/spark* leftovers that apparently did not get 
cleaned up properly after failed or interrupted jobs.
Mental note - run a cron job on all slaves and master to clean up 
/tmp/spark* regularly.


Thanks (and sorry for the noise)!
Ognen

On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
Aaron, thanks for replying. I am very much puzzled as to what is going 
on. A job that used to run on the same cluster is failing with this 
mysterious message about not having enough disk space when in fact I 
can see through watch df -h that the free space is always hovering 
around 3+GB on the disk and the free inodes are at 50% (this is on 
master). I went through each slave and the spark/work/app*/stderr and 
stdout and spark/logs/*out files and no mention of too many open files 
failures on any of the slaves nor on the master :(


Thanks
Ognen

On 3/23/14, 8:38 PM, Aaron Davidson wrote:
By default, with P partitions (for both the pre-shuffle stage and 
post-shuffle), there are P^2 files created. 
With spark.shuffle.consolidateFiles turned on, we would instead 
create only P files. Disk space consumption is largely unaffected, 
however, by the number of partitions unless each partition is 
particularly small.


You might look at the actual executors' logs, as it's possible that 
this error was caused by an earlier exception, such as too many open 
files.



On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski 
og...@plainvanillagames.com wrote:


On 3/23/14, 5:49 PM, Matei Zaharia wrote:

You can set spark.local.dir to put this data somewhere other
than /tmp if /tmp is full. Actually it's recommended to have
multiple local disks and set it to a comma-separated list of
directories, one per disk.

Matei, does the number of tasks/partitions in a transformation
influence something in terms of disk space consumption? Or inode
consumption?

Thanks,
Ognen





--
A distributed system is one in which the failure of a computer you didn't even know 
existed can render your own computer unusable
-- Leslie Lamport
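
A rough sketch of the cleanup cron job mentioned above, written in Python so it can be dropped onto each node as-is (the seven-day threshold and the /tmp/spark* pattern are illustrative; only directories that have not been modified recently are removed, to avoid pulling scratch space out from under a running job):

  import glob
  import os
  import shutil
  import time

  MAX_AGE_SECONDS = 7 * 24 * 3600  # illustrative: anything untouched for a week

  now = time.time()
  for path in glob.glob("/tmp/spark*"):
      try:
          if os.path.isdir(path) and now - os.path.getmtime(path) > MAX_AGE_SECONDS:
              shutil.rmtree(path, ignore_errors=True)
              print("removed stale Spark scratch dir: %s" % path)
      except OSError:
          pass  # the directory may vanish while we are looking at it

Run it from cron on the master and every slave, as suggested in the message above.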



Re: No space left on device exception

2014-03-23 Thread Aaron Davidson
Thanks for bringing this up, 100% inode utilization is an issue I haven't
seen raised before and this raises another issue which is not on our
current roadmap for state cleanup (cleaning up data which was not fully
cleaned up from a crashed process).


On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski 
og...@plainvanillagames.com wrote:

  Bleh, strike that, one of my slaves was at 100% inode utilization on the
 file system. It was /tmp/spark* leftovers that apparently did not get
 cleaned up properly after failed or interrupted jobs.
 Mental note - run a cron job on all slaves and master to clean up
 /tmp/spark* regularly.

 Thanks (and sorry for the noise)!
 Ognen


 On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:

 Aaron, thanks for replying. I am very much puzzled as to what is going on.
 A job that used to run on the same cluster is failing with this mysterious
 message about not having enough disk space when in fact I can see through
 watch df -h that the free space is always hovering around 3+GB on the
 disk and the free inodes are at 50% (this is on master). I went through
 each slave and the spark/work/app*/stderr and stdout and spark/logs/*out
 files and no mention of too many open files failures on any of the slaves
 nor on the master :(

 Thanks
 Ognen

 On 3/23/14, 8:38 PM, Aaron Davidson wrote:

 By default, with P partitions (for both the pre-shuffle stage and
 post-shuffle), there are P^2 files created.
 With spark.shuffle.consolidateFiles turned on, we would instead create only
 P files. Disk space consumption is largely unaffected, however. by the
 number of partitions unless each partition is particularly small.

  You might look at the actual executors' logs, as it's possible that this
 error was caused by an earlier exception, such as too many open files.


 On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski 
 og...@plainvanillagames.com wrote:

  On 3/23/14, 5:49 PM, Matei Zaharia wrote:

 You can set spark.local.dir to put this data somewhere other than /tmp if
 /tmp is full. Actually it's recommended to have multiple local disks and
 set it to a comma-separated list of directories, one per disk.

  Matei, does the number of tasks/partitions in a transformation influence
 something in terms of disk space consumption? Or inode consumption?

 Thanks,
 Ognen



 --
 A distributed system is one in which the failure of a computer you didn't 
 even know existed can render your own computer unusable
 -- Leslie Lamport




Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
I would love to work on this (and other) stuff if I can bother someone 
with questions offline or on a dev mailing list.

Ognen

On 3/23/14, 10:04 PM, Aaron Davidson wrote:
Thanks for bringing this up, 100% inode utilization is an issue I 
haven't seen raised before and this raises another issue which is not 
on our current roadmap for state cleanup (cleaning up data which was 
not fully cleaned up from a crashed process).



On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski 
og...@plainvanillagames.com wrote:


Bleh, strike that, one of my slaves was at 100% inode utilization
on the file system. It was /tmp/spark* leftovers that apparently
did not get cleaned up properly after failed or interrupted jobs.
Mental note - run a cron job on all slaves and master to clean up
/tmp/spark* regularly.

Thanks (and sorry for the noise)!
Ognen


On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:

Aaron, thanks for replying. I am very much puzzled as to what is
going on. A job that used to run on the same cluster is failing
with this mysterious message about not having enough disk space
when in fact I can see through watch df -h that the free space
is always hovering around 3+GB on the disk and the free inodes
are at 50% (this is on master). I went through each slave and the
spark/work/app*/stderr and stdout and spark/logs/*out files and
no mention of too many open files failures on any of the slaves
nor on the master :(

Thanks
Ognen

On 3/23/14, 8:38 PM, Aaron Davidson wrote:

By default, with P partitions (for both the pre-shuffle stage
and post-shuffle), there are P^2 files created.
With spark.shuffle.consolidateFiles turned on, we would instead
create only P files. Disk space consumption is largely
unaffected, however, by the number of partitions unless each
partition is particularly small.

You might look at the actual executors' logs, as it's possible
that this error was caused by an earlier exception, such as too
many open files.


On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
og...@plainvanillagames.com wrote:

On 3/23/14, 5:49 PM, Matei Zaharia wrote:

You can set spark.local.dir to put this data somewhere
other than /tmp if /tmp is full. Actually it's recommended
to have multiple local disks and set it to a
comma-separated list of directories, one per disk.

Matei, does the number of tasks/partitions in a
transformation influence something in terms of disk space
consumption? Or inode consumption?

Thanks,
Ognen





-- 
A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable

-- Leslie Lamport



--
No matter what they ever do to us, we must always act for the love of our people 
and the earth. We must not react out of hatred against those who have no sense.
-- John Trudell


Re: No space left on device exception

2014-03-23 Thread Patrick Wendell
Ognen - just so I understand. The issue is that there weren't enough
inodes and this was causing a "No space left on device" error? Is that
correct? If so, that's good to know because it's definitely counter
intuitive.

On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
og...@nengoiksvelzud.com wrote:
 I would love to work on this (and other) stuff if I can bother someone with
 questions offline or on a dev mailing list.
 Ognen


 On 3/23/14, 10:04 PM, Aaron Davidson wrote:

 Thanks for bringing this up, 100% inode utilization is an issue I haven't
 seen raised before and this raises another issue which is not on our current
 roadmap for state cleanup (cleaning up data which was not fully cleaned up
 from a crashed process).


 On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski
 og...@plainvanillagames.com wrote:

 Bleh, strike that, one of my slaves was at 100% inode utilization on the
 file system. It was /tmp/spark* leftovers that apparently did not get
 cleaned up properly after failed or interrupted jobs.
 Mental note - run a cron job on all slaves and master to clean up
 /tmp/spark* regularly.

 Thanks (and sorry for the noise)!
 Ognen


 On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:

 Aaron, thanks for replying. I am very much puzzled as to what is going on.
 A job that used to run on the same cluster is failing with this mysterious
 message about not having enough disk space when in fact I can see through
 watch df -h that the free space is always hovering around 3+GB on the disk
 and the free inodes are at 50% (this is on master). I went through each
 slave and the spark/work/app*/stderr and stdout and spark/logs/*out files
 and no mention of too many open files failures on any of the slaves nor on
 the master :(

 Thanks
 Ognen

 On 3/23/14, 8:38 PM, Aaron Davidson wrote:

 By default, with P partitions (for both the pre-shuffle stage and
 post-shuffle), there are P^2 files created. With
 spark.shuffle.consolidateFiles turned on, we would instead create only P
 files. Disk space consumption is largely unaffected, however, by the number
 of partitions unless each partition is particularly small.

 You might look at the actual executors' logs, as it's possible that this
 error was caused by an earlier exception, such as too many open files.


 On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
 og...@plainvanillagames.com wrote:

 On 3/23/14, 5:49 PM, Matei Zaharia wrote:

 You can set spark.local.dir to put this data somewhere other than /tmp if
 /tmp is full. Actually it's recommended to have multiple local disks and set
 it to a comma-separated list of directories, one per disk.

 Matei, does the number of tasks/partitions in a transformation influence
 something in terms of disk space consumption? Or inode consumption?

 Thanks,
 Ognen



 --
 A distributed system is one in which the failure of a computer you didn't
 even know existed can render your own computer unusable
 -- Leslie Lamport



 --
 No matter what they ever do to us, we must always act for the love of our
 people and the earth. We must not react out of hatred against those who have
 no sense.
 -- John Trudell