Hi Xiangrui,

Thanks. I have taken your advice and set all 5 of my slaves to be
c3.4xlarge. In this case /mnt and /mnt2 have plenty of space by default. I
now do sc.textFile(blah).repartition(N).map(...).cache() with N=80, set
spark.executor.memory to 20g, and pass --driver-memory 20g. So far things
seem more stable.
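
For reference, here is roughly what the driver script looks like now (the
data path and the parsing lambda are placeholders, not my exact code):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("RecommendALS") \
                  .set("spark.executor.memory", "20g")
sc = SparkContext(conf=conf)

# N=80 partitions, cached before training; path/parsing are placeholders
ratings = (sc.textFile("hdfs:///data/ratings.dat")
             .repartition(80)
             .map(lambda line: line.split("::"))
             .cache())

I launch it with ~/spark/bin/spark-submit --driver-memory 20g, as before.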

Thanks for the help,
Chris


On Thu, Jul 17, 2014 at 12:56 AM, Xiangrui Meng <men...@gmail.com> wrote:

> Set N to be the total number of cores on the cluster, or fewer. sc.textFile
> doesn't always give you that number of partitions; it depends on the block
> size. For MovieLens, I think the default behavior is 2~3 partitions, so you
> need to call repartition to ensure the right number of partitions.
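>
> For example, something along these lines (the core count and the path are
> just placeholders):
>
> numCores = 40  # e.g. 5 slaves x 8 cores; use your cluster's total
> ratings = sc.textFile("ratings.dat").repartition(numCores)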
>
> Which EC2 instance type did you use? I usually use m3.2xlarge or c?
> instances that come with SSD and 1G or 10G network. For those
> instances, you should see local drives mounted at /mnt, /mnt2, /mnt3,
> ... Make sure there was no error when you used the ec2 script to
> launch the cluster.
>
> It is a little strange to see 94% of / used on a slave. Maybe the
> shuffle data went to /; I'm not sure which setting went wrong. I
> recommend re-launching the cluster with m3.2xlarge instances and
> using the default settings (don't set anything in SparkConf), and
> submitting the application with --driver-memory 20g.
>
> The running times are slower than what I remember, but it depends on
> the instance type.
>
> Best,
> Xiangrui
>
>
>
> On Wed, Jul 16, 2014 at 10:18 PM, Chris DuBois <chris.dub...@gmail.com>
> wrote:
> > Hi Xiangrui,
> >
> > I will try this shortly. When using N partitions, do you recommend N be
> > the number of cores on each slave or the number of cores on the master?
> > Forgive my ignorance, but is this best achieved as an argument to
> > sc.textFile?
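> >
> > For instance, I wasn't sure whether either of these is the right way to
> > do it (N being whatever number you suggest):
> >
> > raw = sc.textFile("ratings.dat", N)            # minPartitions argument
> > raw = sc.textFile("ratings.dat").repartition(N)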
> >
> > The slaves on the EC2 clusters start with only 8gb of storage, and it
> > doesn't seem that /mnt/spark and /mnt2/spark are mounted anywhere else by
> > default. Looking at spark-ec2/setup-slaves.sh, it appears that these are
> > only mounted if the instance type begins with r3. (Or am I not reading
> > that right?) My slaves are a different instance type, and currently look
> > like this:
> > Filesystem            Size  Used Avail Use% Mounted on
> > /dev/xvda1            7.9G  7.3G  515M  94% /
> > tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
> > /dev/xvdv             500G  2.5G  498G   1% /vol
> >
> > I have been able to finish ALS on MovieLens 10M only twice, taking 221s
> > and 315s for 20 iterations at K=20 and lambda=0.01. Does that timing
> > sound about right, or does it point to a poor configuration? The same
> > script with MovieLens 1M runs fine in about 30-40s with the same
> > settings. (In both cases I'm training on 70% of the data.)
> >
> > Thanks for your help!
> > Chris
> >
> >
> > On Wed, Jul 16, 2014 at 4:29 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >>
> >> For ALS, I would recommend repartitioning the ratings to match the
> >> number of CPU cores, or even fewer. ALS is not computation heavy for
> >> small k, but it is communication heavy, so having a small number of
> >> partitions may help. For EC2 clusters, we use /mnt/spark and
> >> /mnt2/spark as the default local directories because they are local
> >> drives. Did your last run of ALS on MovieLens 10M-100K with the
> >> default settings succeed? -Xiangrui
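> >>
> >> PS: a rough sketch of what I mean (numCores, rawLines, and the field
> >> parsing are placeholders):
> >>
> >> from pyspark.mllib.recommendation import ALS
> >>
> >> numCores = 40  # total number of cores in the cluster, or fewer
> >> ratings = (rawLines.map(lambda l: l.split("::"))
> >>                    .map(lambda x: (int(x[0]), int(x[1]), float(x[2])))
> >>                    .repartition(numCores)
> >>                    .cache())
> >> model = ALS.train(ratings, rank=20, iterations=20, lambda_=0.01)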
> >>
> >> On Wed, Jul 16, 2014 at 8:00 AM, Chris DuBois <chris.dub...@gmail.com>
> >> wrote:
> >> > Hi Xiangrui,
> >> >
> >> > I accidentally did not send df -i for the master node. Here it is at
> >> > the moment of failure:
> >> >
> >> > Filesystem            Inodes   IUsed   IFree IUse% Mounted on
> >> > /dev/xvda1            524288  280938  243350   54% /
> >> > tmpfs                3845409       1 3845408    1% /dev/shm
> >> > /dev/xvdb            10002432    1027 10001405    1% /mnt
> >> > /dev/xvdf            10002432      16 10002416    1% /mnt2
> >> > /dev/xvdv            524288000      13 524287987    1% /vol
> >> >
> >> > I am using default settings now, but is there a way to make sure that
> >> > the proper directories are being used? How many blocks/partitions do
> >> > you recommend?
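> >> > (Is that the blocks argument to ALS.train, e.g. something like
> >> > ALS.train(ratings, 20, 20, 0.01, blocks=40)? Just guessing at a value
> >> > here.)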
> >> >
> >> > Chris
> >> >
> >> >
> >> > On Wed, Jul 16, 2014 at 1:09 AM, Chris DuBois <chris.dub...@gmail.com>
> >> > wrote:
> >> >>
> >> >> Hi Xiangrui,
> >> >>
> >> >> Here is the result on the master node:
> >> >> $ df -i
> >> >> Filesystem            Inodes   IUsed   IFree IUse% Mounted on
> >> >> /dev/xvda1            524288  273997  250291   53% /
> >> >> tmpfs                1917974       1 1917973    1% /dev/shm
> >> >> /dev/xvdv            524288000      30 524287970    1% /vol
> >> >>
> >> >> I have reproduced the error while using the MovieLens 10M data set
> >> >> on a newly created cluster.
> >> >>
> >> >> Thanks for the help.
> >> >> Chris
> >> >>
> >> >>
> >> >> On Wed, Jul 16, 2014 at 12:22 AM, Xiangrui Meng <men...@gmail.com>
> >> >> wrote:
> >> >>>
> >> >>> Hi Chris,
> >> >>>
> >> >>> Could you also try `df -i` on the master node? How many
> >> >>> blocks/partitions did you set?
> >> >>>
> >> >>> In the current implementation, ALS doesn't clean up the shuffle data
> >> >>> because the operations are chained together. But it shouldn't run
> >> >>> out of disk space on the MovieLens dataset, which is small. The
> >> >>> spark-ec2 script sets /mnt/spark and /mnt2/spark as spark.local.dir
> >> >>> by default; I would recommend leaving that setting at its default
> >> >>> value.
> >> >>>
> >> >>> Best,
> >> >>> Xiangrui
> >> >>>
> >> >>> On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois
> >> >>> <chris.dub...@gmail.com>
> >> >>> wrote:
> >> >>> > Thanks for the quick responses!
> >> >>> >
> >> >>> > I used your final -Dspark.local.dir suggestion, but I see this
> >> >>> > during the initialization of the application:
> >> >>> >
> >> >>> > 14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local
> >> >>> > directory at /vol/spark-local-20140716065608-7b2a
> >> >>> >
> >> >>> > I would have expected something in /mnt/spark/.
> >> >>> >
> >> >>> > Thanks,
> >> >>> > Chris
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> > On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore <cdg...@cdgore.com>
> >> >>> > wrote:
> >> >>> >>
> >> >>> >> Hi Chris,
> >> >>> >>
> >> >>> >> I've encountered this error when running Spark’s ALS methods too.
> >> >>> >> In my case, it was because I set spark.local.dir improperly, and
> >> >>> >> every time there was a shuffle, it would spill many GB of data
> >> >>> >> onto the local drive. What fixed it was setting it to use the
> >> >>> >> /mnt directory, where a network drive is mounted. For example,
> >> >>> >> setting an environment variable:
> >> >>> >>
> >> >>> >> export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' |
> >> >>> >>   xargs | sed 's/ /,/g')
> >> >>> >>
> >> >>> >> Then add -Dspark.local.dir=$SPACE, or simply
> >> >>> >> -Dspark.local.dir=/mnt/spark/,/mnt2/spark/, when you run your
> >> >>> >> driver application.
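> >> >>> >>
> >> >>> >> Or, roughly the equivalent from PySpark when building the conf
> >> >>> >> (paths are just examples; it has to be set before the
> >> >>> >> SparkContext is created):
> >> >>> >>
> >> >>> >> conf = SparkConf().set("spark.local.dir", "/mnt/spark/,/mnt2/spark/")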
> >> >>> >>
> >> >>> >> Chris
> >> >>> >>
> >> >>> >> On Jul 15, 2014, at 11:39 PM, Xiangrui Meng <men...@gmail.com>
> >> >>> >> wrote:
> >> >>> >>
> >> >>> >> > Check the number of inodes (df -i). The assembly build may
> >> >>> >> > create many small files. -Xiangrui
> >> >>> >> >
> >> >>> >> > On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois
> >> >>> >> > <chris.dub...@gmail.com>
> >> >>> >> > wrote:
> >> >>> >> >> Hi all,
> >> >>> >> >>
> >> >>> >> >> I am encountering the following error:
> >> >>> >> >>
> >> >>> >> >> INFO scheduler.TaskSetManager: Loss was due to
> >> >>> >> >> java.io.IOException: No space left on device [duplicate 4]
> >> >>> >> >>
> >> >>> >> >> For each slave, df -h looks roughly like this, which makes the
> >> >>> >> >> above error surprising:
> >> >>> >> >> Filesystem            Size  Used Avail Use% Mounted on
> >> >>> >> >> /dev/xvda1            7.9G  4.4G  3.5G  57% /
> >> >>> >> >> tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
> >> >>> >> >> /dev/xvdb              37G  3.3G   32G  10% /mnt
> >> >>> >> >> /dev/xvdf              37G  2.0G   34G   6% /mnt2
> >> >>> >> >> /dev/xvdv             500G   33M  500G   1% /vol
> >> >>> >> >>
> >> >>> >> >> I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched
> >> >>> >> >> using the spark-ec2 scripts and a clone of spark from today.
> >> >>> >> >> The job I am running closely resembles the collaborative
> >> >>> >> >> filtering example. This issue happens with the 1M version as
> >> >>> >> >> well as the 10 million rating version of the MovieLens dataset.
> >> >>> >> >>
> >> >>> >> >> I have seen previous questions, but they haven't helped yet.
> >> >>> >> >> For example, I tried setting the Spark tmp directory to the EBS
> >> >>> >> >> volume at /vol/, both by editing the spark conf file (and
> >> >>> >> >> copy-dir'ing it to the slaves) as well as through the SparkConf.
> >> >>> >> >> Yet I still get the above error. My current Spark config is
> >> >>> >> >> below; note that I'm launching via ~/spark/bin/spark-submit.
> >> >>> >> >>
> >> >>> >> >> conf = SparkConf()
> >> >>> >> >> conf.setAppName("RecommendALS") \
> >> >>> >> >>     .set("spark.local.dir", "/vol/") \
> >> >>> >> >>     .set("spark.executor.memory", "7g") \
> >> >>> >> >>     .set("spark.akka.frameSize", "100") \
> >> >>> >> >>     .setExecutorEnv("SPARK_JAVA_OPTS", " -Dspark.akka.frameSize=100")
> >> >>> >> >> sc = SparkContext(conf=conf)
> >> >>> >> >>
> >> >>> >> >> Thanks for any advice,
> >> >>> >> >> Chris
> >> >>> >> >>
> >> >>> >>
> >> >>> >
> >> >>
> >> >>
> >> >
> >
> >
>
