Re: hash vs sort shuffle

2014-09-22 Thread Patrick Wendell
Hey Cody,

In terms of Spark 1.1.1 - we wouldn't change a default value in a point
release. Changing this to be the default is slotted for 1.2.0:

https://issues.apache.org/jira/browse/SPARK-3280
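
For anyone who wants to keep the current behavior explicitly in the meantime,
the shuffle implementation can be pinned in the application itself. A minimal
sketch (the app name is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Pin the shuffle implementation instead of relying on the default,
// which is slated to change in 1.2.0.
val conf = new SparkConf()
  .setAppName("shuffle-manager-example") // illustrative
  .set("spark.shuffle.manager", "hash")  // or "sort" to opt in early
val sc = new SparkContext(conf)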

- Patrick

On Mon, Sep 22, 2014 at 9:08 AM, Cody Koeninger  wrote:
> Unfortunately we were somewhat rushed to get things working again and did
> not keep the exact stacktraces, but one of the issues we saw was similar to
> that reported in
>
> https://issues.apache.org/jira/browse/SPARK-3032
>
> We also saw FAILED_TO_UNCOMPRESS errors from snappy when reading the
> shuffle file.
>
>
>
> On Mon, Sep 22, 2014 at 10:54 AM, Sandy Ryza 
> wrote:
>
>> Thanks for the heads up Cody.  Any indication of what was going wrong?
>>
>> On Mon, Sep 22, 2014 at 7:16 AM, Cody Koeninger 
>> wrote:
>>
>>> Just as a heads up, we deployed 471e6a3a of master (in order to get some
>>> sql fixes), and were seeing jobs fail until we set
>>>
>>> spark.shuffle.manager=HASH
>>>
>>> I'd be reluctant to change the default to sort for the 1.1.1 release
>>>
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



OutOfMemoryError on parquet SnappyDecompressor

2014-09-22 Thread Cody Koeninger
After commit 8856c3d8 switched the default parquet compression codec from
gzip to snappy, I'm seeing the following when trying to read parquet files
saved using the new default (same schema and roughly the same size as files
that were previously working):

java.lang.OutOfMemoryError: Direct buffer memory
java.nio.Bits.reserveMemory(Bits.java:658)
java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)

parquet.hadoop.codec.SnappyDecompressor.setInput(SnappyDecompressor.java:99)

parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:43)
java.io.DataInputStream.readFully(DataInputStream.java:195)
java.io.DataInputStream.readFully(DataInputStream.java:169)

parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:201)

parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:521)

parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:493)

parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:546)

parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:339)

parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)

parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)

parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:265)
parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:60)
parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)

parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110)

parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)

parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)

org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:139)

org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
scala.collection.Iterator$class.isEmpty(Iterator.scala:256)
scala.collection.AbstractIterator.isEmpty(Iterator.scala:1157)

org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:220)

org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:219)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)

org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:722)
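
A possible workaround while this is tracked down (a sketch only, assuming the
spark.sql.parquet.compression.codec setting introduced alongside that commit,
with illustrative names) is to write new files with gzip again; raising
-XX:MaxDirectMemorySize on the executors may also buy headroom when reading
existing snappy files:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch: revert the Parquet write codec from the new snappy default to gzip.
val sc = new SparkContext(new SparkConf().setAppName("parquet-codec-workaround"))
val sqlContext = new SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")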


Re: guava version conflicts

2014-09-22 Thread Marcelo Vanzin
FYI I filed SPARK-3647 to track the fix (some people internally have
bumped into this also).

On Mon, Sep 22, 2014 at 1:28 PM, Cody Koeninger  wrote:
> We've worked around it for the meantime by excluding guava from transitive
> dependencies in the job assembly and specifying the same version of guava 14
> that spark is using.  Obviously things break whenever a guava 15 / 16
> feature is used at runtime, so a long term solution is needed.
>
> On Mon, Sep 22, 2014 at 3:13 PM, Marcelo Vanzin  wrote:
>>
>> Hmmm, a quick look at the code indicates this should work for
>> executors, but not for the driver... (maybe this deserves a bug being
>> filed, if there isn't one already?)
>>
>> If it's feasible for you, you could remove the Optional.class file
>> from the Spark assembly you're using.
>>
>> On Mon, Sep 22, 2014 at 12:46 PM, Cody Koeninger 
>> wrote:
>> > We're using Mesos, is there a reasonable expectation that
>> > spark.files.userClassPathFirst will actually work?
>> >
>> > On Mon, Sep 22, 2014 at 1:42 PM, Marcelo Vanzin 
>> > wrote:
>> >>
>> >> Hi Cody,
>> >>
>> >> I'm still writing a test to make sure I understood exactly what's
>> >> going on here, but from looking at the stack trace, it seems like the
>> >> newer Guava library is picking up the "Optional" class from the Spark
>> >> assembly.
>> >>
>> >> Could you try one of the options that put the user's classpath before
>> >> the Spark assembly? (spark.files.userClassPathFirst or
>> >> spark.yarn.user.classpath.first depending on which master you're
>> >> running)
>> >>
>> >> People seem to have run into issues with those options in the past,
>> >> but if they work for you, then Guava should pick its own Optional
>> >> class (instead of the one shipped with Spark) and things should then
>> >> work.
>> >>
>> >> I'll investigate a way to fix it in Spark in the meantime.
>> >>
>> >>
>> >> On Fri, Sep 19, 2014 at 10:30 PM, Cody Koeninger 
>> >> wrote:
>> >> > After the recent spark project changes to guava shading, I'm seeing
>> >> > issues
>> >> > with the datastax spark cassandra connector (which depends on guava
>> >> > 15.0)
>> >> > and the datastax cql driver (which depends on guava 16.0.1)
>> >> >
>> >> > Building an assembly for a job (with spark marked as provided) that
>> >> > includes either guava 15.0 or 16.0.1, results in errors like the
>> >> > following:
>> >> >
>> >> > scala> session.close
>> >> >
>> >> > scala> s[14/09/20 04:56:35 ERROR Futures$CombinedFuture: input future
>> >> > failed.
>> >> > java.lang.IllegalAccessError: tried to access class
>> >> > org.spark-project.guava.common.base.Absent from class
>> >> > com.google.common.base.Optional
>> >> > at com.google.common.base.Optional.absent(Optional.java:79)
>> >> > at
>> >> > com.google.common.base.Optional.fromNullable(Optional.java:94)
>> >> > at
>> >> >
>> >> >
>> >> > com.google.common.util.concurrent.Futures$CombinedFuture.setOneValue(Futures.java:1608)
>> >> > at
>> >> >
>> >> >
>> >> > com.google.common.util.concurrent.Futures$CombinedFuture.access$400(Futures.java:1470)
>> >> > at
>> >> >
>> >> >
>> >> > com.google.common.util.concurrent.Futures$CombinedFuture$2.run(Futures.java:1548)
>> >> > at
>> >> >
>> >> >
>> >> > com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
>> >> > at
>> >> >
>> >> >
>> >> > com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
>> >> > at
>> >> >
>> >> >
>> >> > com.google.common.util.concurrent.ExecutionList.add(ExecutionList.java:101)
>> >> > at
>> >> >
>> >> >
>> >> > com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:170)
>> >> > at
>> >> >
>> >> >
>> >> > com.google.common.util.concurrent.Futures$CombinedFuture.init(Futures.java:1545)
>> >> > at
>> >> >
>> >> >
>> >> > com.google.common.util.concurrent.Futures$CombinedFuture.<init>(Futures.java:1491)
>> >> > at
>> >> >
>> >> > com.google.common.util.concurrent.Futures.listFuture(Futures.java:1640)
>> >> > at
>> >> > com.google.common.util.concurrent.Futures.allAsList(Futures.java:983)
>> >> > at
>> >> >
>> >> >
>> >> > com.datastax.driver.core.CloseFuture$Forwarding.<init>(CloseFuture.java:73)
>> >> > at
>> >> >
>> >> >
>> >> > com.datastax.driver.core.HostConnectionPool.closeAsync(HostConnectionPool.java:398)
>> >> > at
>> >> >
>> >> >
>> >> > com.datastax.driver.core.SessionManager.closeAsync(SessionManager.java:157)
>> >> > at
>> >> >
>> >> > com.datastax.driver.core.SessionManager.close(SessionManager.java:172)
>> >> > at
>> >> >
>> >> >
>> >> > com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$destroySession(CassandraConnector.scala:180)
>> >> > at
>> >> >
>> >> >
>> >> > com.datastax.spark.connector.cql.CassandraConnector$$anonfun$5.apply(CassandraConnector.scala:151)
>> >

Re: guava version conflicts

2014-09-22 Thread Cody Koeninger
We've worked around it for the meantime by excluding guava from transitive
dependencies in the job assembly and specifying the same version of guava
14 that spark is using.  Obviously things break whenever a guava 15 / 16
feature is used at runtime, so a long term solution is needed.
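
A rough build.sbt sketch of that workaround (artifact coordinates and versions
are illustrative, not taken from this thread):

// build.sbt sketch; coordinates and versions are illustrative.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
  // Keep the connector but drop its newer Guava from the assembly.
  ("com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-alpha3")
    .exclude("com.google.guava", "guava"),
  // Pin the same Guava 14 that the Spark assembly ships with.
  "com.google.guava" % "guava" % "14.0.1"
)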

On Mon, Sep 22, 2014 at 3:13 PM, Marcelo Vanzin  wrote:

> Hmmm, a quick look at the code indicates this should work for
> executors, but not for the driver... (maybe this deserves a bug being
> filed, if there isn't one already?)
>
> If it's feasible for you, you could remove the Optional.class file
> from the Spark assembly you're using.
>
> On Mon, Sep 22, 2014 at 12:46 PM, Cody Koeninger 
> wrote:
> > We're using Mesos, is there a reasonable expectation that
> > spark.files.userClassPathFirst will actually work?
> >
> > On Mon, Sep 22, 2014 at 1:42 PM, Marcelo Vanzin 
> wrote:
> >>
> >> Hi Cody,
> >>
> >> I'm still writing a test to make sure I understood exactly what's
> >> going on here, but from looking at the stack trace, it seems like the
> >> newer Guava library is picking up the "Optional" class from the Spark
> >> assembly.
> >>
> >> Could you try one of the options that put the user's classpath before
> >> the Spark assembly? (spark.files.userClassPathFirst or
> >> spark.yarn.user.classpath.first depending on which master you're
> >> running)
> >>
> >> People seem to have run into issues with those options in the past,
> >> but if they work for you, then Guava should pick its own Optional
> >> class (instead of the one shipped with Spark) and things should then
> >> work.
> >>
> >> I'll investigate a way to fix it in Spark in the meantime.
> >>
> >>
> >> On Fri, Sep 19, 2014 at 10:30 PM, Cody Koeninger 
> >> wrote:
> >> > After the recent spark project changes to guava shading, I'm seeing
> >> > issues
> >> > with the datastax spark cassandra connector (which depends on guava
> >> > 15.0)
> >> > and the datastax cql driver (which depends on guava 16.0.1)
> >> >
> >> > Building an assembly for a job (with spark marked as provided) that
> >> > includes either guava 15.0 or 16.0.1, results in errors like the
> >> > following:
> >> >
> >> > scala> session.close
> >> >
> >> > scala> s[14/09/20 04:56:35 ERROR Futures$CombinedFuture: input future
> >> > failed.
> >> > java.lang.IllegalAccessError: tried to access class
> >> > org.spark-project.guava.common.base.Absent from class
> >> > com.google.common.base.Optional
> >> > at com.google.common.base.Optional.absent(Optional.java:79)
> >> > at
> >> > com.google.common.base.Optional.fromNullable(Optional.java:94)
> >> > at
> >> >
> >> >
> com.google.common.util.concurrent.Futures$CombinedFuture.setOneValue(Futures.java:1608)
> >> > at
> >> >
> >> >
> com.google.common.util.concurrent.Futures$CombinedFuture.access$400(Futures.java:1470)
> >> > at
> >> >
> >> >
> com.google.common.util.concurrent.Futures$CombinedFuture$2.run(Futures.java:1548)
> >> > at
> >> >
> >> >
> com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
> >> > at
> >> >
> >> >
> com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
> >> > at
> >> >
> >> >
> com.google.common.util.concurrent.ExecutionList.add(ExecutionList.java:101)
> >> > at
> >> >
> >> >
> com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:170)
> >> > at
> >> >
> >> >
> com.google.common.util.concurrent.Futures$CombinedFuture.init(Futures.java:1545)
> >> > at
> >> >
> >> >
> com.google.common.util.concurrent.Futures$CombinedFuture.<init>(Futures.java:1491)
> >> > at
> >> >
> com.google.common.util.concurrent.Futures.listFuture(Futures.java:1640)
> >> > at
> >> > com.google.common.util.concurrent.Futures.allAsList(Futures.java:983)
> >> > at
> >> >
> >> >
> com.datastax.driver.core.CloseFuture$Forwarding.<init>(CloseFuture.java:73)
> >> > at
> >> >
> >> >
> com.datastax.driver.core.HostConnectionPool.closeAsync(HostConnectionPool.java:398)
> >> > at
> >> >
> >> >
> com.datastax.driver.core.SessionManager.closeAsync(SessionManager.java:157)
> >> > at
> >> > com.datastax.driver.core.SessionManager.close(SessionManager.java:172)
> >> > at
> >> >
> >> >
> com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$destroySession(CassandraConnector.scala:180)
> >> > at
> >> >
> >> >
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$5.apply(CassandraConnector.scala:151)
> >> > at
> >> >
> >> >
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$5.apply(CassandraConnector.scala:151)
> >> > at com.datastax.spark.connector.cql.RefCountedCache.com
> >> >
> >> >
> $datastax$spark$connector$cql$RefCountedCache$$releaseImmediately(RefCountedCache.scala:86)
> >> > at
> >> >
> >> >
> com.datastax.spark.connect

Re: guava version conflicts

2014-09-22 Thread Marcelo Vanzin
Hmmm, a quick look at the code indicates this should work for
executors, but not for the driver... (maybe this deserves a bug being
filed, if there isn't one already?)

If it's feasible for you, you could remove the Optional.class file
from the Spark assembly you're using.

On Mon, Sep 22, 2014 at 12:46 PM, Cody Koeninger  wrote:
> We're using Mesos, is there a reasonable expectation that
> spark.files.userClassPathFirst will actually work?
>
> On Mon, Sep 22, 2014 at 1:42 PM, Marcelo Vanzin  wrote:
>>
>> Hi Cody,
>>
>> I'm still writing a test to make sure I understood exactly what's
>> going on here, but from looking at the stack trace, it seems like the
>> newer Guava library is picking up the "Optional" class from the Spark
>> assembly.
>>
>> Could you try one of the options that put the user's classpath before
>> the Spark assembly? (spark.files.userClassPathFirst or
>> spark.yarn.user.classpath.first depending on which master you're
>> running)
>>
>> People seem to have run into issues with those options in the past,
>> but if they work for you, then Guava should pick its own Optional
>> class (instead of the one shipped with Spark) and things should then
>> work.
>>
>> I'll investigate a way to fix it in Spark in the meantime.
>>
>>
>> On Fri, Sep 19, 2014 at 10:30 PM, Cody Koeninger 
>> wrote:
>> > After the recent spark project changes to guava shading, I'm seeing
>> > issues
>> > with the datastax spark cassandra connector (which depends on guava
>> > 15.0)
>> > and the datastax cql driver (which depends on guava 16.0.1)
>> >
>> > Building an assembly for a job (with spark marked as provided) that
>> > includes either guava 15.0 or 16.0.1, results in errors like the
>> > following:
>> >
>> > scala> session.close
>> >
>> > scala> s[14/09/20 04:56:35 ERROR Futures$CombinedFuture: input future
>> > failed.
>> > java.lang.IllegalAccessError: tried to access class
>> > org.spark-project.guava.common.base.Absent from class
>> > com.google.common.base.Optional
>> > at com.google.common.base.Optional.absent(Optional.java:79)
>> > at
>> > com.google.common.base.Optional.fromNullable(Optional.java:94)
>> > at
>> >
>> > com.google.common.util.concurrent.Futures$CombinedFuture.setOneValue(Futures.java:1608)
>> > at
>> >
>> > com.google.common.util.concurrent.Futures$CombinedFuture.access$400(Futures.java:1470)
>> > at
>> >
>> > com.google.common.util.concurrent.Futures$CombinedFuture$2.run(Futures.java:1548)
>> > at
>> >
>> > com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
>> > at
>> >
>> > com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
>> > at
>> >
>> > com.google.common.util.concurrent.ExecutionList.add(ExecutionList.java:101)
>> > at
>> >
>> > com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:170)
>> > at
>> >
>> > com.google.common.util.concurrent.Futures$CombinedFuture.init(Futures.java:1545)
>> > at
>> >
>> > com.google.common.util.concurrent.Futures$CombinedFuture.<init>(Futures.java:1491)
>> > at
>> > com.google.common.util.concurrent.Futures.listFuture(Futures.java:1640)
>> > at
>> > com.google.common.util.concurrent.Futures.allAsList(Futures.java:983)
>> > at
>> >
>> > com.datastax.driver.core.CloseFuture$Forwarding.<init>(CloseFuture.java:73)
>> > at
>> >
>> > com.datastax.driver.core.HostConnectionPool.closeAsync(HostConnectionPool.java:398)
>> > at
>> >
>> > com.datastax.driver.core.SessionManager.closeAsync(SessionManager.java:157)
>> > at
>> > com.datastax.driver.core.SessionManager.close(SessionManager.java:172)
>> > at
>> >
>> > com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$destroySession(CassandraConnector.scala:180)
>> > at
>> >
>> > com.datastax.spark.connector.cql.CassandraConnector$$anonfun$5.apply(CassandraConnector.scala:151)
>> > at
>> >
>> > com.datastax.spark.connector.cql.CassandraConnector$$anonfun$5.apply(CassandraConnector.scala:151)
>> > at com.datastax.spark.connector.cql.RefCountedCache.com
>> >
>> > $datastax$spark$connector$cql$RefCountedCache$$releaseImmediately(RefCountedCache.scala:86)
>> > at
>> >
>> > com.datastax.spark.connector.cql.RefCountedCache$ReleaseTask.run(RefCountedCache.scala:26)
>> > at
>> >
>> > com.datastax.spark.connector.cql.RefCountedCache$$anonfun$com$datastax$spark$connector$cql$RefCountedCache$$processPendingReleases$2.apply(RefCountedCache.scala:150)
>> > at
>> >
>> > com.datastax.spark.connector.cql.RefCountedCache$$anonfun$com$datastax$spark$connector$cql$RefCountedCache$$processPendingReleases$2.apply(RefCountedCache.scala:147)
>> > at
>> >
>> > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>> > at scala.collection.I

Re: guava version conflicts

2014-09-22 Thread Cody Koeninger
We're using Mesos, is there a reasonable expectation that
spark.files.userClassPathFirst will actually work?

On Mon, Sep 22, 2014 at 1:42 PM, Marcelo Vanzin  wrote:

> Hi Cody,
>
> I'm still writing a test to make sure I understood exactly what's
> going on here, but from looking at the stack trace, it seems like the
> newer Guava library is picking up the "Optional" class from the Spark
> assembly.
>
> Could you try one of the options that put the user's classpath before
> the Spark assembly? (spark.files.userClassPathFirst or
> spark.yarn.user.classpath.first depending on which master you're
> running)
>
> People seem to have run into issues with those options in the past,
> but if they work for you, then Guava should pick its own Optional
> class (instead of the one shipped with Spark) and things should then
> work.
>
> I'll investigate a way to fix it in Spark in the meantime.
>
>
> On Fri, Sep 19, 2014 at 10:30 PM, Cody Koeninger 
> wrote:
> > After the recent spark project changes to guava shading, I'm seeing
> issues
> > with the datastax spark cassandra connector (which depends on guava 15.0)
> > and the datastax cql driver (which depends on guava 16.0.1)
> >
> > Building an assembly for a job (with spark marked as provided) that
> > includes either guava 15.0 or 16.0.1, results in errors like the
> following:
> >
> > scala> session.close
> >
> > scala> s[14/09/20 04:56:35 ERROR Futures$CombinedFuture: input future
> > failed.
> > java.lang.IllegalAccessError: tried to access class
> > org.spark-project.guava.common.base.Absent from class
> > com.google.common.base.Optional
> > at com.google.common.base.Optional.absent(Optional.java:79)
> > at com.google.common.base.Optional.fromNullable(Optional.java:94)
> > at
> >
> com.google.common.util.concurrent.Futures$CombinedFuture.setOneValue(Futures.java:1608)
> > at
> >
> com.google.common.util.concurrent.Futures$CombinedFuture.access$400(Futures.java:1470)
> > at
> >
> com.google.common.util.concurrent.Futures$CombinedFuture$2.run(Futures.java:1548)
> > at
> >
> com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
> > at
> >
> com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
> > at
> >
> com.google.common.util.concurrent.ExecutionList.add(ExecutionList.java:101)
> > at
> >
> com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:170)
> > at
> >
> com.google.common.util.concurrent.Futures$CombinedFuture.init(Futures.java:1545)
> > at
> >
> com.google.common.util.concurrent.Futures$CombinedFuture.<init>(Futures.java:1491)
> > at
> > com.google.common.util.concurrent.Futures.listFuture(Futures.java:1640)
> > at
> > com.google.common.util.concurrent.Futures.allAsList(Futures.java:983)
> > at
> >
> com.datastax.driver.core.CloseFuture$Forwarding.<init>(CloseFuture.java:73)
> > at
> >
> com.datastax.driver.core.HostConnectionPool.closeAsync(HostConnectionPool.java:398)
> > at
> >
> com.datastax.driver.core.SessionManager.closeAsync(SessionManager.java:157)
> > at
> > com.datastax.driver.core.SessionManager.close(SessionManager.java:172)
> > at
> >
> com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$destroySession(CassandraConnector.scala:180)
> > at
> >
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$5.apply(CassandraConnector.scala:151)
> > at
> >
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$5.apply(CassandraConnector.scala:151)
> > at com.datastax.spark.connector.cql.RefCountedCache.com
> >
> $datastax$spark$connector$cql$RefCountedCache$$releaseImmediately(RefCountedCache.scala:86)
> > at
> >
> com.datastax.spark.connector.cql.RefCountedCache$ReleaseTask.run(RefCountedCache.scala:26)
> > at
> >
> com.datastax.spark.connector.cql.RefCountedCache$$anonfun$com$datastax$spark$connector$cql$RefCountedCache$$processPendingReleases$2.apply(RefCountedCache.scala:150)
> > at
> >
> com.datastax.spark.connector.cql.RefCountedCache$$anonfun$com$datastax$spark$connector$cql$RefCountedCache$$processPendingReleases$2.apply(RefCountedCache.scala:147)
> > at
> >
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> > at
> > scala.collection.concurrent.TrieMapIterator.foreach(TrieMap.scala:922)
> > at
> > scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> > at scala.collection.concurrent.TrieMap.foreach(TrieMap.scala:632)
> > at
> >
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> > at com.datastax.spark.connector.cql.RefCountedCache.com
> >
> $datastax$spark$connector$cql$Re

Re: Support for Hive buckets

2014-09-22 Thread Michael Armbrust
Hi Cody,

There are currently no concrete plans for adding buckets to Spark SQL, but
that's mostly due to lack of resources / demand for this feature.  Adding
full support is probably a fair amount of work since you'd have to make
changes throughout parsing/optimization/execution.  That said, there are
probably some smaller tasks that could be easier (for example, you might be
able to avoid a shuffle when doing joins on tables that are already
bucketed by exposing more metastore information to the planner).

Michael

On Sun, Sep 14, 2014 at 3:10 PM, Cody Koeninger  wrote:

> I noticed that the release notes for 1.1.0 said that spark doesn't support
> Hive buckets "yet".  I didn't notice any jira issues related to adding
> support.
>
> Broadly speaking, what would be involved in supporting buckets, especially
> the bucketmapjoin and sortedmerge optimizations?
>


Re: guava version conflicts

2014-09-22 Thread Marcelo Vanzin
Hi Cody,

I'm still writing a test to make sure I understood exactly what's
going on here, but from looking at the stack trace, it seems like the
newer Guava library is picking up the "Optional" class from the Spark
assembly.

Could you try one of the options that put the user's classpath before
the Spark assembly? (spark.files.userClassPathFirst or
spark.yarn.user.classpath.first depending on which master you're
running)

People seem to have run into issues with those options in the past,
but if they work for you, then Guava should pick its own Optional
class (instead of the one shipped with Spark) and things should then
work.

I'll investigate a way to fix it in Spark in the meantime.
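
For reference, a minimal sketch of setting the first of those options
programmatically (the same value can go into spark-defaults.conf; the app name
is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: ask executors to load the user's jars before the Spark assembly, so
// the application's Guava (and its Optional) wins over the one in the assembly.
val conf = new SparkConf()
  .setAppName("guava-classpath-first") // illustrative
  .set("spark.files.userClassPathFirst", "true")
val sc = new SparkContext(conf)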


On Fri, Sep 19, 2014 at 10:30 PM, Cody Koeninger  wrote:
> After the recent spark project changes to guava shading, I'm seeing issues
> with the datastax spark cassandra connector (which depends on guava 15.0)
> and the datastax cql driver (which depends on guava 16.0.1)
>
> Building an assembly for a job (with spark marked as provided) that
> includes either guava 15.0 or 16.0.1, results in errors like the following:
>
> scala> session.close
>
> scala> s[14/09/20 04:56:35 ERROR Futures$CombinedFuture: input future
> failed.
> java.lang.IllegalAccessError: tried to access class
> org.spark-project.guava.common.base.Absent from class
> com.google.common.base.Optional
> at com.google.common.base.Optional.absent(Optional.java:79)
> at com.google.common.base.Optional.fromNullable(Optional.java:94)
> at
> com.google.common.util.concurrent.Futures$CombinedFuture.setOneValue(Futures.java:1608)
> at
> com.google.common.util.concurrent.Futures$CombinedFuture.access$400(Futures.java:1470)
> at
> com.google.common.util.concurrent.Futures$CombinedFuture$2.run(Futures.java:1548)
> at
> com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
> at
> com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
> at
> com.google.common.util.concurrent.ExecutionList.add(ExecutionList.java:101)
> at
> com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:170)
> at
> com.google.common.util.concurrent.Futures$CombinedFuture.init(Futures.java:1545)
> at
> com.google.common.util.concurrent.Futures$CombinedFuture.<init>(Futures.java:1491)
> at
> com.google.common.util.concurrent.Futures.listFuture(Futures.java:1640)
> at
> com.google.common.util.concurrent.Futures.allAsList(Futures.java:983)
> at
> com.datastax.driver.core.CloseFuture$Forwarding.<init>(CloseFuture.java:73)
> at
> com.datastax.driver.core.HostConnectionPool.closeAsync(HostConnectionPool.java:398)
> at
> com.datastax.driver.core.SessionManager.closeAsync(SessionManager.java:157)
> at
> com.datastax.driver.core.SessionManager.close(SessionManager.java:172)
> at
> com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$destroySession(CassandraConnector.scala:180)
> at
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$5.apply(CassandraConnector.scala:151)
> at
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$5.apply(CassandraConnector.scala:151)
> at com.datastax.spark.connector.cql.RefCountedCache.com
> $datastax$spark$connector$cql$RefCountedCache$$releaseImmediately(RefCountedCache.scala:86)
> at
> com.datastax.spark.connector.cql.RefCountedCache$ReleaseTask.run(RefCountedCache.scala:26)
> at
> com.datastax.spark.connector.cql.RefCountedCache$$anonfun$com$datastax$spark$connector$cql$RefCountedCache$$processPendingReleases$2.apply(RefCountedCache.scala:150)
> at
> com.datastax.spark.connector.cql.RefCountedCache$$anonfun$com$datastax$spark$connector$cql$RefCountedCache$$processPendingReleases$2.apply(RefCountedCache.scala:147)
> at
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at
> scala.collection.concurrent.TrieMapIterator.foreach(TrieMap.scala:922)
> at
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.concurrent.TrieMap.foreach(TrieMap.scala:632)
> at
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> at com.datastax.spark.connector.cql.RefCountedCache.com
> $datastax$spark$connector$cql$RefCountedCache$$processPendingReleases(RefCountedCache.scala:147)
> at
> com.datastax.spark.connector.cql.RefCountedCache$$anon$1.run(RefCountedCache.scala:157)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
> at java.util.concurrent.FutureTask.runAndR

Re: A couple questions about shared variables

2014-09-22 Thread Nan Zhu
I see, thanks for pointing this out  


--  
Nan Zhu


On Monday, September 22, 2014 at 12:08 PM, Sandy Ryza wrote:

> MapReduce counters do not count duplications.  In MapReduce, if a task needs 
> to be re-run, the value of the counter from the second task overwrites the 
> value from the first task.
>  
> -Sandy
>  
> On Mon, Sep 22, 2014 at 4:55 AM, Nan Zhu  (mailto:zhunanmcg...@gmail.com)> wrote:
> > If you think it as necessary to fix, I would like to resubmit that PR 
> > (seems to have some conflicts with the current DAGScheduler)  
> >  
> > My suggestion is to make it as an option in accumulator, e.g. some 
> > algorithms utilizing accumulator for result calculation, it needs a 
> > deterministic accumulator, while others implementing something like Hadoop 
> > counters may need the current implementation (count everything happened, 
> > including the duplications)
> >  
> > Your thoughts?  
> >  
> > --  
> > Nan Zhu
> >  
> >  
> > On Sunday, September 21, 2014 at 6:35 PM, Matei Zaharia wrote:
> >  
> > > Hmm, good point, this seems to have been broken by refactorings of the 
> > > scheduler, but it worked in the past. Basically the solution is simple -- 
> > > in a result stage, we should not apply the update for each task ID more 
> > > than once -- the same way we don't call job.listener.taskSucceeded more 
> > > than once. Your PR also tried to avoid this for resubmitted shuffle 
> > > stages, but I don't think we need to do that necessarily (though we 
> > > could).
> > >  
> > > Matei  
> > >  
> > > On September 21, 2014 at 1:11:13 PM, Nan Zhu (zhunanmcg...@gmail.com 
> > > (mailto:zhunanmcg...@gmail.com)) wrote:
> > >  
> > > > Hi, Matei,  
> > > >  
> > > > Can you give some hint on how the current implementation guarantee the 
> > > > accumulator is only applied for once?  
> > > >  
> > > > There is a pending PR trying to achieving this 
> > > > (https://github.com/apache/spark/pull/228/files), but from the current 
> > > > implementation, I didn’t see this has been done? (maybe I missed 
> > > > something)  
> > > >  
> > > > Best,  
> > > >  
> > > > --   
> > > > Nan Zhu
> > > >  
> > > >  
> > > > On Sunday, September 21, 2014 at 1:10 AM, Matei Zaharia wrote:
> > > >  
> > > > > Hey Sandy,
> > > > >  
> > > > > On September 20, 2014 at 8:50:54 AM, Sandy Ryza 
> > > > > (sandy.r...@cloudera.com (mailto:sandy.r...@cloudera.com)) wrote:  
> > > > >  
> > > > > Hey All,   
> > > > >  
> > > > > A couple questions came up about shared variables recently, and I 
> > > > > wanted to   
> > > > > confirm my understanding and update the doc to be a little more 
> > > > > clear.  
> > > > >  
> > > > > *Broadcast variables*   
> > > > > Now that tasks data is automatically broadcast, the only occasions 
> > > > > where it  
> > > > > makes sense to explicitly broadcast are:  
> > > > > * You want to use a variable from tasks in multiple stages.  
> > > > > * You want to have the variable stored on the executors in 
> > > > > deserialized  
> > > > > form.  
> > > > > * You want tasks to be able to modify the variable and have those  
> > > > > modifications take effect for other tasks running on the same 
> > > > > executor  
> > > > > (usually a very bad idea).  
> > > > >  
> > > > > Is that right?   
> > > > > Yeah, pretty much. Reason 1 above is probably the biggest, but 2 also 
> > > > > matters. (We might later factor tasks in a different way to avoid 2, 
> > > > > but it's hard due to things like Hadoop JobConf objects in the tasks).
> > > > >  
> > > > >  
> > > > > *Accumulators*   
> > > > > Values are only counted for successful tasks. Is that right? KMeans 
> > > > > seems  
> > > > > to use it in this way. What happens if a node goes away and 
> > > > > successful  
> > > > > tasks need to be resubmitted? Or the stage runs again because a 
> > > > > different  
> > > > > job needed it.  
> > > > > Accumulators are guaranteed to give a deterministic result if you 
> > > > > only increment them in actions. For each result stage, the 
> > > > > accumulator's update from each task is only applied once, even if 
> > > > > that task runs multiple times. If you use accumulators in 
> > > > > transformations (i.e. in a stage that may be part of multiple jobs), 
> > > > > then you may see multiple updates, from each run. This is kind of 
> > > > > confusing but it was useful for people who wanted to use these for 
> > > > > debugging.
> > > > >  
> > > > > Matei  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > > thanks,   
> > > > > Sandy  
> > > > >  
> > > > >  
> > > > >  
> > > >  
> > > >  
> >  
>  
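
To make the accumulator semantics Matei describes above concrete, a small
sketch (made-up data, 1.x accumulator API):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("accumulator-sketch"))
val data = sc.parallelize(1 to 100)

// Incremented only in an action: each task's update is applied once per
// result stage, so the final value is deterministic (100).
val inAction = sc.accumulator(0)
data.foreach(_ => inAction += 1)

// Incremented in a transformation: the map stage re-runs for the second
// count() because nothing is cached, so updates are applied again.
val inTransformation = sc.accumulator(0)
val doubled = data.map { x => inTransformation += 1; x * 2 }
doubled.count()
doubled.count()  // inTransformation can now be well above 100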



spark_classpath in core/pom.xml and yarn/pom.xml

2014-09-22 Thread Ye Xianjin
Hi:
I notice the scalatest-maven-plugin sets the SPARK_CLASSPATH environment 
variable for testing. But in SparkConf.scala, this is deprecated in Spark 
1.0+.
So what is this variable for? Should we just remove it?


-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)



Re: guava version conflicts

2014-09-22 Thread Gary Malouf
Hi Marcelo,

Interested to hear the approach to be taken.  Shading guava itself seems
extreme, but that might make sense.

Gary

On Sat, Sep 20, 2014 at 9:38 PM, Marcelo Vanzin  wrote:

> Hmm, looks like the hack to maintain backwards compatibility in the
> Java API didn't work that well. I'll take a closer look at this when I
> get to work on Monday.
>
> On Fri, Sep 19, 2014 at 10:30 PM, Cody Koeninger 
> wrote:
> > After the recent spark project changes to guava shading, I'm seeing
> issues
> > with the datastax spark cassandra connector (which depends on guava 15.0)
> > and the datastax cql driver (which depends on guava 16.0.1)
> >
> > Building an assembly for a job (with spark marked as provided) that
> > includes either guava 15.0 or 16.0.1, results in errors like the
> following:
> >
> > scala> session.close
> >
> > scala> s[14/09/20 04:56:35 ERROR Futures$CombinedFuture: input future
> > failed.
> > java.lang.IllegalAccessError: tried to access class
> > org.spark-project.guava.common.base.Absent from class
> > com.google.common.base.Optional
> > at com.google.common.base.Optional.absent(Optional.java:79)
> > at com.google.common.base.Optional.fromNullable(Optional.java:94)
> > at
> >
> com.google.common.util.concurrent.Futures$CombinedFuture.setOneValue(Futures.java:1608)
> > at
> >
> com.google.common.util.concurrent.Futures$CombinedFuture.access$400(Futures.java:1470)
> > at
> >
> com.google.common.util.concurrent.Futures$CombinedFuture$2.run(Futures.java:1548)
> > at
> >
> com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
> > at
> >
> com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
> > at
> >
> com.google.common.util.concurrent.ExecutionList.add(ExecutionList.java:101)
> > at
> >
> com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:170)
> > at
> >
> com.google.common.util.concurrent.Futures$CombinedFuture.init(Futures.java:1545)
> > at
> >
> com.google.common.util.concurrent.Futures$CombinedFuture.<init>(Futures.java:1491)
> > at
> > com.google.common.util.concurrent.Futures.listFuture(Futures.java:1640)
> > at
> > com.google.common.util.concurrent.Futures.allAsList(Futures.java:983)
> > at
> >
> com.datastax.driver.core.CloseFuture$Forwarding.<init>(CloseFuture.java:73)
> > at
> >
> com.datastax.driver.core.HostConnectionPool.closeAsync(HostConnectionPool.java:398)
> > at
> >
> com.datastax.driver.core.SessionManager.closeAsync(SessionManager.java:157)
> > at
> > com.datastax.driver.core.SessionManager.close(SessionManager.java:172)
> > at
> >
> com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$destroySession(CassandraConnector.scala:180)
> > at
> >
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$5.apply(CassandraConnector.scala:151)
> > at
> >
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$5.apply(CassandraConnector.scala:151)
> > at com.datastax.spark.connector.cql.RefCountedCache.com
> >
> $datastax$spark$connector$cql$RefCountedCache$$releaseImmediately(RefCountedCache.scala:86)
> > at
> >
> com.datastax.spark.connector.cql.RefCountedCache$ReleaseTask.run(RefCountedCache.scala:26)
> > at
> >
> com.datastax.spark.connector.cql.RefCountedCache$$anonfun$com$datastax$spark$connector$cql$RefCountedCache$$processPendingReleases$2.apply(RefCountedCache.scala:150)
> > at
> >
> com.datastax.spark.connector.cql.RefCountedCache$$anonfun$com$datastax$spark$connector$cql$RefCountedCache$$processPendingReleases$2.apply(RefCountedCache.scala:147)
> > at
> >
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> > at
> > scala.collection.concurrent.TrieMapIterator.foreach(TrieMap.scala:922)
> > at
> > scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> > at scala.collection.concurrent.TrieMap.foreach(TrieMap.scala:632)
> > at
> >
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> > at com.datastax.spark.connector.cql.RefCountedCache.com
> >
> $datastax$spark$connector$cql$RefCountedCache$$processPendingReleases(RefCountedCache.scala:147)
> > at
> >
> com.datastax.spark.connector.cql.RefCountedCache$$anon$1.run(RefCountedCache.scala:157)
> > at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> > at
> >
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
> > at
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
> > at
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301

Re: hash vs sort shuffle

2014-09-22 Thread Cody Koeninger
Unfortunately we were somewhat rushed to get things working again and did
not keep the exact stacktraces, but one of the issues we saw was similar to
that reported in

https://issues.apache.org/jira/browse/SPARK-3032

We also saw FAILED_TO_UNCOMPRESS errors from snappy when reading the
shuffle file.



On Mon, Sep 22, 2014 at 10:54 AM, Sandy Ryza 
wrote:

> Thanks for the heads up Cody.  Any indication of what was going wrong?
>
> On Mon, Sep 22, 2014 at 7:16 AM, Cody Koeninger 
> wrote:
>
>> Just as a heads up, we deployed 471e6a3a of master (in order to get some
>> sql fixes), and were seeing jobs fail until we set
>>
>> spark.shuffle.manager=HASH
>>
>> I'd be reluctant to change the default to sort for the 1.1.1 release
>>
>
>


Re: A couple questions about shared variables

2014-09-22 Thread Sandy Ryza
MapReduce counters do not count duplications.  In MapReduce, if a task
needs to be re-run, the value of the counter from the second task
overwrites the value from the first task.

-Sandy

On Mon, Sep 22, 2014 at 4:55 AM, Nan Zhu  wrote:

>  If you think it as necessary to fix, I would like to resubmit that PR
> (seems to have some conflicts with the current DAGScheduler)
>
> My suggestion is to make it as an option in accumulator, e.g. some
> algorithms utilizing accumulator for result calculation, it needs a
> deterministic accumulator, while others implementing something like Hadoop
> counters may need the current implementation (count everything happened,
> including the duplications)
>
> Your thoughts?
>
> --
> Nan Zhu
>
> On Sunday, September 21, 2014 at 6:35 PM, Matei Zaharia wrote:
>
> Hmm, good point, this seems to have been broken by refactorings of the
> scheduler, but it worked in the past. Basically the solution is simple --
> in a result stage, we should not apply the update for each task ID more
> than once -- the same way we don't call job.listener.taskSucceeded more
> than once. Your PR also tried to avoid this for resubmitted shuffle stages,
> but I don't think we need to do that necessarily (though we could).
>
> Matei
>
> On September 21, 2014 at 1:11:13 PM, Nan Zhu (zhunanmcg...@gmail.com)
> wrote:
>
> Hi, Matei,
>
> Can you give some hint on how the current implementation guarantee the
> accumulator is only applied for once?
>
> There is a pending PR trying to achieving this (
> https://github.com/apache/spark/pull/228/files), but from the current
> implementation, I didn’t see this has been done? (maybe I missed something)
>
> Best,
>
> --
> Nan Zhu
>
> On Sunday, September 21, 2014 at 1:10 AM, Matei Zaharia wrote:
>
>  Hey Sandy,
>
> On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com)
> wrote:
>
> Hey All,
>
> A couple questions came up about shared variables recently, and I wanted
> to
> confirm my understanding and update the doc to be a little more clear.
>
> *Broadcast variables*
> Now that tasks data is automatically broadcast, the only occasions where
> it
> makes sense to explicitly broadcast are:
> * You want to use a variable from tasks in multiple stages.
> * You want to have the variable stored on the executors in deserialized
> form.
> * You want tasks to be able to modify the variable and have those
> modifications take effect for other tasks running on the same executor
> (usually a very bad idea).
>
> Is that right?
> Yeah, pretty much. Reason 1 above is probably the biggest, but 2 also
> matters. (We might later factor tasks in a different way to avoid 2, but
> it's hard due to things like Hadoop JobConf objects in the tasks).
>
>
> *Accumulators*
> Values are only counted for successful tasks. Is that right? KMeans seems
> to use it in this way. What happens if a node goes away and successful
> tasks need to be resubmitted? Or the stage runs again because a different
> job needed it.
> Accumulators are guaranteed to give a deterministic result if you only
> increment them in actions. For each result stage, the accumulator's update
> from each task is only applied once, even if that task runs multiple times.
> If you use accumulators in transformations (i.e. in a stage that may be
> part of multiple jobs), then you may see multiple updates, from each run.
> This is kind of confusing but it was useful for people who wanted to use
> these for debugging.
>
> Matei
>
>
>
>
>
> thanks,
> Sandy
>
>
>
>


Re: hash vs sort shuffle

2014-09-22 Thread Sandy Ryza
Thanks for the heads up Cody.  Any indication of what was going wrong?

On Mon, Sep 22, 2014 at 7:16 AM, Cody Koeninger  wrote:

> Just as a heads up, we deployed 471e6a3a of master (in order to get some
> sql fixes), and were seeing jobs fail until we set
>
> spark.shuffle.manager=HASH
>
> I'd be reluctant to change the default to sort for the 1.1.1 release
>


hash vs sort shuffle

2014-09-22 Thread Cody Koeninger
Just as a heads up, we deployed 471e6a3a of master (in order to get some
sql fixes), and were seeing jobs fail until we set

spark.shuffle.manager=HASH

I'd be reluctant to change the default to sort for the 1.1.1 release


Re: A couple questions about shared variables

2014-09-22 Thread Nan Zhu
If you think it is necessary to fix, I would like to resubmit that PR (it seems to 
have some conflicts with the current DAGScheduler).

My suggestion is to make this an option on the accumulator: some algorithms that 
use an accumulator for result calculation need a deterministic accumulator, while 
others implementing something like Hadoop counters may need the current 
implementation (count everything that happened, including the duplications)

Your thoughts?  

--  
Nan Zhu


On Sunday, September 21, 2014 at 6:35 PM, Matei Zaharia wrote:

> Hmm, good point, this seems to have been broken by refactorings of the 
> scheduler, but it worked in the past. Basically the solution is simple -- in 
> a result stage, we should not apply the update for each task ID more than 
> once -- the same way we don't call job.listener.taskSucceeded more than once. 
> Your PR also tried to avoid this for resubmitted shuffle stages, but I don't 
> think we need to do that necessarily (though we could).
>  
> Matei  
>  
> On September 21, 2014 at 1:11:13 PM, Nan Zhu (zhunanmcg...@gmail.com 
> (mailto:zhunanmcg...@gmail.com)) wrote:
>  
> > Hi, Matei,  
> >  
> > Can you give some hint on how the current implementation guarantee the 
> > accumulator is only applied for once?  
> >  
> > There is a pending PR trying to achieving this 
> > (https://github.com/apache/spark/pull/228/files), but from the current 
> > implementation, I didn’t see this has been done? (maybe I missed something) 
> >  
> >  
> > Best,  
> >  
> > --   
> > Nan Zhu
> >  
> >  
> > On Sunday, September 21, 2014 at 1:10 AM, Matei Zaharia wrote:
> >  
> > > Hey Sandy,
> > >  
> > > On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com 
> > > (mailto:sandy.r...@cloudera.com)) wrote:  
> > >  
> > > Hey All,   
> > >  
> > > A couple questions came up about shared variables recently, and I wanted 
> > > to   
> > > confirm my understanding and update the doc to be a little more clear.  
> > >  
> > > *Broadcast variables*   
> > > Now that tasks data is automatically broadcast, the only occasions where 
> > > it  
> > > makes sense to explicitly broadcast are:  
> > > * You want to use a variable from tasks in multiple stages.  
> > > * You want to have the variable stored on the executors in deserialized  
> > > form.  
> > > * You want tasks to be able to modify the variable and have those  
> > > modifications take effect for other tasks running on the same executor  
> > > (usually a very bad idea).  
> > >  
> > > Is that right?   
> > > Yeah, pretty much. Reason 1 above is probably the biggest, but 2 also 
> > > matters. (We might later factor tasks in a different way to avoid 2, but 
> > > it's hard due to things like Hadoop JobConf objects in the tasks).
> > >  
> > >  
> > > *Accumulators*   
> > > Values are only counted for successful tasks. Is that right? KMeans seems 
> > >  
> > > to use it in this way. What happens if a node goes away and successful  
> > > tasks need to be resubmitted? Or the stage runs again because a different 
> > >  
> > > job needed it.  
> > > Accumulators are guaranteed to give a deterministic result if you only 
> > > increment them in actions. For each result stage, the accumulator's 
> > > update from each task is only applied once, even if that task runs 
> > > multiple times. If you use accumulators in transformations (i.e. in a 
> > > stage that may be part of multiple jobs), then you may see multiple 
> > > updates, from each run. This is kind of confusing but it was useful for 
> > > people who wanted to use these for debugging.
> > >  
> > > Matei  
> > >  
> > >  
> > >  
> > >  
> > >  
> > > thanks,   
> > > Sandy  
> > >  
> > >  
> > >  
> >  
> >  
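
A small sketch of the first broadcast use case above (reusing one variable from
tasks in multiple stages); the lookup table is made up:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch"))

// Explicitly broadcast a value that tasks in more than one stage will read.
val lookup = sc.broadcast(Map(0 -> "red", 1 -> "green", 2 -> "blue"))

val rdd = sc.parallelize(1 to 100)
// Stage 1 (map side of the shuffle) reads the broadcast value ...
val keyed = rdd.map(x => (lookup.value(x % 3), 1))
// ... and stage 2 (reduce side) reads the same broadcast value again.
val counts = keyed.reduceByKey(_ + _)
  .map { case (color, n) => s"$color (one of ${lookup.value.size} colors): $n" }
counts.collect().foreach(println)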



Re: Dependency hell in Spark applications

2014-09-22 Thread Aniket Bhatnagar
I have submitted a defect in JIRA for this:
https://issues.apache.org/jira/browse/SPARK-3638 and have submitted a PR (
https://github.com/apache/spark/pull/2489) that temporarily fixes the
issue. Users would have to build Spark with the kinesis-asl profile to get the
compatible httpclient added to the Spark assembly jar.
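
For completeness, the exclusion approach discussed below looks roughly like
this in sbt (versions are illustrative); as noted further down in the thread,
it does not help when the old httpclient classes are already baked into the
Spark assembly:

// build.sbt sketch; versions are illustrative.
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided")
    .exclude("org.apache.httpcomponents", "httpclient"),
  // Depend directly on the newer httpclient the AWS SDK expects.
  "org.apache.httpcomponents" % "httpclient" % "4.2.5"
)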

On 22 September 2014 15:00, 이인규(inQ)  wrote:

> Hello,
>
> In my case, I manually deleted org/apache/http directory in the
> spark-assembly jar file..
> I think if we use the latest version of httpclient (httpcore) library, we
> can resolve the problem.
> How about upgrading httpclient? (or jets3t?)
>
> 2014-09-11 19:09 GMT+09:00 Aniket Bhatnagar :
>
>> Thanks everyone for weighing in on this.
>>
>> I had backported kinesis module from master to spark 1.0.2 so just to
>> confirm if I am not missing anything, I did a dependency graph compare of
>> my spark build with spark-master
>> and org.apache.httpcomponents:httpclient:jar does seem to resolve to 4.1.2
>> dependency.
>>
>> I need Hive so, I can't really do a build without it. Even if I
>> exclude httpclient
>> dependency from my project's build, it will not solve the problem because
>> AWS SDK has been compiled with a greater version of http client. My spark
>> stream project does not uses http client directly. AWS SDK will look for
>>  class org.apache.http.impl.conn.DefaultClientConnectionOperator and it
>> will be loaded from spark-assembly jar regardless of how I package my
>> project (unless I am missing something?). I enabled verbosed classloading
>> to confirm that the class is indeed loading from spark-assembly jar.
>>
>> spark.files.userClassPathFirst option doesn't seem to be working on my
>> spark 1.0.2 build (not sure why).
>>
>> I was only left custom building spark and forcingly introduce latest
>> httpclient's latest version as dependency.
>>
>> Finally, I tested this on 1.1.0-RC4 today and it has the same issue. Has
>> anyone ever been able to get the Kinesis example work with spark-hadoop2.4
>> (with hive and yarn) build? I feel like this is a bug that exists even in
>> 1.1.0.
>>
>> I still believe we need a better solution to address the dependency hell
>> problem. If OSGi is deemed too over the top, what are the solutions being
>> investigated?
>>
>> On 6 September 2014 04:44, Ted Yu  wrote:
>>
>> > From output of dependency:tree:
>> >
>> > [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
>> > spark-streaming_2.10 ---
>> > [INFO] org.apache.spark:spark-streaming_2.10:jar:1.1.0-SNAPSHOT
>> > INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0-SNAPSHOT:compile
>> > [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.4.0:compile
>> > ...
>> > [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.9.0:compile
>> > [INFO] |  |  +- commons-codec:commons-codec:jar:1.5:compile
>> > [INFO] |  |  +- org.apache.httpcomponents:httpclient:jar:4.1.2:compile
>> > [INFO] |  |  +- org.apache.httpcomponents:httpcore:jar:4.1.2:compile
>> >
>> > bq. excluding httpclient from spark-streaming dependency in your
>> > sbt/maven project
>> >
>> > This should work.
>> >
>> >
>> > On Fri, Sep 5, 2014 at 3:14 PM, Tathagata Das <
>> tathagata.das1...@gmail.com
>> > > wrote:
>> >
>> >> If httpClient dependency is coming from Hive, you could build Spark
>> >> without
>> >> Hive. Alternatively, have you tried excluding httpclient from
>> >> spark-streaming dependency in your sbt/maven project?
>> >>
>> >> TD
>> >>
>> >>
>> >>
>> >> On Thu, Sep 4, 2014 at 6:42 AM, Koert Kuipers 
>> wrote:
>> >>
>> >> > custom spark builds should not be the answer. at least not if spark
>> ever
>> >> > wants to have a vibrant community for spark apps.
>> >> >
>> >> > spark does support a user-classpath-first option, which would deal
>> with
>> >> > some of these issues, but I don't think it works.
>> >> > On Sep 4, 2014 9:01 AM, "Felix Garcia Borrego" 
>> >> wrote:
>> >> >
>> >> > > Hi,
>> >> > > I run into the same issue and apart from the ideas Aniket said, I
>> only
>> >> > > could find a nasty workaround. Add my custom
>> >> > PoolingClientConnectionManager
>> >> > > to my classpath.
>> >> > >
>> >> > >
>> >> > >
>> >> >
>> >>
>> http://stackoverflow.com/questions/24788949/nosuchmethoderror-while-running-aws-s3-client-on-spark-while-javap-shows-otherwi/25488955#25488955
>> >> > >
>> >> > >
>> >> > >
>> >> > > On Thu, Sep 4, 2014 at 11:43 AM, Sean Owen 
>> >> wrote:
>> >> > >
>> >> > > > Dumb question -- are you using a Spark build that includes the
>> >> Kinesis
>> >> > > > dependency? that build would have resolved conflicts like this
>> for
>> >> > > > you. Your app would need to use the same version of the Kinesis
>> >> client
>> >> > > > SDK, ideally.
>> >> > > >
>> >> > > > All of these ideas are well-known, yes. In cases of super-common
>> >> > > > dependencies like Guava, they are already shaded. This is a
>> >> > > > less-common source of conflicts so I don't think http-client is
>> >> > > > shaded, especially since it is not used directly by Spark. I
>> think
>> >> > >

Re: BlockManager issues

2014-09-22 Thread Andrew Ash
Another data point on the 1.1.0 FetchFailures:

Running this SQL command works on 1.0.2 but fails on 1.1.0 due to the
exceptions mentioned earlier in this thread: "SELECT stringCol,
SUM(doubleCol) FROM parquetTable GROUP BY stringCol"

The FetchFailure exception has the remote block manager that failed to
produce the shuffle.  I enabled GC logging and repeated, and the
CoarseGrainedExecutorBackend JVM is just pounding in full GCs:

943.047: [Full GC [PSYoungGen: 5708288K->5536188K(8105472K)] [ParOldGen:
20971043K->20971202K(20971520K)] 26679331K->26507390K(29076992K)
[PSPermGen: 52897K->52897K(57344K)], 48.4514680 secs] [Times: user=602.38
sys=4.43, real=48.44 secs]
991.591: [Full GC [PSYoungGen: 5708288K->5591884K(8105472K)] [ParOldGen:
20971202K->20971044K(20971520K)] 26679490K->26562928K(29076992K)
[PSPermGen: 52897K->52897K(56832K)], 51.8109380 secs] [Times: user=645.44
sys=5.03, real=51.81 secs]
1043.431: [Full GC [PSYoungGen: 5708288K->5606238K(8105472K)] [ParOldGen:
20971044K->20971100K(20971520K)] 26679332K->26577339K(29076992K)
[PSPermGen: 52908K->52908K(56320K)], 85.9367800 secs] [Times: user=1074.29
sys=9.49, real=85.92 secs]
1129.419: [Full GC [PSYoungGen: 5708288K->5634246K(8105472K)] [ParOldGen:
20971100K->20971471K(20971520K)] 26679388K->26605717K(29076992K)
[PSPermGen: 52912K->52912K(55808K)], 52.2114100 secs] [Times: user=652.29
sys=4.94, real=52.21 secs]
1181.671: [Full GC [PSYoungGen: 5708288K->5656389K(8105472K)] [ParOldGen:
20971471K->20971125K(20971520K)] 26679759K->26627514K(29076992K)
[PSPermGen: 52961K->52961K(55296K)], 65.3284620 secs] [Times: user=818.58
sys=6.71, real=65.31 secs]
1247.034: [Full GC [PSYoungGen: 5708288K->5672356K(8105472K)] [ParOldGen:
20971125K->20971417K(20971520K)] 26679413K->26643774K(29076992K)
[PSPermGen: 52982K->52982K(54784K)], 91.2656940 secs] [Times: user=1146.94
sys=9.83, real=91.25 secs]
1338.318: [Full GC [PSYoungGen: 5708288K->5683177K(8105472K)] [ParOldGen:
20971417K->20971364K(20971520K)] 26679705K->26654541K(29076992K)
[PSPermGen: 52982K->52982K(54784K)], 68.9840690 secs] [Times: user=866.72
sys=7.31, real=68.97 secs]
1407.319: [Full GC [PSYoungGen: 5708288K->5691352K(8105472K)] [ParOldGen:
20971364K->20971041K(20971520K)] 26679652K->26662394K(29076992K)
[PSPermGen: 52985K->52985K(54272K)], 58.2522860 secs] [Times: user=724.33
sys=5.74, real=58.24 secs]
1465.572: [Full GC [PSYoungGen: 5708288K->5691382K(8105472K)] [ParOldGen:
20971041K->20971041K(20971520K)] 26679329K->26662424K(29076992K)
[PSPermGen: 52986K->52986K(54272K)], 17.8034740 secs] [Times: user=221.43
sys=0.72, real=17.80 secs]
1483.377: [Full GC [PSYoungGen: 5708288K->5691383K(8105472K)] [ParOldGen:
20971041K->20971041K(20971520K)] 26679329K->26662424K(29076992K)
[PSPermGen: 52987K->52987K(54272K)], 64.3194300 secs] [Times: user=800.32
sys=6.65, real=64.31 secs]
1547.700: [Full GC [PSYoungGen: 5708288K->5692228K(8105472K)] [ParOldGen:
20971041K->20971029K(20971520K)] 26679329K->26663257K(29076992K)
[PSPermGen: 52991K->52991K(53760K)], 54.8107170 secs] [Times: user=681.07
sys=5.41, real=54.80 secs]
1602.519: [Full GC [PSYoungGen: 5708288K->5695801K(8105472K)] [ParOldGen:
20971029K->20971401K(20971520K)] 26679317K->26667203K(29076992K)
[PSPermGen: 52993K->52993K(53760K)], 71.7970690 secs] [Times: user=896.22
sys=7.61, real=71.79 secs]
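
For reference, a sketch of how this kind of executor GC logging can be turned
on, assuming the standard spark.executor.extraJavaOptions setting:

import org.apache.spark.SparkConf

// Sketch: verbose GC logging for executors, so full-GC storms like the one
// above show up in the executor logs.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")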



I repeated the job, this time taking jmap -histo snapshots as it went along.  The
last histogram I was able to capture before the JVM locked up (getting a histogram
from a JVM that is stuck in GC storms is very difficult) is here:

 num     #instances         #bytes  class name
----------------------------------------------
   1:      31437598     2779681704  [B
   2:      62794123     1507058952  scala.collection.immutable.$colon$colon
   3:      31387645     1506606960  org.apache.spark.sql.catalyst.expressions.Cast
   4:      31387645     1506606960  org.apache.spark.sql.catalyst.expressions.SumFunction
   5:      31387645     1255505800  org.apache.spark.sql.catalyst.expressions.Literal
   6:      31387645     1255505800  org.apache.spark.sql.catalyst.expressions.Coalesce
   7:      31387645     1255505800  org.apache.spark.sql.catalyst.expressions.MutableLiteral
   8:      31387645     1255505800  org.apache.spark.sql.catalyst.expressions.Add
   9:      31391224     1004519168  java.util.HashMap$Entry
  10:      31402978      756090664  [Ljava.lang.Object;
  11:      31395785      753498840  java.lang.Double
  12:      31387645      753303480  [Lorg.apache.spark.sql.catalyst.expressions.AggregateFunction;
  13:      31395808      502332928  org.apache.spark.sql.catalyst.expressions.GenericRow
  14:      31387645      502202320  org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$5
  15:           772      234947960  [Ljava.util.HashMap$Entry;
  16:           711      106309792  [I
  17:        106747       13673936  
  18:        106747       12942856  
  19:  81869037880  
  20:  81868085000  
  21:2224945339856  scala
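
For anyone trying to reproduce this, a minimal spark-shell sketch of the failing
aggregation is below. The Parquet path is a placeholder; the table and column names are
taken from the query above:

import org.apache.spark.sql.SQLContext

// spark-shell sketch: sc is the shell's SparkContext, the path is a placeholder.
val sqlContext = new SQLContext(sc)
val parquetData = sqlContext.parquetFile("/path/to/parquetTable")
parquetData.registerTempTable("parquetTable")

// Works on 1.0.2; on 1.1.0 this job hits the FetchFailures described above.
sqlContext.sql(
  "SELECT stringCol, SUM(doubleCol) FROM parquetTable GROUP BY stringCol").count()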

FW: Spark SQL 1.1.0: NPE when join two cached table

2014-09-22 Thread Haopu Wang
Forwarding to the dev mailing list for help.

 



From: Haopu Wang 
Sent: September 22, 2014 16:35
To: u...@spark.apache.org
Subject: Spark SQL 1.1.0: NPE when join two cached table

 

I have two data sets and want to join them on the first field of each. Sample data 
are below:

 

data set 1:

  id2,name1,2,300.0

 

data set 2:

  id1,

 

The code is something like below:

 

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._

val sparkConf = new SparkConf().setAppName("JoinInScala")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")

// Left-hand data set: id, name, agg1, agg2
val testdata = sc.textFile(args(0) + "data.txt").map(_.split(","))
  .map(p => Row(p(0), p(1).trim, p(2).trim.toLong, p(3).trim.toDouble))

val fields = new Array[StructField](4)
fields(0) = StructField("id", StringType, false)
fields(1) = StructField("name", StringType, false)
fields(2) = StructField("agg1", LongType, false)
fields(3) = StructField("agg2", DoubleType, false)
val schema = StructType(fields)

val data = sqlContext.applySchema(testdata, schema)
data.registerTempTable("datatable")
sqlContext.cacheTable("datatable")

// Reference data set: id, data
val refdata = sc.textFile(args(0) + "ref.txt").map(_.split(","))
  .map(p => Row(p(0), p(1).trim))

val reffields = new Array[StructField](2)
reffields(0) = StructField("id", StringType, false)
reffields(1) = StructField("data", StringType, true)
val refschema = StructType(reffields)

val refschemardd = sqlContext.applySchema(refdata, refschema)
refschemardd.registerTempTable("ref")
sqlContext.cacheTable("ref")

val results = sqlContext.sql(
  "SELECT d.id, d.name, d.agg1, d.agg2, ref.data FROM datatable as d join ref on d.id = ref.id")
results.foreach(_ => ()) // force evaluation

 

But I got the NullPointerException below. If I comment out the two "cacheTable()" 
calls, the program runs well. Please shed some light on this, thank you!

 

Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.sql.columnar.InMemoryRelation.statistics$lzycompute(InMemoryColumnarTableScan.scala:43)
at org.apache.spark.sql.columnar.InMemoryRelation.statistics(InMemoryColumnarTableScan.scala:42)
at org.apache.spark.sql.execution.SparkStrategies$HashJoin$.apply(SparkStrategies.scala:83)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:268)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:409)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:409)
at org.apache.spark.sql.SchemaRDD.getDependencies(SchemaRDD.scala:120)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:189)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.dependencies(RDD.scala:189)
at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1233)
at org.apache.spark.sql.SchemaRDD.getPartitions(SchemaRDD.scala:117)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1135)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:759)
at Join$$anonfun$main$1.apply$mcVI$sp(Join.scala:44)
at scala
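
For reference, a condensed, self-contained sketch of the reported reproduction is below.
The inlined sample rows are illustrative stand-ins for the text files used above; the
schemas, table names and query are the same:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._

// Condensed sketch of the report above: two cached tables joined on "id".
object CachedJoinRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CachedJoinRepro"))
    val sqlContext = new SQLContext(sc)

    val dataSchema = StructType(Seq(
      StructField("id", StringType, false),
      StructField("name", StringType, false),
      StructField("agg1", LongType, false),
      StructField("agg2", DoubleType, false)))
    val data = sqlContext.applySchema(
      sc.parallelize(Seq(Row("id2", "name1", 2L, 300.0))), dataSchema)
    data.registerTempTable("datatable")
    sqlContext.cacheTable("datatable")

    val refSchema = StructType(Seq(
      StructField("id", StringType, false),
      StructField("data", StringType, true)))
    val ref = sqlContext.applySchema(
      sc.parallelize(Seq(Row("id1", null))), refSchema)
    ref.registerTempTable("ref")
    sqlContext.cacheTable("ref")

    // The NPE is thrown while planning this join when both tables are cached.
    val results = sqlContext.sql(
      "SELECT d.id, d.name, d.agg1, d.agg2, ref.data FROM datatable as d join ref on d.id = ref.id")
    results.collect()
  }
}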

Re: BlockManager issues

2014-09-22 Thread David Rowe
I've run into this with large shuffles; I assumed that there was
contention for memory between the shuffle output files and the JVM.
Whenever we start getting these fetch failures, it corresponds with high
load on the machines the blocks are being fetched from, and in some cases
complete unresponsiveness (no ssh, etc.). Setting the timeout higher, or setting the
JVM heap lower (as a percentage of total machine memory), seemed to help.
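
For concreteness, both of the knobs mentioned here can be set on the SparkConf before
the SparkContext is created. The values below are illustrative only:

import org.apache.spark.SparkConf

// Illustrative values: a longer ack timeout (as suggested below) and a smaller
// executor heap, leaving more of the machine's memory to the OS and shuffle files.
val conf = new SparkConf()
  .set("spark.core.connection.ack.wait.timeout", "600") // seconds
  .set("spark.executor.memory", "16g")                  // well below the machine total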



On Mon, Sep 22, 2014 at 8:02 PM, Christoph Sawade <
christoph.saw...@googlemail.com> wrote:

> Hey all. We had also the same problem described by Nishkam almost in the
> same big data setting. We fixed the fetch failure by increasing the timeout
> for acks in the driver:
>
> set("spark.core.connection.ack.wait.timeout", "600") // 10 minutes timeout
> for acks between nodes
>
> Cheers, Christoph
>
> 2014-09-22 9:24 GMT+02:00 Hortonworks :
>
> > Actually I met similar issue when doing groupByKey and then count if the
> > shuffle size is big e.g. 1tb.
> >
> > Thanks.
> >
> > Zhan Zhang
> >
> > Sent from my iPhone
> >
> > > On Sep 21, 2014, at 10:56 PM, Nishkam Ravi  wrote:
> > >
> > > Thanks for the quick follow up Reynold and Patrick. Tried a run with
> > > significantly higher ulimit, doesn't seem to help. The executors have
> > 35GB
> > > each. Btw, with a recent version of the branch, the error message is
> > "fetch
> > > failures" as opposed to "too many open files". Not sure if they are
> > > related.  Please note that the workload runs fine with head set to
> > 066765d.
> > > In case you want to reproduce the problem: I'm running slightly
> modified
> > > ScalaPageRank (with KryoSerializer and persistence level
> > > memory_and_disk_ser) on a 30GB input dataset and a 6-node cluster.
> > >
> > > Thanks,
> > > Nishkam
> > >
> > > On Sun, Sep 21, 2014 at 10:32 PM, Patrick Wendell 
> > > wrote:
> > >
> > >> Ah I see it was SPARK-2711 (and PR1707). In that case, it's possible
> > >> that you are just having more spilling as a result of the patch and so
> > >> the filesystem is opening more files. I would try increasing the
> > >> ulimit.
> > >>
> > >> How much memory do your executors have?
> > >>
> > >> - Patrick
> > >>
> > >> On Sun, Sep 21, 2014 at 10:29 PM, Patrick Wendell  >
> > >> wrote:
> > >>> Hey the numbers you mentioned don't quite line up - did you mean PR
> > 2711?
> > >>>
> > >>> On Sun, Sep 21, 2014 at 8:45 PM, Reynold Xin 
> > >> wrote:
> >  It seems like you just need to raise the ulimit?
> > 
> > 
> >  On Sun, Sep 21, 2014 at 8:41 PM, Nishkam Ravi 
> > >> wrote:
> > 
> > > Recently upgraded to 1.1.0. Saw a bunch of fetch failures for one
> of
> > >> the
> > > workloads. Tried tracing the problem through change set analysis.
> > Looks
> > > like the offending commit is 4fde28c from Aug 4th for PR1707.
> Please
> > >> see
> > > SPARK-3633 for more details.
> > >
> > > Thanks,
> > > Nishkam
> > >>
> >
> >
> >
>


Re: BlockManager issues

2014-09-22 Thread Christoph Sawade
Hey all. We also had the same problem described by Nishkam, in almost the
same big-data setting. We fixed the fetch failures by increasing the timeout
for acks in the driver:

set("spark.core.connection.ack.wait.timeout", "600") // 10 minutes timeout
for acks between nodes

Cheers, Christoph

2014-09-22 9:24 GMT+02:00 Hortonworks :

> Actually I met similar issue when doing groupByKey and then count if the
> shuffle size is big e.g. 1tb.
>
> Thanks.
>
> Zhan Zhang
>
> Sent from my iPhone
>
> > On Sep 21, 2014, at 10:56 PM, Nishkam Ravi  wrote:
> >
> > Thanks for the quick follow up Reynold and Patrick. Tried a run with
> > significantly higher ulimit, doesn't seem to help. The executors have
> 35GB
> > each. Btw, with a recent version of the branch, the error message is
> "fetch
> > failures" as opposed to "too many open files". Not sure if they are
> > related.  Please note that the workload runs fine with head set to
> 066765d.
> > In case you want to reproduce the problem: I'm running slightly modified
> > ScalaPageRank (with KryoSerializer and persistence level
> > memory_and_disk_ser) on a 30GB input dataset and a 6-node cluster.
> >
> > Thanks,
> > Nishkam
> >
> > On Sun, Sep 21, 2014 at 10:32 PM, Patrick Wendell 
> > wrote:
> >
> >> Ah I see it was SPARK-2711 (and PR1707). In that case, it's possible
> >> that you are just having more spilling as a result of the patch and so
> >> the filesystem is opening more files. I would try increasing the
> >> ulimit.
> >>
> >> How much memory do your executors have?
> >>
> >> - Patrick
> >>
> >> On Sun, Sep 21, 2014 at 10:29 PM, Patrick Wendell 
> >> wrote:
> >>> Hey the numbers you mentioned don't quite line up - did you mean PR
> 2711?
> >>>
> >>> On Sun, Sep 21, 2014 at 8:45 PM, Reynold Xin 
> >> wrote:
>  It seems like you just need to raise the ulimit?
> 
> 
>  On Sun, Sep 21, 2014 at 8:41 PM, Nishkam Ravi 
> >> wrote:
> 
> > Recently upgraded to 1.1.0. Saw a bunch of fetch failures for one of
> >> the
> > workloads. Tried tracing the problem through change set analysis.
> Looks
> > like the offending commit is 4fde28c from Aug 4th for PR1707. Please
> >> see
> > SPARK-3633 for more details.
> >
> > Thanks,
> > Nishkam
> >>
>
>
>


Re: Dependency hell in Spark applications

2014-09-22 Thread inQ
Hello,

In my case, I manually deleted the org/apache/http directory from the
spark-assembly jar file.
I think that if we use the latest version of the httpclient (httpcore) library, we
can resolve the problem.
How about upgrading httpclient (or jets3t)?
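
As a concrete sketch of the two build-level ideas in this thread (excluding the
httpclient that comes in via Spark, and pinning a newer one explicitly), an sbt build
could look like the following. The versions are illustrative and untested, and as noted
further down the thread, whether this helps still depends on what ends up in the
spark-assembly jar on the cluster:

// build.sbt sketch: versions are illustrative, not a tested combination.
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided")
    .exclude("org.apache.httpcomponents", "httpclient"),
  "org.apache.httpcomponents" % "httpclient" % "4.3.6" // newer client for the AWS SDK
)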

2014-09-11 19:09 GMT+09:00 Aniket Bhatnagar :

> Thanks everyone for weighing in on this.
>
> I had backported kinesis module from master to spark 1.0.2 so just to
> confirm if I am not missing anything, I did a dependency graph compare of
> my spark build with spark-master
> and org.apache.httpcomponents:httpclient:jar does seem to resolve to 4.1.2
> dependency.
>
> I need Hive so, I can't really do a build without it. Even if I
> exclude httpclient
> dependency from my project's build, it will not solve the problem because
> AWS SDK has been compiled with a greater version of http client. My spark
> stream project does not use http client directly. AWS SDK will look for
>  class org.apache.http.impl.conn.DefaultClientConnectionOperator and it
> will be loaded from spark-assembly jar regardless of how I package my
> project (unless I am missing something?). I enabled verbosed classloading
> to confirm that the class is indeed loading from spark-assembly jar.
>
> spark.files.userClassPathFirst option doesn't seem to be working on my
> spark 1.0.2 build (not sure why).
>
> I was left with only custom-building Spark and forcibly introducing the
> latest httpclient version as a dependency.
>
> Finally, I tested this on 1.1.0-RC4 today and it has the same issue. Has
> anyone ever been able to get the Kinesis example work with spark-hadoop2.4
> (with hive and yarn) build? I feel like this is a bug that exists even in
> 1.1.0.
>
> I still believe we need a better solution to address the dependency hell
> problem. If OSGi is deemed too over the top, what are the solutions being
> investigated?
>
> On 6 September 2014 04:44, Ted Yu  wrote:
>
> > From output of dependency:tree:
> >
> > [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
> > spark-streaming_2.10 ---
> > [INFO] org.apache.spark:spark-streaming_2.10:jar:1.1.0-SNAPSHOT
> > [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0-SNAPSHOT:compile
> > [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.4.0:compile
> > ...
> > [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.9.0:compile
> > [INFO] |  |  +- commons-codec:commons-codec:jar:1.5:compile
> > [INFO] |  |  +- org.apache.httpcomponents:httpclient:jar:4.1.2:compile
> > [INFO] |  |  +- org.apache.httpcomponents:httpcore:jar:4.1.2:compile
> >
> > bq. excluding httpclient from spark-streaming dependency in your
> > sbt/maven project
> >
> > This should work.
> >
> >
> > On Fri, Sep 5, 2014 at 3:14 PM, Tathagata Das <
> tathagata.das1...@gmail.com
> > > wrote:
> >
> >> If httpClient dependency is coming from Hive, you could build Spark
> >> without
> >> Hive. Alternatively, have you tried excluding httpclient from
> >> spark-streaming dependency in your sbt/maven project?
> >>
> >> TD
> >>
> >>
> >>
> >> On Thu, Sep 4, 2014 at 6:42 AM, Koert Kuipers 
> wrote:
> >>
> >> > custom spark builds should not be the answer. at least not if spark
> ever
> >> > wants to have a vibrant community for spark apps.
> >> >
> >> > spark does support a user-classpath-first option, which would deal
> with
> >> > some of these issues, but I don't think it works.
> >> > On Sep 4, 2014 9:01 AM, "Felix Garcia Borrego" 
> >> wrote:
> >> >
> >> > > Hi,
> >> > > I run into the same issue and apart from the ideas Aniket said, I
> only
> >> > > could find a nasty workaround. Add my custom
> >> > PoolingClientConnectionManager
> >> > > to my classpath.
> >> > >
> >> > >
> >> > >
> >> >
> >>
> http://stackoverflow.com/questions/24788949/nosuchmethoderror-while-running-aws-s3-client-on-spark-while-javap-shows-otherwi/25488955#25488955
> >> > >
> >> > >
> >> > >
> >> > > On Thu, Sep 4, 2014 at 11:43 AM, Sean Owen 
> >> wrote:
> >> > >
> >> > > > Dumb question -- are you using a Spark build that includes the
> >> Kinesis
> >> > > > dependency? that build would have resolved conflicts like this for
> >> > > > you. Your app would need to use the same version of the Kinesis
> >> client
> >> > > > SDK, ideally.
> >> > > >
> >> > > > All of these ideas are well-known, yes. In cases of super-common
> >> > > > dependencies like Guava, they are already shaded. This is a
> >> > > > less-common source of conflicts so I don't think http-client is
> >> > > > shaded, especially since it is not used directly by Spark. I think
> >> > > > this is a case of your app conflicting with a third-party
> >> dependency?
> >> > > >
> >> > > > I think OSGi is deemed too over the top for things like this.
> >> > > >
> >> > > > On Thu, Sep 4, 2014 at 11:35 AM, Aniket Bhatnagar
> >> > > >  wrote:
> >> > > > > I am trying to use Kinesis as source to Spark Streaming and have
> >> run
> >> > > > into a
> >> > > > > dependency issue that can't be resolved without making my own
> >> custom
> >> > > > Spark
> >> > > > > build. The i

Re: BlockManager issues

2014-09-22 Thread Hortonworks
Actually, I hit a similar issue when doing groupByKey and then count if the
shuffle size is big, e.g. 1 TB.
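
The shape of that workload, as a tiny spark-shell sketch with placeholder data (the real
case shuffles on the order of 1 TB):

// Placeholder data; the real job shuffles ~1 TB through groupByKey before counting.
val records = sc.parallelize(Seq(("k1", "v1"), ("k1", "v2"), ("k2", "v3")))
records.groupByKey().count() // groupByKey materialises every value per key in the shuffle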

Thanks.

Zhan Zhang

Sent from my iPhone

> On Sep 21, 2014, at 10:56 PM, Nishkam Ravi  wrote:
> 
> Thanks for the quick follow up Reynold and Patrick. Tried a run with
> significantly higher ulimit, doesn't seem to help. The executors have 35GB
> each. Btw, with a recent version of the branch, the error message is "fetch
> failures" as opposed to "too many open files". Not sure if they are
> related.  Please note that the workload runs fine with head set to 066765d.
> In case you want to reproduce the problem: I'm running slightly modified
> ScalaPageRank (with KryoSerializer and persistence level
> memory_and_disk_ser) on a 30GB input dataset and a 6-node cluster.
> 
> Thanks,
> Nishkam
> 
> On Sun, Sep 21, 2014 at 10:32 PM, Patrick Wendell 
> wrote:
> 
>> Ah I see it was SPARK-2711 (and PR1707). In that case, it's possible
>> that you are just having more spilling as a result of the patch and so
>> the filesystem is opening more files. I would try increasing the
>> ulimit.
>> 
>> How much memory do your executors have?
>> 
>> - Patrick
>> 
>> On Sun, Sep 21, 2014 at 10:29 PM, Patrick Wendell 
>> wrote:
>>> Hey the numbers you mentioned don't quite line up - did you mean PR 2711?
>>> 
>>> On Sun, Sep 21, 2014 at 8:45 PM, Reynold Xin 
>> wrote:
 It seems like you just need to raise the ulimit?
 
 
 On Sun, Sep 21, 2014 at 8:41 PM, Nishkam Ravi 
>> wrote:
 
> Recently upgraded to 1.1.0. Saw a bunch of fetch failures for one of
>> the
> workloads. Tried tracing the problem through change set analysis. Looks
> like the offending commit is 4fde28c from Aug 4th for PR1707. Please
>> see
> SPARK-3633 for more details.
> 
> Thanks,
> Nishkam
>> 

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org