Re: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-25 Thread Ted Yu
HADOOP-10456 is fixed in Hadoop 2.4.1.

Does this mean that synchronization
on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK can be bypassed for Hadoop
2.4.1?
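
If it can, here is a rough sketch of the kind of version gate that could sit in
front of the lock. This is a hypothetical helper, not HadoopRDD's actual code;
it only checks whether the running Hadoop build already contains the
HADOOP-10456 fix (2.4.1 and later):

import org.apache.hadoop.util.VersionInfo

// Hypothetical helper: true if the running Hadoop version is 2.4.1 or newer,
// i.e. it already ships the HADOOP-10456 fix.
def hasHadoop10456Fix(version: String = VersionInfo.getVersion): Boolean = {
  val nums = version.split("[.-]").takeWhile(_.forall(_.isDigit)).map(_.toInt)
  val Array(major, minor, patch) = nums.padTo(3, 0).take(3)
  (major, minor, patch) match {
    case (m, _, _) if m > 2  => true
    case (2, n, _) if n > 4  => true
    case (2, 4, p) if p >= 1 => true
    case _                   => false  // older Hadoop: keep synchronizing on the lock
  }
}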

Cheers


On Fri, Jul 25, 2014 at 6:00 PM, Patrick Wendell  wrote:

> The most important issue in this release is actually an amendment to
> an earlier fix. The original fix caused a deadlock which was a
> regression from 1.0.0->1.0.1:
>
> Issue:
> https://issues.apache.org/jira/browse/SPARK-1097
>
> 1.0.1 Fix:
> https://github.com/apache/spark/pull/1273/files (had a deadlock)
>
> 1.0.2 Fix:
> https://github.com/apache/spark/pull/1409/files
>
> I failed to correctly label this on JIRA, but I've updated it!
>
> On Fri, Jul 25, 2014 at 5:35 PM, Michael Armbrust
>  wrote:
> > That query is looking at "Fix Version" not "Target Version".  The fact
> that
> > the first one is still open is only because the bug is not resolved in
> > master.  It is fixed in 1.0.2.  The second one is partially fixed in
> 1.0.2,
> > but is not worth blocking the release for.
> >
> >
> > On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas <
> > nicholas.cham...@gmail.com> wrote:
> >
> >> TD, there are a couple of unresolved issues slated for 1.0.2
> >> <
> >>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC
> >> >.
> >> Should they be edited somehow?
> >>
> >>
> >> On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das <
> >> tathagata.das1...@gmail.com>
> >> wrote:
> >>
> >> > Please vote on releasing the following candidate as Apache Spark
> version
> >> > 1.0.2.
> >> >
> >> > This release fixes a number of bugs in Spark 1.0.1.
> >> > Some of the notable ones are
> >> > - SPARK-2452: Known issue in Spark 1.0.1 caused by the attempted fix for
> >> > SPARK-1199. The fix was reverted for 1.0.2.
> >> > - SPARK-2576: NoClassDefFoundError when executing a Spark SQL query on
> >> > an HDFS CSV file.
> >> > The full list is at http://s.apache.org/9NJ
> >> >
> >> > The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
> >> >
> >> >
> >>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
> >> >
> >> > The release files, including signatures, digests, etc can be found at:
> >> > http://people.apache.org/~tdas/spark-1.0.2-rc1/
> >> >
> >> > Release artifacts are signed with the following key:
> >> > https://people.apache.org/keys/committer/tdas.asc
> >> >
> >> > The staging repository for this release can be found at:
> >> >
> https://repository.apache.org/content/repositories/orgapachespark-1024/
> >> >
> >> > The documentation corresponding to this release can be found at:
> >> > http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
> >> >
> >> > Please vote on releasing this package as Apache Spark 1.0.2!
> >> >
> >> > The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
> >> > a majority of at least 3 +1 PMC votes are cast.
> >> > [ ] +1 Release this package as Apache Spark 1.0.2
> >> > [ ] -1 Do not release this package because ...
> >> >
> >> > To learn more about Apache Spark, please see
> >> > http://spark.apache.org/
> >> >
> >>
>


Re: GraphX graph partitioning strategy

2014-07-25 Thread Larry Xiao

On 7/26/14, 4:03 AM, Ankur Dave wrote:

Oops, the code should be:

val unpartitionedGraph: Graph[Int, Int] = ...
val numPartitions: Int = 128
def getTripletPartition(e: EdgeTriplet[Int, Int]): PartitionID = ...
// Get the triplets using GraphX, then use Spark to repartition them
val partitionedEdges = unpartitionedGraph.triplets
  .map(e => (getTripletPartition(e), e))
  .partitionBy(new HashPartitioner(numPartitions))
  .map(pair => Edge(pair._2.srcId, pair._2.dstId, pair._2.attr))
val partitionedGraph = Graph(unpartitionedGraph.vertices, partitionedEdges)


Ankur 


Hi Ankur,

Thanks for the clear advice!

I tested the 4 partitioning algorithms in GraphX and implemented two others.
I find that EdgePartition2D performs the best.
(The two other algorithms perform only a tiny bit better on graphs that
are highly skewed or bipartite.)
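
For reference, applying one of the built-in strategies only takes a
repartitioning call on the graph. A minimal sketch, assuming a Graph[Int, Int]
named graph has already been built:

import org.apache.spark.graphx.{Graph, PartitionStrategy}

// Repartition an existing graph's edges with the built-in 2D strategy.
val partitioned: Graph[Int, Int] =
  graph.partitionBy(PartitionStrategy.EdgePartition2D)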


Larry


Re: Kryo Issue on Spark 1.0.1, Mesos 0.18.2

2014-07-25 Thread Gary Malouf
Maybe this is me misunderstanding the Spark system property behavior, but
I'm not clear on why the class being loaded ends up having '/' rather than '.'
in its fully qualified name.  When I tested this out locally, the '/'
characters were preventing the class from being loaded.
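
For comparison, this is roughly what a registrator and its configuration look
like with the usual dotted names. The class and registered types below are
placeholders, not the actual MX code:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Placeholder registrator, analogous in shape to MxDataRegistrator.
class ExampleRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // Register whatever domain classes the job ships between executors.
    kryo.register(classOf[Array[Byte]])
    kryo.register(classOf[scala.collection.mutable.ArrayBuffer[_]])
  }
}

The property is then set with a dotted name (exactly as in the spark-env.sh
quoted below), which is why the slash-separated name in the stack trace looks
so odd.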


On Fri, Jul 25, 2014 at 2:27 PM, Gary Malouf  wrote:

> After upgrading to Spark 1.0.1 from 0.9.1 everything seemed to be going
> well.  Looking at the Mesos slave logs, I noticed:
>
> ERROR KryoSerializer: Failed to run spark.kryo.registrator
> java.lang.ClassNotFoundException:
> com/mediacrossing/verrazano/kryo/MxDataRegistrator
>
> My spark-env.sh has the following when I run the Spark Shell:
>
> export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
>
> export MASTER=mesos://zk://n-01:2181,n-02:2181,n-03:2181/masters
>
> export ADD_JARS=/opt/spark/mx-lib/verrazano-assembly.jar
>
>
> # -XX:+UseCompressedOops must be disabled to use more than 32GB RAM
>
> SPARK_JAVA_OPTS="-Xss2m -XX:+UseCompressedOops
> -Dspark.local.dir=/opt/mesos-tmp -Dspark.executor.memory=4g
>  -Dspark.serializer=org.apache.spark.serializer.KryoSerializer
> -Dspark.kryo.registrator=com.mediacrossing.verrazano.kryo.MxDataRegistrator
> -Dspark.kryoserializer.buffer.mb=16 -Dspark.akka.askTimeout=30"
>
>
> I was able to verify that our custom jar was being copied to each worker,
> but for some reason it is not finding my registrator class.  Is anyone else
> struggling with Kryo on the 1.0.x branch?
>


Re: Suggestion for SPARK-1825

2014-07-25 Thread Patrick Wendell
Yeah, I agree reflection is the best solution. Whenever we do
reflection, we should clearly document in the code which YARN API
version corresponds to which code path. I'm guessing that since YARN is
adding new features, we'll just have to do this over time.

- Patrick
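
As a rough sketch of that pattern, here is the shape of a reflection call with
the API version documented next to it. The helper and the fallback value are
illustrative only, not an actual Spark call site:

// Illustrative only: use a YARN API that exists from Hadoop 2.4.0 on, and
// fall back for older versions.
def crossPlatformPwd(): String = {
  try {
    // Hadoop 2.4.0+: ApplicationConstants.Environment.PWD.$$() expands the
    // variable in a cluster-OS-independent way.
    val envClass = Class.forName(
      "org.apache.hadoop.yarn.api.ApplicationConstants$Environment")
    val pwd = envClass.getMethod("valueOf", classOf[String]).invoke(null, "PWD")
    envClass.getMethod("$$").invoke(pwd).asInstanceOf[String]
  } catch {
    case _: ClassNotFoundException | _: NoSuchMethodException =>
      // Hadoop < 2.4.0: keep the old Unix-style expansion.
      "$PWD"
  }
}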

On Fri, Jul 25, 2014 at 3:35 PM, Reynold Xin  wrote:
> Actually, reflection is probably a better, lighter-weight approach for this.
> An extra project brings more overhead for something simple.
>
>
>
>
>
> On Fri, Jul 25, 2014 at 3:09 PM, Colin McCabe 
> wrote:
>
>> So, I'm leaning more towards using reflection for this.  Maven profiles
>> could work, but it's tough since we have new stuff coming in 2.4, 2.5,
>> etc., and the number of profiles will multiply quickly if we have to do it
>> that way.  Reflection is the approach HBase took in a similar situation.
>>
>> best,
>> Colin
>>
>>
>> On Fri, Jul 25, 2014 at 11:23 AM, Colin McCabe 
>> wrote:
>>
>> > I have a similar issue with SPARK-1767.  There are basically three ways
>> to
>> > resolve the issue:
>> >
>> > 1. Use reflection to access classes newer than 0.21 (or whatever the
>> > oldest version of Hadoop is that Spark supports)
>> > 2. Add a build variant (in Maven this would be a profile) that deals with
>> > this.
>> > 3. Auto-detect which classes are available and use those.
>> >
>> > #1 is the easiest for end-users, but it can lead to some ugly code.
>> >
>> > #2 makes the code look nicer, but requires some effort on the part of
>> > people building spark.  This can also lead to headaches for IDEs, if
>> people
>> > don't remember to select the new profile.  (For example, in IntelliJ, you
>> > can't see any of the yarn classes when you import the project from Maven
>> > without the YARN profile selected.)
>> >
>> > #3 is something that... I don't know how to do in sbt or Maven.  I've
>> been
>> > told that an antrun task might work here, but it seems like it could get
>> > really tricky.
>> >
>> > Overall, I'd lean more towards #2 here.
>> >
>> > best,
>> > Colin
>> >
>> >
>> > On Tue, Jul 22, 2014 at 12:47 AM, innowireless TaeYun Kim <
>> > taeyun@innowireless.co.kr> wrote:
>> >
>> >> (I'm resending this mail since it seems that it was not sent. Sorry if
>> >> this
>> >> was already sent.)
>> >>
>> >> Hi,
>> >>
>> >>
>> >>
> >> >> A couple of months ago, I made a pull request to fix
>> >> https://issues.apache.org/jira/browse/SPARK-1825.
>> >>
>> >> My pull request is here: https://github.com/apache/spark/pull/899
>> >>
>> >>
>> >>
>> >> But that pull request has problems:
>> >>
> >> >> - It is Hadoop 2.4.0+ only. It won't compile on the versions below it.
> >> >>
> >> >> - The related Hadoop API is marked as '@Unstable'.
>> >>
>> >>
>> >>
>> >> Here is an idea to remedy the problems: a new Spark configuration
>> >> variable.
>> >>
> >> >> Maybe it could be named "spark.yarn.submit.crossplatform".
> >> >>
> >> >> If it is set to "true" (default is false), the related Spark code can
> >> >> use the hard-coded strings that are the same as what the Hadoop API
> >> >> provides, thus avoiding compile errors on Hadoop versions below 2.4.0.
>> >>
>> >>
>> >>
>> >> Can someone implement this feature, if this idea is acceptable?
>> >>
> >> >> Currently my knowledge of the Spark source code and Scala is too limited
> >> >> for me to implement it myself.
> >> >>
> >> >> For the right person, the modification should be trivial.
>> >>
>> >> You can refer to the source code changes of my pull request.
>> >>
>> >>
>> >>
>> >> Thanks.
>> >>
>> >>
>> >>
>> >>
>> >
>>


Re: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-25 Thread Nicholas Chammas
OK, thanks for the clarification.

On Friday, July 25, 2014, Michael Armbrust wrote:

> That query is looking at "Fix Version" not "Target Version".  The fact that
> the first one is still open is only because the bug is not resolved in
> master.  It is fixed in 1.0.2.  The second one is partially fixed in 1.0.2,
> but is not worth blocking the release for.
>
>
> On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com > wrote:
>
> > TD, there are a couple of unresolved issues slated for 1.0.2
> > <
> >
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC
> > >.
> > Should they be edited somehow?
> >
> >
> > On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das <
> > tathagata.das1...@gmail.com >
> > wrote:
> >
> > > Please vote on releasing the following candidate as Apache Spark
> version
> > > 1.0.2.
> > >
> > > This release fixes a number of bugs in Spark 1.0.1.
> > > Some of the notable ones are
> > > - SPARK-2452: Known issue in Spark 1.0.1 caused by the attempted fix for
> > > SPARK-1199. The fix was reverted for 1.0.2.
> > > - SPARK-2576: NoClassDefFoundError when executing a Spark SQL query on
> > > an HDFS CSV file.
> > > The full list is at http://s.apache.org/9NJ
> > >
> > > The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
> > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
> > >
> > > The release files, including signatures, digests, etc can be found at:
> > > http://people.apache.org/~tdas/spark-1.0.2-rc1/
> > >
> > > Release artifacts are signed with the following key:
> > > https://people.apache.org/keys/committer/tdas.asc
> > >
> > > The staging repository for this release can be found at:
> > >
> https://repository.apache.org/content/repositories/orgapachespark-1024/
> > >
> > > The documentation corresponding to this release can be found at:
> > > http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
> > >
> > > Please vote on releasing this package as Apache Spark 1.0.2!
> > >
> > > The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
> > > a majority of at least 3 +1 PMC votes are cast.
> > > [ ] +1 Release this package as Apache Spark 1.0.2
> > > [ ] -1 Do not release this package because ...
> > >
> > > To learn more about Apache Spark, please see
> > > http://spark.apache.org/
> > >
> >
>


Re: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-25 Thread Patrick Wendell
The most important issue in this release is actually an amendment to
an earlier fix. The original fix caused a deadlock which was a
regression from 1.0.0->1.0.1:

Issue:
https://issues.apache.org/jira/browse/SPARK-1097

1.0.1 Fix:
https://github.com/apache/spark/pull/1273/files (had a deadlock)

1.0.2 Fix:
https://github.com/apache/spark/pull/1409/files

I failed to correctly label this on JIRA, but I've updated it!

On Fri, Jul 25, 2014 at 5:35 PM, Michael Armbrust
 wrote:
> That query is looking at "Fix Version" not "Target Version".  The fact that
> the first one is still open is only because the bug is not resolved in
> master.  It is fixed in 1.0.2.  The second one is partially fixed in 1.0.2,
> but is not worth blocking the release for.
>
>
> On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> TD, there are a couple of unresolved issues slated for 1.0.2
>> <
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC
>> >.
>> Should they be edited somehow?
>>
>>
>> On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das <
>> tathagata.das1...@gmail.com>
>> wrote:
>>
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 1.0.2.
>> >
>> > This release fixes a number of bugs in Spark 1.0.1.
>> > Some of the notable ones are
>> > - SPARK-2452: Known issue in Spark 1.0.1 caused by the attempted fix for
>> > SPARK-1199. The fix was reverted for 1.0.2.
>> > - SPARK-2576: NoClassDefFoundError when executing a Spark SQL query on
>> > an HDFS CSV file.
>> > The full list is at http://s.apache.org/9NJ
>> >
>> > The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
>> >
>> >
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
>> >
>> > The release files, including signatures, digests, etc can be found at:
>> > http://people.apache.org/~tdas/spark-1.0.2-rc1/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/tdas.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1024/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
>> >
>> > Please vote on releasing this package as Apache Spark 1.0.2!
>> >
>> > The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
>> > a majority of at least 3 +1 PMC votes are cast.
>> > [ ] +1 Release this package as Apache Spark 1.0.2
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see
>> > http://spark.apache.org/
>> >
>>


Re: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-25 Thread Michael Armbrust
That query is looking at "Fix Version" not "Target Version".  The fact that
the first one is still open is only because the bug is not resolved in
master.  It is fixed in 1.0.2.  The second one is partially fixed in 1.0.2,
but is not worth blocking the release for.


On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> TD, there are a couple of unresolved issues slated for 1.0.2
> <
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC
> >.
> Should they be edited somehow?
>
>
> On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das <
> tathagata.das1...@gmail.com>
> wrote:
>
> > Please vote on releasing the following candidate as Apache Spark version
> > 1.0.2.
> >
> > This release fixes a number of bugs in Spark 1.0.1.
> > Some of the notable ones are
> > - SPARK-2452: Known issue in Spark 1.0.1 caused by the attempted fix for
> > SPARK-1199. The fix was reverted for 1.0.2.
> > - SPARK-2576: NoClassDefFoundError when executing a Spark SQL query on
> > an HDFS CSV file.
> > The full list is at http://s.apache.org/9NJ
> >
> > The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
> >
> > The release files, including signatures, digests, etc can be found at:
> > http://people.apache.org/~tdas/spark-1.0.2-rc1/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/tdas.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1024/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
> >
> > Please vote on releasing this package as Apache Spark 1.0.2!
> >
> > The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
> > a majority of at least 3 +1 PMC votes are cast.
> > [ ] +1 Release this package as Apache Spark 1.0.2
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> >
>


Re: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-25 Thread Nicholas Chammas
TD, there are a couple of unresolved issues slated for 1.0.2
<https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC>.
Should they be edited somehow?


On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.0.2.
>
> This release fixes a number of bugs in Spark 1.0.1.
> Some of the notable ones are
> - SPARK-2452: Known issue in Spark 1.0.1 caused by the attempted fix for
> SPARK-1199. The fix was reverted for 1.0.2.
> - SPARK-2576: NoClassDefFoundError when executing a Spark SQL query on
> an HDFS CSV file.
> The full list is at http://s.apache.org/9NJ
>
> The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
>
> The release files, including signatures, digests, etc can be found at:
> http://people.apache.org/~tdas/spark-1.0.2-rc1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/tdas.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1024/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.0.2!
>
> The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
> [ ] +1 Release this package as Apache Spark 1.0.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>


[VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-25 Thread Tathagata Das
Please vote on releasing the following candidate as Apache Spark version 1.0.2.

This release fixes a number of bugs in Spark 1.0.1.
Some of the notable ones are
- SPARK-2452: Known issue in Spark 1.0.1 caused by the attempted fix for
SPARK-1199. The fix was reverted for 1.0.2.
- SPARK-2576: NoClassDefFoundError when executing a Spark SQL query on
an HDFS CSV file.
The full list is at http://s.apache.org/9NJ

The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f

The release files, including signatures, digests, etc can be found at:
http://people.apache.org/~tdas/spark-1.0.2-rc1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/tdas.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1024/

The documentation corresponding to this release can be found at:
http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/

Please vote on releasing this package as Apache Spark 1.0.2!

The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.0.2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/


Re: Suggestion for SPARK-1825

2014-07-25 Thread Reynold Xin
Actually, reflection is probably a better, lighter-weight approach for this.
An extra project brings more overhead for something simple.





On Fri, Jul 25, 2014 at 3:09 PM, Colin McCabe 
wrote:

> So, I'm leaning more towards using reflection for this.  Maven profiles
> could work, but it's tough since we have new stuff coming in 2.4, 2.5,
> etc., and the number of profiles will multiply quickly if we have to do it
> that way.  Reflection is the approach HBase took in a similar situation.
>
> best,
> Colin
>
>
> On Fri, Jul 25, 2014 at 11:23 AM, Colin McCabe 
> wrote:
>
> > I have a similar issue with SPARK-1767.  There are basically three ways
> to
> > resolve the issue:
> >
> > 1. Use reflection to access classes newer than 0.21 (or whatever the
> > oldest version of Hadoop is that Spark supports)
> > 2. Add a build variant (in Maven this would be a profile) that deals with
> > this.
> > 3. Auto-detect which classes are available and use those.
> >
> > #1 is the easiest for end-users, but it can lead to some ugly code.
> >
> > #2 makes the code look nicer, but requires some effort on the part of
> > people building spark.  This can also lead to headaches for IDEs, if
> people
> > don't remember to select the new profile.  (For example, in IntelliJ, you
> > can't see any of the yarn classes when you import the project from Maven
> > without the YARN profile selected.)
> >
> > #3 is something that... I don't know how to do in sbt or Maven.  I've
> been
> > told that an antrun task might work here, but it seems like it could get
> > really tricky.
> >
> > Overall, I'd lean more towards #2 here.
> >
> > best,
> > Colin
> >
> >
> > On Tue, Jul 22, 2014 at 12:47 AM, innowireless TaeYun Kim <
> > taeyun@innowireless.co.kr> wrote:
> >
> >> (I'm resending this mail since it seems that it was not sent. Sorry if
> >> this
> >> was already sent.)
> >>
> >> Hi,
> >>
> >>
> >>
> > >> A couple of months ago, I made a pull request to fix
> >> https://issues.apache.org/jira/browse/SPARK-1825.
> >>
> >> My pull request is here: https://github.com/apache/spark/pull/899
> >>
> >>
> >>
> >> But that pull request has problems:
> >>
> > >> - It is Hadoop 2.4.0+ only. It won't compile on the versions below it.
> > >>
> > >> - The related Hadoop API is marked as '@Unstable'.
> >>
> >>
> >>
> >> Here is an idea to remedy the problems: a new Spark configuration
> >> variable.
> >>
> > >> Maybe it could be named "spark.yarn.submit.crossplatform".
> > >>
> > >> If it is set to "true" (default is false), the related Spark code can
> > >> use the hard-coded strings that are the same as what the Hadoop API
> > >> provides, thus avoiding compile errors on Hadoop versions below 2.4.0.
> >>
> >>
> >>
> >> Can someone implement this feature, if this idea is acceptable?
> >>
> > >> Currently my knowledge of the Spark source code and Scala is too limited
> > >> for me to implement it myself.
> > >>
> > >> For the right person, the modification should be trivial.
> >>
> >> You can refer to the source code changes of my pull request.
> >>
> >>
> >>
> >> Thanks.
> >>
> >>
> >>
> >>
> >
>


Re: Suggestion for SPARK-1825

2014-07-25 Thread Colin McCabe
So, I'm leaning more towards using reflection for this.  Maven profiles
could work, but it's tough since we have new stuff coming in 2.4, 2.5,
etc., and the number of profiles will multiply quickly if we have to do it
that way.  Reflection is the approach HBase took in a similar situation.

best,
Colin


On Fri, Jul 25, 2014 at 11:23 AM, Colin McCabe 
wrote:

> I have a similar issue with SPARK-1767.  There are basically three ways to
> resolve the issue:
>
> 1. Use reflection to access classes newer than 0.21 (or whatever the
> oldest version of Hadoop is that Spark supports)
> 2. Add a build variant (in Maven this would be a profile) that deals with
> this.
> 3. Auto-detect which classes are available and use those.
>
> #1 is the easiest for end-users, but it can lead to some ugly code.
>
> #2 makes the code look nicer, but requires some effort on the part of
> people building spark.  This can also lead to headaches for IDEs, if people
> don't remember to select the new profile.  (For example, in IntelliJ, you
> can't see any of the yarn classes when you import the project from Maven
> without the YARN profile selected.)
>
> #3 is something that... I don't know how to do in sbt or Maven.  I've been
> told that an antrun task might work here, but it seems like it could get
> really tricky.
>
> Overall, I'd lean more towards #2 here.
>
> best,
> Colin
>
>
> On Tue, Jul 22, 2014 at 12:47 AM, innowireless TaeYun Kim <
> taeyun@innowireless.co.kr> wrote:
>
>> (I'm resending this mail since it seems that it was not sent. Sorry if
>> this
>> was already sent.)
>>
>> Hi,
>>
>>
>>
>> A couple of months ago, I made a pull request to fix
>> https://issues.apache.org/jira/browse/SPARK-1825.
>>
>> My pull request is here: https://github.com/apache/spark/pull/899
>>
>>
>>
>> But that pull request has problems:
>>
>> - It is Hadoop 2.4.0+ only. It won't compile on the versions below it.
>>
>> - The related Hadoop API is marked as '@Unstable'.
>>
>>
>>
>> Here is an idea to remedy the problems: a new Spark configuration
>> variable.
>>
>> Maybe it could be named "spark.yarn.submit.crossplatform".
>>
>> If it is set to "true" (default is false), the related Spark code can
>> use the hard-coded strings that are the same as what the Hadoop API
>> provides, thus avoiding compile errors on Hadoop versions below 2.4.0.
>>
>>
>>
>> Can someone implement this feature, if this idea is acceptable?
>>
>> Currently my knowledge of the Spark source code and Scala is too limited
>> for me to implement it myself.
>>
>> For the right person, the modification should be trivial.
>>
>> You can refer to the source code changes of my pull request.
>>
>>
>>
>> Thanks.
>>
>>
>>
>>
>


Re: GraphX graph partitioning strategy

2014-07-25 Thread Ankur Dave
Oops, the code should be:

val unpartitionedGraph: Graph[Int, Int] = ...
val numPartitions: Int = 128
def getTripletPartition(e: EdgeTriplet[Int, Int]): PartitionID = ...
// Get the triplets using GraphX, then use Spark to repartition them
val partitionedEdges = unpartitionedGraph.triplets
  .map(e => (getTripletPartition(e), e))
  .partitionBy(new HashPartitioner(numPartitions))
  .map(pair => Edge(pair._2.srcId, pair._2.dstId, pair._2.attr))
val partitionedGraph = Graph(unpartitionedGraph.vertices, partitionedEdges)


Ankur 


Re: GraphX graph partitioning strategy

2014-07-25 Thread Ankur Dave
Hi Larry,

GraphX's graph constructor leaves the edges in their original partitions by
default. To support arbitrary multipass graph partitioning, one idea is to
take advantage of that by partitioning the graph externally to GraphX
(though possibly using information from GraphX such as the degrees), then
pass the partitioned edges to GraphX.

For example, if you had an edge partitioning function that needed the full
triplet to assign a partition, you could do this as follows:

val unpartitionedGraph: Graph[Int, Int] = ...
val numPartitions: Int = 128
def getTripletPartition(e: EdgeTriplet[Int, Int]): PartitionID = ...
// Get the triplets using GraphX, then use Spark to repartition them
val partitionedEdges = unpartitionedGraph.triplets
  .map(e => (getTripletPartition(e), e))
  .partitionBy(new HashPartitioner(numPartitions))
val partitionedGraph = Graph(unpartitionedGraph.vertices, partitionedEdges)


A multipass partitioning algorithm could store its results in the edge
attribute, and then you could use the code above to do the partitioning.
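
For instance, a degree-aware pass could first join the vertex degrees into the
graph and then route each edge by its higher-degree endpoint. A rough sketch,
reusing numPartitions from above (this is just an illustration, not any
particular published partitioner):

import org.apache.spark.graphx._

// Pass 1: attach each vertex's degree as its vertex attribute.
val degreeGraph: Graph[Int, Int] =
  unpartitionedGraph.outerJoinVertices(unpartitionedGraph.degrees) {
    (_, _, deg) => deg.getOrElse(0)
  }

// Pass 2: pick a partition per edge from the full triplet, here by hashing
// the endpoint with the higher degree. Use degreeGraph.triplets in place of
// unpartitionedGraph.triplets in the snippet above.
def getTripletPartition(e: EdgeTriplet[Int, Int]): PartitionID = {
  val pivot: VertexId = if (e.srcAttr >= e.dstAttr) e.srcId else e.dstId
  (pivot.hashCode & Int.MaxValue) % numPartitions
}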

Ankur 


On Wed, Jul 23, 2014 at 11:59 PM, Larry Xiao  wrote:

> Hi all,
>
> I'm implementing graph partitioning strategies for GraphX, learning from
> research on graph computing.
>
> I have two questions:
>
> - a specific implement question:
> In the current design, only the vertex IDs of src and dst are provided
> (PartitionStrategy.scala).
> And some strategies require knowledge about the graph (like degrees) and
> can consist of more than one pass to finally produce the partition ID.
> So I'm changing the PartitionStrategy.getPartition API to provide more
> info, but I don't want to make it complex. (The current one looks very
> clean.)
>
> - an open question:
> What advice would you give on partitioning, considering the
> procedure Spark adopts for graph processing?
>
> Any advice is much appreciated.
>
> Best Regards,
> Larry Xiao
>
> Reference
>
> Bipartite-oriented Distributed Graph Partitioning for Big Learning.
> PowerLyra: Differentiated Graph Computation and Partitioning on Skewed
> Graphs
>


Kryo Issue on Spark 1.0.1, Mesos 0.18.2

2014-07-25 Thread Gary Malouf
After upgrading to Spark 1.0.1 from 0.9.1 everything seemed to be going
well.  Looking at the Mesos slave logs, I noticed:

ERROR KryoSerializer: Failed to run spark.kryo.registrator
java.lang.ClassNotFoundException:
com/mediacrossing/verrazano/kryo/MxDataRegistrator

My spark-env.sh has the following when I run the Spark Shell:

export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so

export MASTER=mesos://zk://n-01:2181,n-02:2181,n-03:2181/masters

export ADD_JARS=/opt/spark/mx-lib/verrazano-assembly.jar


# -XX:+UseCompressedOops must be disabled to use more than 32GB RAM

SPARK_JAVA_OPTS="-Xss2m -XX:+UseCompressedOops
-Dspark.local.dir=/opt/mesos-tmp -Dspark.executor.memory=4g
 -Dspark.serializer=org.apache.spark.serializer.KryoSerializer
-Dspark.kryo.registrator=com.mediacrossing.verrazano.kryo.MxDataRegistrator
-Dspark.kryoserializer.buffer.mb=16 -Dspark.akka.askTimeout=30"


I was able to verify that our custom jar was being copied to each worker,
but for some reason it is not finding my registrator class.  Is anyone else
struggling with Kryo on the 1.0.x branch?


Re: Suggestion for SPARK-1825

2014-07-25 Thread Colin McCabe
I have a similar issue with SPARK-1767.  There are basically three ways to
resolve the issue:

1. Use reflection to access classes newer than 0.21 (or whatever the oldest
version of Hadoop is that Spark supports)
2. Add a build variant (in Maven this would be a profile) that deals with
this.
3. Auto-detect which classes are available and use those.

#1 is the easiest for end-users, but it can lead to some ugly code.

#2 makes the code look nicer, but requires some effort on the part of
people building spark.  This can also lead to headaches for IDEs, if people
don't remember to select the new profile.  (For example, in IntelliJ, you
can't see any of the yarn classes when you import the project from Maven
without the YARN profile selected.)

#3 is something that... I don't know how to do in sbt or Maven.  I've been
told that an antrun task might work here, but it seems like it could get
really tricky.

Overall, I'd lean more towards #2 here.

best,
Colin


On Tue, Jul 22, 2014 at 12:47 AM, innowireless TaeYun Kim <
taeyun@innowireless.co.kr> wrote:

> (I'm resending this mail since it seems that it was not sent. Sorry if this
> was already sent.)
>
> Hi,
>
>
>
> A couple of months ago, I made a pull request to fix
> https://issues.apache.org/jira/browse/SPARK-1825.
>
> My pull request is here: https://github.com/apache/spark/pull/899
>
>
>
> But that pull request has problems:
>
> - It is Hadoop 2.4.0+ only. It won't compile on the versions below it.
>
> - The related Hadoop API is marked as '@Unstable'.
>
>
>
> Here is an idea to remedy the problems: a new Spark configuration variable.
>
> Maybe it could be named "spark.yarn.submit.crossplatform".
>
> If it is set to "true" (default is false), the related Spark code can
> use the hard-coded strings that are the same as what the Hadoop API
> provides, thus avoiding compile errors on Hadoop versions below 2.4.0.
>
>
>
> Can someone implement this feature, if this idea is acceptable?
>
> Currently my knowledge of the Spark source code and Scala is too limited
> for me to implement it myself.
>
> For the right person, the modification should be trivial.
>
> You can refer to the source code changes of my pull request.
>
>
>
> Thanks.
>
>
>
>
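
A sketch of how the proposed flag could be consumed on the Spark side. The
property name comes from the suggestion above; everything else here is
hypothetical and not an existing Spark code path:

import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
val crossPlatform = sparkConf.getBoolean("spark.yarn.submit.crossplatform", false)

// Hard-coded equivalents of what the Hadoop 2.4.0+ API returns, so the code
// still compiles against older Hadoop versions. The literal values are
// assumptions modeled on the Hadoop 2.4 constants.
val (classPathSeparator, pwdReference) =
  if (crossPlatform) ("<CPS>", "{{PWD}}")
  else (java.io.File.pathSeparator, "$PWD")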


Re: Configuring Spark Memory

2014-07-25 Thread John Omernik
So this is good information for standalone mode, but how is memory distributed
within Mesos? There's coarse-grained mode, where the executor stays active, and
there's fine-grained mode, where it appears each task is its own process in
Mesos. How do memory allocations work in these cases? Thanks!
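
For context, the mode itself is just a configuration flag on the context. A
minimal sketch of selecting it (this only shows the configuration; it does not
answer how the allocation is carved up per task):

import org.apache.spark.{SparkConf, SparkContext}

// Coarse-grained mode: one long-lived executor per slave holds the requested
// spark.executor.memory for the lifetime of the SparkContext. Setting
// spark.mesos.coarse to "false" (the 1.0.x default) gives fine-grained mode.
val conf = new SparkConf()
  .setMaster("mesos://zk://n-01:2181,n-02:2181,n-03:2181/masters")
  .setAppName("mesos-mode-example")
  .set("spark.executor.memory", "4g")
  .set("spark.mesos.coarse", "true")
val sc = new SparkContext(conf)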



On Thu, Jul 24, 2014 at 12:14 PM, Martin Goodson 
wrote:

> Great - thanks for the clarification Aaron. The offer stands for me to
> write some documentation and an example that covers this without leaving
> *any* room for ambiguity.
>
>
>
>
> --
> Martin Goodson  |  VP Data Science
> (0)20 3397 1240
>
>
> On Thu, Jul 24, 2014 at 6:09 PM, Aaron Davidson 
> wrote:
>
>> Whoops, I was mistaken in my original post last year. By default, there
>> is one executor per node per Spark Context, as you said.
>> "spark.executor.memory" is the amount of memory that the application
>> requests for each of its executors. SPARK_WORKER_MEMORY is the amount of
>> memory a Spark Worker is willing to allocate in executors.
>>
>> So if you were to set SPARK_WORKER_MEMORY to 8g everywhere on your
>> cluster, and spark.executor.memory to 4g, you would be able to run 2
>> simultaneous Spark Contexts that each get 4g per node. Similarly, if
>> spark.executor.memory were 8g, you could only run 1 Spark Context at a time
>> on the cluster, but it would get all the cluster's memory.
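
A small sketch of that arithmetic in configuration form, with the values from
the example above (the app name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// spark-env.sh on every worker: SPARK_WORKER_MEMORY=8g
// Each application asks for 4g per executor, so two applications
// (8g / 4g = 2) can hold executors on every worker at the same time.
val conf = new SparkConf()
  .setAppName("app-a")  // a second application would use the same setting
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)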
>>
>>
>> On Thu, Jul 24, 2014 at 7:25 AM, Martin Goodson 
>> wrote:
>>
>>> Thank you Nishkam,
>>> I have read your code. So, for the sake of my understanding, it seems
>>> that for each Spark context there is one executor per node? Can anyone
>>> confirm this?
>>>
>>>
>>> --
>>> Martin Goodson  |  VP Data Science
>>> (0)20 3397 1240
>>>
>>>
>>> On Thu, Jul 24, 2014 at 6:12 AM, Nishkam Ravi 
>>> wrote:
>>>
 See if this helps:

 https://github.com/nishkamravi2/SparkAutoConfig/

 It's a very simple tool for auto-configuring default parameters in
 Spark. Takes as input high-level parameters (like number of nodes, cores
 per node, memory per node, etc) and spits out default configuration, user
 advice and command line. Compile (javac SparkConfigure.java) and run (java
 SparkConfigure).

 Also cc'ing dev in case others are interested in helping evolve this
 over time (by refining the heuristics and adding more parameters).


  On Wed, Jul 23, 2014 at 8:31 AM, Martin Goodson 
 wrote:

> Thanks Andrew,
>
> So if there is only one SparkContext there is only one executor per
> machine? This seems to contradict Aaron's message from the link above:
>
> "If each machine has 16 GB of RAM and 4 cores, for example, you might
> set spark.executor.memory between 2 and 3 GB, totaling 8-12 GB used by
> Spark.)"
>
> Am I reading this incorrectly?
>
> Anyway our configuration is 21 machines (one master and 20 slaves)
> each with 60Gb. We would like to use 4 cores per machine. This is pyspark
> so we want to leave say 16Gb on each machine for python processes.
>
> Thanks again for the advice!
>
>
>
> --
> Martin Goodson  |  VP Data Science
> (0)20 3397 1240
>
>
> On Wed, Jul 23, 2014 at 4:19 PM, Andrew Ash 
> wrote:
>
>> Hi Martin,
>>
>> In standalone mode, each SparkContext you initialize gets its own set
>> of executors across the cluster.  So for example if you have two shells
>> open, they'll each get two JVMs on each worker machine in the cluster.
>>
>> As far as the other docs, you can configure the total number of cores
>> requested for the SparkContext, the amount of memory for the executor JVM
>> on each machine, the amount of memory for the Master/Worker daemons 
>> (little
>> needed since work is done in executors), and several other settings.
>>
>> Which of those are you interested in?  What spec hardware do you have
>> and how do you want to configure it?
>>
>> Andrew
>>
>>
>> On Wed, Jul 23, 2014 at 6:10 AM, Martin Goodson > > wrote:
>>
>>> We are having difficulties configuring Spark, partly because we
>>> still don't understand some key concepts. For instance, how many 
>>> executors
>>> are there per machine in standalone mode? This is after having
>>> closely read the documentation several times:
>>>
>>> http://spark.apache.org/docs/latest/configuration.html
>>> http://spark.apache.org/docs/latest/spark-standalone.html
>>> http://spark.apache.org/docs/latest/tuning.html
>>> http://spark.apache.org/docs/latest/cluster-overview.html
>>>