[jira] [Resolved] (SPARK-4660) JavaSerializer uses wrong classloader

2015-01-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4660.

   Resolution: Fixed
Fix Version/s: 1.2.1, 1.1.2, 1.3.0

> JavaSerializer uses wrong classloader
> -
>
> Key: SPARK-4660
> URL: https://issues.apache.org/jira/browse/SPARK-4660
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Piotr Kołaczkowski
>Priority: Critical
> Fix For: 1.3.0, 1.1.2, 1.2.1
>
> Attachments: spark-serializer-classloader.patch
>
>
> During testing we found failures when trying to load some classes of the user 
> application:
> {noformat}
> ERROR 2014-11-29 20:01:56 org.apache.spark.storage.BlockManagerWorker: 
> Exception handling buffer message
> java.lang.ClassNotFoundException: 
> org.apache.spark.demo.HttpReceiverCases$HttpRequest
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
>   at 
> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:235)
>   at 
> org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:126)
>   at 
> org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:104)
>   at org.apache.spark.storage.MemoryStore.putBytes(MemoryStore.scala:76)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:748)
>   at 
> org.apache.spark.storage.BlockManager.putBytes(BlockManager.scala:639)
>   at 
> org.apache.spark.storage.BlockManagerWorker.putBlock(BlockManagerWorker.scala:92)
>   at 
> org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:73)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at 
> org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at 
> org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
>   at 
> org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:48)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:38)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:38)
>   at 
> org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:682)
>   at 
> org.apache.spark.network.ConnectionManager$$anon$10.run(ConnectionManager.scala:520)
>   at 
> java.util.concurrent.ThreadPoo
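
The stack trace shows JavaDeserializationStream resolving the class through a classloader that does not know about the application's jars on the executor. A minimal sketch of a loader-aware object input stream, with illustrative names only (this is not the actual Spark patch):

{code}
import java.io.{InputStream, ObjectInputStream, ObjectStreamClass}

// Sketch: resolve classes against an explicitly supplied loader, falling back to
// the thread context classloader, so user classes shipped with the job are found.
class LoaderAwareObjectInputStream(in: InputStream, loader: ClassLoader)
    extends ObjectInputStream(in) {
  override def resolveClass(desc: ObjectStreamClass): Class[_] = {
    val cl = Option(loader).getOrElse(Thread.currentThread().getContextClassLoader)
    Class.forName(desc.getName, false, cl)
  }
}
{code}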

[jira] [Resolved] (SPARK-5270) Provide isEmpty() function in RDD API

2015-01-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5270.

   Resolution: Fixed
Fix Version/s: 1.3.0

> Provide isEmpty() function in RDD API
> -
>
> Key: SPARK-5270
> URL: https://issues.apache.org/jira/browse/SPARK-5270
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.2.0
> Environment: Centos 6
>Reporter: Al M
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 1.3.0
>
>
> Right now there is no clean way to check if an RDD is empty.  As discussed 
> here: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679
> I'd like a method rdd.isEmpty that returns a boolean.
> This would be especially useful when using streams.  Sometimes my batches are 
> huge in one stream, sometimes I get nothing for hours.  Still I have to run 
> count() to check if there is anything in the RDD.  I can process my empty RDD 
> like the others but it would be more efficient to just skip the empty ones.
> I can also run first() and catch the exception; this is neither a clean nor 
> fast solution.
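
A lightweight way to get this behaviour, and roughly the idea behind such an API, is to look at a single element instead of counting the whole RDD; a hedged sketch (not necessarily the implementation that was merged for this issue):

{code}
import org.apache.spark.rdd.RDD

// Sketch: an RDD is empty iff taking one element yields nothing. Unlike count(),
// take(1) stops as soon as it finds data instead of scanning every partition.
def isEmpty[T](rdd: RDD[T]): Boolean = rdd.take(1).isEmpty
{code}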






[jira] [Updated] (SPARK-5270) Provide isEmpty() function in RDD API

2015-01-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5270:
---
Summary: Provide isEmpty() function in RDD API  (was: Provide isEmpty 
utility function in RDD API)

> Provide isEmpty() function in RDD API
> -
>
> Key: SPARK-5270
> URL: https://issues.apache.org/jira/browse/SPARK-5270
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.2.0
> Environment: Centos 6
>Reporter: Al M
>Assignee: Sean Owen
>Priority: Trivial
>
> Right now there is no clean way to check if an RDD is empty.  As discussed 
> here: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679
> I'd like a method rdd.isEmpty that returns a boolean.
> This would be especially useful when using streams.  Sometimes my batches are 
> huge in one stream, sometimes I get nothing for hours.  Still I have to run 
> count() to check if there is anything in the RDD.  I can process my empty RDD 
> like the others but it would be more efficient to just skip the empty ones.
> I can also run first() and catch the exception; this is neither a clean nor 
> fast solution.






[jira] [Updated] (SPARK-5270) Provide isEmpty utility function in RDD API

2015-01-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5270:
---
Assignee: Sean Owen

> Provide isEmpty utility function in RDD API
> ---
>
> Key: SPARK-5270
> URL: https://issues.apache.org/jira/browse/SPARK-5270
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.2.0
> Environment: Centos 6
>Reporter: Al M
>Assignee: Sean Owen
>Priority: Trivial
>
> Right now there is no clean way to check if an RDD is empty.  As discussed 
> here: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679
> I'd like a method rdd.isEmpty that returns a boolean.
> This would be especially useful when using streams.  Sometimes my batches are 
> huge in one stream, sometimes I get nothing for hours.  Still I have to run 
> count() to check if there is anything in the RDD.  I can process my empty RDD 
> like the others but it would be more efficient to just skip the empty ones.
> I can also run first() and catch the exception; this is neither a clean nor 
> fast solution.






[jira] [Updated] (SPARK-5270) Provide isEmpty utility function in RDD API

2015-01-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5270:
---
Summary: Provide isEmpty utility function in RDD API  (was: Elegantly check 
if RDD is empty)

> Provide isEmpty utility function in RDD API
> ---
>
> Key: SPARK-5270
> URL: https://issues.apache.org/jira/browse/SPARK-5270
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.2.0
> Environment: Centos 6
>Reporter: Al M
>Priority: Trivial
>
> Right now there is no clean way to check if an RDD is empty.  As discussed 
> here: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679
> I'd like a method rdd.isEmpty that returns a boolean.
> This would be especially useful when using streams.  Sometimes my batches are 
> huge in one stream, sometimes I get nothing for hours.  Still I have to run 
> count() to check if there is anything in the RDD.  I can process my empty RDD 
> like the others but it would be more efficient to just skip the empty ones.
> I can also run first() and catch the exception; this is neither a clean nor 
> fast solution.






[jira] [Updated] (SPARK-5297) File Streams do not work with custom key/values

2015-01-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5297:
---
Description: 
The following code:
{code}
stream_context.<K,V,SequenceFileInputFormat<K,V>>fileStream(directory)
.foreachRDD(new Function<JavaPairRDD<K,V>,Void>() {
 public Void call ( JavaPairRDD<K,V> rdd ) throws Exception {
 for ( Tuple2<K,V> x: rdd.collect() )
 System.out.println("# "+x._1+" "+x._2);
 return null;
 }
  });
stream_context.start();
stream_context.awaitTermination();
{code}
for custom (serializable) classes K and V compiles fine, but gives an error
when I drop a new Hadoop sequence file in the directory:
{quote}
15/01/17 09:13:59 ERROR scheduler.JobScheduler: Error generating jobs for time 
1421507639000 ms
java.lang.ClassCastException: java.lang.Object cannot be cast to 
org.apache.hadoop.mapreduce.InputFormat
at 
org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:91)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at 
org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$3.apply(FileInputDStream.scala:236)
at 
org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$3.apply(FileInputDStream.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:234)
at 
org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:128)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:296)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:288)
at scala.Option.orElse(Option.scala:257)
{quote}
The same classes K and V work fine for non-streaming Spark:
{code}
spark_context.newAPIHadoopFile(path,F.class,K.class,SequenceFileInputFormat.class,conf)
{code}
Streaming also works fine for TextFileInputFormat.

The issue is that class manifests are erased to Object in the Java file stream 
constructor, but they are relied on downstream when creating the Hadoop RDD 
that backs each batch of the file stream.

https://github.com/apache/spark/blob/v1.2.0/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala#L263
https://github.com/apache/spark/blob/v1.2.0/core/src/main/scala/org/apache/spark/SparkContext.scala#L753
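
One way to preserve that type information on the Java path, sketched here purely as an illustration (the helper name is mine, not the actual fix), is to build real ClassTags from the Class objects the Java caller already supplies, instead of erasing everything to Object:

{code}
import scala.reflect.ClassTag

// Illustrative helper: derive ClassTags from caller-supplied Class objects so the
// underlying Scala machinery sees the real key, value, and InputFormat types.
def classTagsFor[K, V, F](kClass: Class[K], vClass: Class[V], fClass: Class[F])
    : (ClassTag[K], ClassTag[V], ClassTag[F]) =
  (ClassTag(kClass), ClassTag(vClass), ClassTag(fClass))
{code}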


  was:
The following code:
{code}
stream_context.<K,V,SequenceFileInputFormat<K,V>>fileStream(directory)
.foreachRDD(new Function<JavaPairRDD<K,V>,Void>() {
 public Void call ( JavaPairRDD<K,V> rdd ) throws Exception {
 for ( Tuple2<K,V> x: rdd.collect() )
 System.out.println("# "+x._1+" "+x._2);
 return null;
 }
  });
stream_context.start();
stream_context.awaitTermination();
{code}
for custom (serializable) classes K and V compiles fine but gives an error
when I drop a new hadoop sequence file in the directory:
{quote}
15/01/17 09:13:59 ERROR scheduler.JobScheduler: Error generating jobs for time 
1421507639000 ms
java.lang.ClassCastException: java.lang.Object cannot be cast to 
org.apache.hadoop.mapreduce.InputFormat
at 
org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:91)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at 
org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$3.apply(FileInputDStream.scala:236)
at 
org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$3.apply(FileInputDStream.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.Abstr

Re: Semantics of LGTM

2015-01-19 Thread Patrick Wendell
The wiki does not seem to be operational ATM, but I will do this when
it is back up.

On Mon, Jan 19, 2015 at 12:00 PM, Patrick Wendell  wrote:
> Okay - so given all this I was going to put the following on the wiki
> tentatively:
>
> ## Reviewing Code
> Community code review is Spark's fundamental quality assurance
> process. When reviewing a patch, your goal should be to help
> streamline the committing process by giving committers confidence this
> patch has been verified by an additional party. It's encouraged to
> (politely) submit technical feedback to the author to identify areas
> for improvement or potential bugs.
>
> If you feel a patch is ready for inclusion in Spark, indicate this to
> committers with a comment: "I think this patch looks good". Spark uses
> the LGTM convention for indicating the highest level of technical
> sign-off on a patch: simply comment with the word "LGTM". An LGTM is a
> strong statement; it should be interpreted as follows: "I've
> looked at this thoroughly and take as much ownership as if I wrote the
> patch myself". If you comment LGTM you will be expected to help with
> bugs or follow-up issues on the patch. Judicious use of LGTM's is a
> great way to gain credibility as a reviewer with the broader
> community.
>
> It's also welcome for reviewers to argue against the inclusion of a
> feature or patch. Simply indicate this in the comments.
>
> - Patrick
>
> On Mon, Jan 19, 2015 at 2:40 AM, Prashant Sharma  wrote:
>> Patrick's original proposal LGTM :).  However, until now I have been under the
>> impression of LGTM with special emphasis on the TM part. That said, I will be
>> okay/happy (or responsible) for the patch if it goes in.
>>
>> Prashant Sharma
>>
>>
>>
>> On Sun, Jan 18, 2015 at 2:33 PM, Reynold Xin  wrote:
>>>
>>> Maybe just to avoid LGTM as a single token when it is not actually
>>> according to Patrick's definition, but anybody can still leave comments
>>> like:
>>>
>>> "The direction of the PR looks good to me." or "+1 on the direction"
>>>
>>> "The build part looks good to me"
>>>
>>> ...
>>>
>>>
>>> On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout 
>>> wrote:
>>>
>>> > +1 to Patrick's proposal of strong LGTM semantics.  On past projects,
>>> > I've
>>> > heard the semantics of "LGTM" expressed as "I've looked at this
>>> > thoroughly
>>> > and take as much ownership as if I wrote the patch myself".  My
>>> > understanding is that this is the level of review we expect for all
>>> > patches
>>> > that ultimately go into Spark, so it's important to have a way to
>>> > concisely
>>> > describe when this has been done.
>>> >
>>> > Aaron / Sandy, when have you found the weaker LGTM to be useful?  In the
>>> > cases I've seen, if someone else says "I looked at this very quickly and
>>> > didn't see any glaring problems", it doesn't add any value for
>>> > subsequent
>>> > reviewers (someone still needs to take a thorough look).
>>> >
>>> > -Kay
>>> >
>>> > On Sat, Jan 17, 2015 at 8:04 PM,  wrote:
>>> >
>>> > > Yeah, the ASF +1 has become partly overloaded to mean both "I would
>>> > > like
>>> > > to see this feature" and "this patch should be committed", although,
>>> > > at
>>> > > least in Hadoop, using +1 on JIRA (as opposed to, say, in a release
>>> > > vote)
>>> > > should unambiguously mean the latter unless qualified in some other
>>> > > way.
>>> > >
>>> > > I don't have any opinion on the specific characters, but I agree with
>>> > > Aaron that it would be nice to have some sort of abbreviation for both
>>> > the
>>> > > strong and weak forms of approval.
>>> > >
>>> > > -Sandy
>>> > >
>>> > > > On Jan 17, 2015, at 7:25 PM, Patrick Wendell 
>>> > wrote:
>>> > > >
>>> > > > I think the ASF +1 is *slightly* different than Google's LGTM,
>>> > > > because
>>> > > > it might convey wanting the patch/feature to be merged but not
>>> > > > necessarily saying you did a thorough review and stand behind

Re: Semantics of LGTM

2015-01-19 Thread Patrick Wendell
Okay - so given all this I was going to put the following on the wiki
tentatively:

## Reviewing Code
Community code review is Spark's fundamental quality assurance
process. When reviewing a patch, your goal should be to help
streamline the committing process by giving committers confidence this
patch has been verified by an additional party. It's encouraged to
(politely) submit technical feedback to the author to identify areas
for improvement or potential bugs.

If you feel a patch is ready for inclusion in Spark, indicate this to
committers with a comment: "I think this patch looks good". Spark uses
the LGTM convention for indicating the highest level of technical
sign-off on a patch: simply comment with the word "LGTM". An LGTM is a
strong statement; it should be interpreted as follows: "I've
looked at this thoroughly and take as much ownership as if I wrote the
patch myself". If you comment LGTM you will be expected to help with
bugs or follow-up issues on the patch. Judicious use of LGTM's is a
great way to gain credibility as a reviewer with the broader
community.

It's also welcome for reviewers to argue against the inclusion of a
feature or patch. Simply indicate this in the comments.

- Patrick

On Mon, Jan 19, 2015 at 2:40 AM, Prashant Sharma  wrote:
> Patrick's original proposal LGTM :).  However, until now I have been under the
> impression of LGTM with special emphasis on the TM part. That said, I will be
> okay/happy (or responsible) for the patch if it goes in.
>
> Prashant Sharma
>
>
>
> On Sun, Jan 18, 2015 at 2:33 PM, Reynold Xin  wrote:
>>
>> Maybe just to avoid LGTM as a single token when it is not actually
>> according to Patrick's definition, but anybody can still leave comments
>> like:
>>
>> "The direction of the PR looks good to me." or "+1 on the direction"
>>
>> "The build part looks good to me"
>>
>> ...
>>
>>
>> On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout 
>> wrote:
>>
>> > +1 to Patrick's proposal of strong LGTM semantics.  On past projects,
>> > I've
>> > heard the semantics of "LGTM" expressed as "I've looked at this
>> > thoroughly
>> > and take as much ownership as if I wrote the patch myself".  My
>> > understanding is that this is the level of review we expect for all
>> > patches
>> > that ultimately go into Spark, so it's important to have a way to
>> > concisely
>> > describe when this has been done.
>> >
>> > Aaron / Sandy, when have you found the weaker LGTM to be useful?  In the
>> > cases I've seen, if someone else says "I looked at this very quickly and
>> > didn't see any glaring problems", it doesn't add any value for
>> > subsequent
>> > reviewers (someone still needs to take a thorough look).
>> >
>> > -Kay
>> >
>> > On Sat, Jan 17, 2015 at 8:04 PM,  wrote:
>> >
>> > > Yeah, the ASF +1 has become partly overloaded to mean both "I would
>> > > like
>> > > to see this feature" and "this patch should be committed", although,
>> > > at
>> > > least in Hadoop, using +1 on JIRA (as opposed to, say, in a release
>> > > vote)
>> > > should unambiguously mean the latter unless qualified in some other
>> > > way.
>> > >
>> > > I don't have any opinion on the specific characters, but I agree with
>> > > Aaron that it would be nice to have some sort of abbreviation for both
>> > the
>> > > strong and weak forms of approval.
>> > >
>> > > -Sandy
>> > >
>> > > > On Jan 17, 2015, at 7:25 PM, Patrick Wendell 
>> > wrote:
>> > > >
>> > > > I think the ASF +1 is *slightly* different than Google's LGTM,
>> > > > because
>> > > > it might convey wanting the patch/feature to be merged but not
>> > > > necessarily saying you did a thorough review and stand behind it's
>> > > > technical contents. For instance, I've seen people pile on +1's to
>> > > > try
>> > > > and indicate support for a feature or patch in some projects, even
>> > > > though they didn't do a thorough technical review. This +1 is
>> > > > definitely a useful mechanism.
>> > > >
>> > > > There is definitely much overlap though in the meaning, though, and
>> > > > it's largely because Spark

[jira] [Resolved] (SPARK-5088) Use spark-class for running executors directly on mesos

2015-01-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5088.

Resolution: Fixed
  Assignee: Jongyoul Lee

> Use spark-class for running executors directly on mesos
> ---
>
> Key: SPARK-5088
> URL: https://issues.apache.org/jira/browse/SPARK-5088
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Mesos
>Affects Versions: 1.2.0
>Reporter: Jongyoul Lee
>Assignee: Jongyoul Lee
>Priority: Minor
> Fix For: 1.3.0
>
>
> - sbin/spark-executor is only used for running executors in a Mesos environment.
> - spark-executor internally calls spark-class without any specific parameters.
> - PYTHONPATH handling is moved into spark-class.
> - Remove a redundant file to simplify code maintenance.






[jira] [Resolved] (SPARK-4417) New API: sample RDD to fixed number of items

2015-01-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4417.

Resolution: Won't Fix
  Assignee: Ilya Ganelin

[~ilganeli] ended up taking a crack at this, but we decided not to include the 
feature based on follow-up discussion in the PR.

> New API: sample RDD to fixed number of items
> 
>
> Key: SPARK-4417
> URL: https://issues.apache.org/jira/browse/SPARK-4417
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Reporter: Davies Liu
>Assignee: Ilya Ganelin
>
> Sometimes we just want a fixed number of items randomly selected from an 
> RDD; for example, before sorting an RDD we need to gather a fixed number of keys 
> from each partition.
> In order to do this, we currently need two passes over the RDD: get the total count, 
> then calculate the right ratio for sampling. In fact, we could do this in one 
> pass.
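
For reference, RDD.takeSample already returns a fixed number of randomly selected items, at the cost of the extra counting pass described above; a usage sketch, assuming an existing SparkContext named sc:

{code}
// Sketch: draw exactly 100 random elements from an RDD in a single call.
val hundred: Array[Int] =
  sc.parallelize(1 to 1000000).takeSample(withReplacement = false, num = 100)
{code}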






[jira] [Resolved] (SPARK-2595) The driver runs garbage collection when the executor throws an OutOfMemoryError exception

2015-01-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2595.

Resolution: Won't Fix

Per PR comment, closing this for now.

> The driver runs garbage collection when the executor throws an OutOfMemoryError 
> exception
> --
>
> Key: SPARK-2595
> URL: https://issues.apache.org/jira/browse/SPARK-2595
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Guoqiang Li
>
> The [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] implementation of 
> GC-based cleaning only considers the memory usage of the driver. We should 
> consider more factors to trigger GC, e.g. executor exit code, task exceptions, 
> task GC time.






[jira] [Resolved] (SPARK-3758) Script style checking

2015-01-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3758.

Resolution: Won't Fix

This patch ended up being so large, I think we're gonna pass on it.

> Script style checking
> -
>
> Key: SPARK-3758
> URL: https://issues.apache.org/jira/browse/SPARK-3758
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 1.2.0
>Reporter: Kousuke Saruta
>
> There is no way to check the style of scripts.






[jira] [Resolved] (SPARK-3288) All fields in TaskMetrics should be private and use getters/setters

2015-01-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3288.

  Resolution: Fixed
   Fix Version/s: 1.3.0
Target Version/s: 1.3.0  (was: 1.2.0)

> All fields in TaskMetrics should be private and use getters/setters
> ---
>
> Key: SPARK-3288
> URL: https://issues.apache.org/jira/browse/SPARK-3288
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>    Reporter: Patrick Wendell
>Assignee: Dale Richardson
>  Labels: starter
> Fix For: 1.3.0
>
>
> This is particularly bad because we expose this as a developer API. 
> Technically a library could create a TaskMetrics object and then change the 
> values inside of it and pass it onto someone else. It can be written pretty 
> compactly like below:
> {code}
>   /**
>    * Number of bytes written for the shuffle by this task
>    */
>   @volatile private var _shuffleBytesWritten: Long = _
>   def incrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten += value
>   def decrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten -= value
>   def shuffleBytesWritten = _shuffleBytesWritten
> {code}






[jira] [Updated] (SPARK-3288) All fields in TaskMetrics should be private and use getters/setters

2015-01-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3288:
---
Assignee: Ilya Ganelin  (was: Dale Richardson)

> All fields in TaskMetrics should be private and use getters/setters
> ---
>
> Key: SPARK-3288
> URL: https://issues.apache.org/jira/browse/SPARK-3288
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>    Reporter: Patrick Wendell
>Assignee: Ilya Ganelin
>  Labels: starter
> Fix For: 1.3.0
>
>
> This is particularly bad because we expose this as a developer API. 
> Technically a library could create a TaskMetrics object and then change the 
> values inside of it and pass it onto someone else. It can be written pretty 
> compactly like below:
> {code}
>   /**
>    * Number of bytes written for the shuffle by this task
>    */
>   @volatile private var _shuffleBytesWritten: Long = _
>   def incrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten += value
>   def decrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten -= value
>   def shuffleBytesWritten = _shuffleBytesWritten
> {code}






[jira] [Resolved] (SPARK-5217) Spark UI should report pending stages during job execution on AllStagesPage.

2015-01-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5217.

   Resolution: Fixed
Fix Version/s: 1.3.0

> Spark UI should report pending stages during job execution on AllStagesPage.
> 
>
> Key: SPARK-5217
> URL: https://issues.apache.org/jira/browse/SPARK-5217
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Web UI
>Affects Versions: 1.3.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
> Fix For: 1.3.0
>
> Attachments: pending_stages.png
>
>
> This is a first step. 
> The Spark listener already reports all the stages at the time of job submission, 
> of which we only show active, failed, and completed. This addition has no 
> overhead and seems straightforward to achieve.






[jira] [Updated] (SPARK-5249) In SparkConf accept value with Any type and perform string conversion

2015-01-18 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5249:
---
Summary: In SparkConf accept value with Any type and perform string 
conversion  (was: Added setX functions to set a Boolean, Int, Float and Double 
parameters with a "specialized" function.)

> In SparkConf accept value with Any type and perform string conversion
> -
>
> Key: SPARK-5249
> URL: https://issues.apache.org/jira/browse/SPARK-5249
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Adam Gutglick
>Priority: Trivial
>







[jira] [Commented] (SPARK-5249) In SparkConf accept value with Any type and perform string conversion

2015-01-18 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282062#comment-14282062
 ] 

Patrick Wendell commented on SPARK-5249:


I also updated the title to reflect what the proposed patch actually did.

> In SparkConf accept value with Any type and perform string conversion
> -
>
> Key: SPARK-5249
> URL: https://issues.apache.org/jira/browse/SPARK-5249
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Adam Gutglick
>Priority: Trivial
>







[jira] [Resolved] (SPARK-5249) Added setX functions to set a Boolean, Int, Float and Double parameters with a "specialized" function.

2015-01-18 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5249.

Resolution: Won't Fix

Per discussion on the issue we've decided to just ask users to convert to 
strings on their own.
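
Since SparkConf.set takes string values, the conversion stays on the caller's side; a minimal sketch:

{code}
import org.apache.spark.SparkConf

// Sketch: convert typed values to strings explicitly when setting configuration.
val conf = new SparkConf()
  .set("spark.speculation", true.toString)
  .set("spark.task.cpus", 2.toString)
{code}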

> Added setX functions to set a Boolean, Int, Float and Double parameters with 
> a "specialized" function.
> --
>
> Key: SPARK-5249
> URL: https://issues.apache.org/jira/browse/SPARK-5249
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Adam Gutglick
>Priority: Trivial
>







[jira] [Resolved] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag

2015-01-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3694.

Resolution: Duplicate

> Allow printing object graph of tasks/RDD's with a debug flag
> 
>
> Key: SPARK-3694
> URL: https://issues.apache.org/jira/browse/SPARK-3694
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>    Reporter: Patrick Wendell
>Assignee: Ilya Ganelin
>  Labels: starter
>
> This would be useful for debugging extra references inside of RDD's
> Here is an example for inspiration:
> http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html
> We'd want to print this trace for both the RDD serialization inside of the 
> DAGScheduler and the task serialization in the TaskSetManager.






Re: Semantics of LGTM

2015-01-17 Thread Patrick Wendell
I think the ASF +1 is *slightly* different than Google's LGTM, because
it might convey wanting the patch/feature to be merged but not
necessarily saying you did a thorough review and stand behind its
technical contents. For instance, I've seen people pile on +1's to try
and indicate support for a feature or patch in some projects, even
though they didn't do a thorough technical review. This +1 is
definitely a useful mechanism.

There is definitely much overlap in the meaning, though, and
it's largely because Spark had its own culture around reviews before
it was donated to the ASF, so there is a mix of the two styles.

Nonetheless, I'd prefer to stick with the stronger LGTM semantics I
proposed originally (unlike the one Sandy proposed, e.g.). This is
what I've seen every project using the LGTM convention do (Google, and
some open source projects such as Impala) to indicate technical
sign-off.

- Patrick

On Sat, Jan 17, 2015 at 7:09 PM, Aaron Davidson  wrote:
> I think I've seen something like +2 = "strong LGTM" and +1 = "weak LGTM;
> someone else should review" before. It's nice to have a shortcut which isn't
> a sentence when talking about weaker forms of LGTM.
>
> On Sat, Jan 17, 2015 at 6:59 PM,  wrote:
>>
>> I think clarifying these semantics is definitely worthwhile. Maybe this
>> complicates the process with additional terminology, but the way I've used
>> these has been:
>>
>> +1 - I think this is safe to merge and, barring objections from others,
>> would merge it immediately.
>>
>> LGTM - I have no concerns about this patch, but I don't necessarily feel
>> qualified to make a final call about it.  The TM part acknowledges the
>> judgment as a little more subjective.
>>
>> I think having some concise way to express both of these is useful.
>>
>> -Sandy
>>
>> > On Jan 17, 2015, at 5:40 PM, Patrick Wendell  wrote:
>> >
>> > Hey All,
>> >
>> > Just wanted to ping about a minor issue - but one that ends up having
>> > consequence given Spark's volume of reviews and commits. As much as
>> > possible, I think that we should try and gear towards "Google Style"
>> > LGTM on reviews. What I mean by this is that LGTM has the following
>> > semantics:
>> >
>> > "I know this code well, or I've looked at it close enough to feel
>> > confident it should be merged. If there are issues/bugs with this code
>> > later on, I feel confident I can help with them."
>> >
>> > Here is an alternative semantic:
>> >
>> > "Based on what I know about this part of the code, I don't see any
>> > show-stopper problems with this patch".
>> >
>> > The issue with the latter is that it ultimately erodes the
>> > significance of LGTM, since subsequent reviewers need to reason about
>> > what the person meant by saying LGTM. In contrast, having strong
>> > semantics around LGTM can help streamline reviews a lot, especially as
>> > reviewers get more experienced and gain trust from the comittership.
>> >
>> > There are several easy ways to give a more limited endorsement of a
>> > patch:
>> > - "I'm not familiar with this code, but style, etc look good" (general
>> > endorsement)
>> > - "The build changes in this code LGTM, but I haven't reviewed the
>> > rest" (limited LGTM)
>> >
>> > If people are okay with this, I might add a short note on the wiki.
>> > I'm sending this e-mail first, though, to see whether anyone wants to
>> > express agreement or disagreement with this approach.
>> >
>> > - Patrick
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>




Semantics of LGTM

2015-01-17 Thread Patrick Wendell
Hey All,

Just wanted to ping about a minor issue - but one that ends up having
consequence given Spark's volume of reviews and commits. As much as
possible, I think that we should try and gear towards "Google Style"
LGTM on reviews. What I mean by this is that LGTM has the following
semantics:

"I know this code well, or I've looked at it close enough to feel
confident it should be merged. If there are issues/bugs with this code
later on, I feel confident I can help with them."

Here is an alternative semantic:

"Based on what I know about this part of the code, I don't see any
show-stopper problems with this patch".

The issue with the latter is that it ultimately erodes the
significance of LGTM, since subsequent reviewers need to reason about
what the person meant by saying LGTM. In contrast, having strong
semantics around LGTM can help streamline reviews a lot, especially as
reviewers get more experienced and gain trust from the committership.

There are several easy ways to give a more limited endorsement of a patch:
- "I'm not familiar with this code, but style, etc look good" (general
endorsement)
- "The build changes in this code LGTM, but I haven't reviewed the
rest" (limited LGTM)

If people are okay with this, I might add a short note on the wiki.
I'm sending this e-mail first, though, to see whether anyone wants to
express agreement or disagreement with this approach.

- Patrick




[jira] [Resolved] (SPARK-5289) Backport publishing of repl, yarn into branch-1.2

2015-01-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5289.

Resolution: Fixed

> Backport publishing of repl, yarn into branch-1.2
> -
>
> Key: SPARK-5289
> URL: https://issues.apache.org/jira/browse/SPARK-5289
> Project: Spark
>  Issue Type: Improvement
>    Reporter: Patrick Wendell
>        Assignee: Patrick Wendell
>Priority: Blocker
>
> In SPARK-3452 we did some clean-up of published artifacts that turned out to 
> adversely affect some users. This has been mostly patched up in master via 
> SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn 
> modules, they were fixed in SPARK-4048 as part of a larger change that only 
> went into master.
> Those pieces should be backported to Spark 1.2 to allow publishing in a 1.2.1 
> release.






[jira] [Resolved] (SPARK-5096) SparkBuild.scala assumes you are at the spark root dir

2015-01-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5096.

   Resolution: Fixed
Fix Version/s: 1.3.0

> SparkBuild.scala assumes you are at the spark root dir
> --
>
> Key: SPARK-5096
> URL: https://issues.apache.org/jira/browse/SPARK-5096
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 1.3.0
>
>
> This is bad because it breaks compiling spark as an external project ref and 
> is generally bad SBT practice.
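
The usual remedy, shown here only as an illustrative build fragment (not the actual Spark change), is to resolve paths against the project's base directory instead of the process working directory:

{code}
// build.sbt sketch: anchor paths on baseDirectory.value rather than file("..."),
// which silently assumes sbt was launched from the Spark root directory.
unmanagedResourceDirectories in Compile += baseDirectory.value / "conf"
{code}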






[jira] [Updated] (SPARK-5096) SparkBuild.scala assumes you are at the spark root dir

2015-01-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5096:
---
Target Version/s:   (was: 1.0.3)

> SparkBuild.scala assumes you are at the spark root dir
> --
>
> Key: SPARK-5096
> URL: https://issues.apache.org/jira/browse/SPARK-5096
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 1.3.0
>
>
> This is bad because it breaks compiling spark as an external project ref and 
> is generally bad SBT practice.






Re: Bouncing Mails

2015-01-17 Thread Patrick Wendell
Akhil,

Those are handled by ASF infrastructure, not anyone in the Spark
project. So this list is not the appropriate place to ask for help.

- Patrick

On Sat, Jan 17, 2015 at 12:56 AM, Akhil Das  wrote:
> My mails to the mailing list are getting rejected, have opened a Jira issue,
> can someone take a look at it?
>
> https://issues.apache.org/jira/browse/INFRA-9032
>
>
>
>
>
>
> Thanks
> Best Regards






[jira] [Updated] (SPARK-5289) Backport publishing of repl, yarn into branch-1.2

2015-01-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5289:
---
Description: 
In SPARK-3452 we did some clean-up of published artifacts that turned out to 
adversely affect some users. This has been mostly patched up in master via 
SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn 
modules, they were fixed in SPARK-4048 as part of a larger change that only 
went into master.

Those pieces should be backported to Spark 1.2 to allow publishing in a 1.2.1 
release.

  was:
In SPARK-3452 we did some clean-up of published artifacts that turned out to 
adversely affect some users. This has been mostly patched up in master via 
SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn 
modules, they were fixed in SPARK-4048 as part of a larger change that only 
went into master.

Those pieces should be backported.


> Backport publishing of repl, yarn into branch-1.2
> -
>
> Key: SPARK-5289
> URL: https://issues.apache.org/jira/browse/SPARK-5289
> Project: Spark
>  Issue Type: Improvement
>    Reporter: Patrick Wendell
>        Assignee: Patrick Wendell
>Priority: Blocker
>
> In SPARK-3452 we did some clean-up of published artifacts that turned out to 
> adversely affect some users. This has been mostly patched up in master via 
> SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn 
> modules, they were fixed in SPARK-4048 as part of a larger change that only 
> went into master.
> Those pieces should be backported to Spark 1.2 to allow publishing in a 1.2.1 
> release.






[jira] [Updated] (SPARK-5289) Backport publishing of repl, yarn into branch-1.2

2015-01-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5289:
---
Description: In SPARK-3452 we did some clean-up of published artifacts that 
turned out to adversely affect some users. This has been mostly patched up in 
master via SPARK-4925 (hive-thriftserver) which was backported. For the repl and 
yarn modules, they were fixed in SPARK-4048 as part of a larger change that 
only went into master.  (was: In SPARK-3452 we did some clean-up of published 
artifacts that turned out to adversely affect some users. This has been mostly 
patched up in master via SPARK-4925 (hive-thriftserver), SPARK-4048 (which 
inadvertently did this for yarn and repl). But we should go in branch 1.2 and 
fix this as well so that we can do a 1.2.1 release with these artifacts.)

> Backport publishing of repl, yarn into branch-1.2
> -
>
> Key: SPARK-5289
> URL: https://issues.apache.org/jira/browse/SPARK-5289
> Project: Spark
>  Issue Type: Improvement
>    Reporter: Patrick Wendell
>        Assignee: Patrick Wendell
>Priority: Blocker
>
> In SPARK-3452 we did some clean-up of published artifacts that turned out to 
> adversely affect some users. This has been mostly patched up in master via 
> SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn 
> modules, they were fixed in SPARK-4048 as part of a larger change that only 
> went into master.






[jira] [Updated] (SPARK-5289) Backport publishing of repl, yarn into branch-1.2

2015-01-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5289:
---
Summary: Backport publishing of repl, yarn into branch-1.2  (was: Backport 
publishing of repl, yarn, and hive-thriftserver into branch-1.2)

> Backport publishing of repl, yarn into branch-1.2
> -
>
> Key: SPARK-5289
> URL: https://issues.apache.org/jira/browse/SPARK-5289
> Project: Spark
>  Issue Type: Improvement
>    Reporter: Patrick Wendell
>        Assignee: Patrick Wendell
>Priority: Blocker
>
> In SPARK-3452 we did some clean-up of published artifacts that turned out to 
> adversely affect some users. This has been mostly patched up in master via 
> SPARK-4925 (hive-thriftserver), SPARK-4048 (which inadvertently did this for 
> yarn and repl). But we should go in branch 1.2 and fix this as well so that 
> we can do a 1.2.1 release with these artifacts.






[jira] [Updated] (SPARK-5289) Backport publishing of repl, yarn into branch-1.2

2015-01-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5289:
---
Description: 
In SPARK-3452 we did some clean-up of published artifacts that turned out to 
adversely affect some users. This has been mostly patched up in master via 
SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn 
modules, they were fixed in SPARK-4048 as part of a larger change that only 
went into master.

Those pieces should be backported.

  was:In SPARK-3452 we did some clean-up of published artifacts that turned out 
to adversely affect some users. This has been mostly patched up in master via 
SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn 
modules, they were fixed in SPARK-4048 as part of a larger change that only 
went into master.


> Backport publishing of repl, yarn into branch-1.2
> -
>
> Key: SPARK-5289
> URL: https://issues.apache.org/jira/browse/SPARK-5289
> Project: Spark
>  Issue Type: Improvement
>    Reporter: Patrick Wendell
>        Assignee: Patrick Wendell
>Priority: Blocker
>
> In SPARK-3452 we did some clean-up of published artifacts that turned out to 
> adversely affect some users. This has been mostly patched up in master via 
> SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn 
> modules, they were fixed in SPARK-4048 as part of a larger change that only 
> went into master.
> Those pieces should be backported.






[jira] [Created] (SPARK-5289) Backport publishing of repl, yarn, and hive-thriftserver into branch-1.2

2015-01-16 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-5289:
--

 Summary: Backport publishing of repl, yarn, and hive-thriftserver 
into branch-1.2
 Key: SPARK-5289
 URL: https://issues.apache.org/jira/browse/SPARK-5289
 Project: Spark
  Issue Type: Improvement
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker


In SPARK-3452 we did some clean-up of published artifacts that turned out to 
adversely affect some users. This has been mostly patched up in master via 
SPARK-4925 (hive-thriftserver), SPARK-4048 (which inadvertently did this for 
yarn and repl). But we should go in branch 1.2 and fix this as well so that we 
can do a 1.2.1 release with these artifacts.






[jira] [Updated] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class

2015-01-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5260:
---
Fix Version/s: (was: 1.3.0)

> Expose JsonRDD.allKeysWithValueTypes() in a utility class 
> --
>
> Key: SPARK-5260
> URL: https://issues.apache.org/jira/browse/SPARK-5260
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Corey J. Nolet
>
> I have found this method extremely useful when implementing my own strategy 
> for inferring a schema from parsed JSON. For now, I've actually copied the 
> method right out of the JsonRDD class into my own project, but I think it 
> would be immensely useful to keep the code in Spark and expose it publicly 
> somewhere else, like an object called JsonSchema.
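
For illustration only (this is not the JsonRDD code), the idea behind allKeysWithValueTypes can be sketched as collecting every key path in a parsed record together with the runtime type of its value:

{code}
// Sketch: flatten a parsed JSON record (nested Maps) into (keyPath, valueType) pairs.
def keysWithValueTypes(record: Map[String, Any], prefix: String = ""): Set[(String, String)] =
  record.flatMap {
    case (k, v: Map[_, _]) =>
      keysWithValueTypes(v.asInstanceOf[Map[String, Any]], s"$prefix$k.")
    case (k, v) =>
      Set(s"$prefix$k" -> Option(v).map(_.getClass.getSimpleName).getOrElse("null"))
  }.toSet
{code}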






[jira] [Updated] (SPARK-5270) Elegantly check if RDD is empty

2015-01-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5270:
---
Target Version/s: 1.3.0

> Elegantly check if RDD is empty
> ---
>
> Key: SPARK-5270
> URL: https://issues.apache.org/jira/browse/SPARK-5270
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.2.0
> Environment: Centos 6
>Reporter: Al M
>Priority: Trivial
>
> Right now there is no clean way to check if an RDD is empty.  As discussed 
> here: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679
> I'd like a method rdd.isEmpty that returns a boolean.
> This would be especially useful when using streams.  Sometimes my batches are 
> huge in one stream, sometimes I get nothing for hours.  Still I have to run 
> count() to check if there is anything in the RDD.  I can process my empty RDD 
> like the others but it would be more efficient to just skip the empty ones.
> I can also run first() and catch the exception; this is neither a clean nor 
> fast solution.






[jira] [Resolved] (SPARK-4357) Modify release publishing to work with Scala 2.11

2015-01-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4357.

Resolution: Fixed

Sorry this is actually working now. We now publish artifacts for Scala 2.11. It 
was fixed a while back.

> Modify release publishing to work with Scala 2.11
> -
>
> Key: SPARK-4357
> URL: https://issues.apache.org/jira/browse/SPARK-4357
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>        Reporter: Patrick Wendell
>    Assignee: Patrick Wendell
>
> We'll need to put in some effort to make our publishing work with 2.11, since the 
> current pipeline assumes a single set of artifacts is published.






[jira] [Comment Edited] (SPARK-5176) Thrift server fails with confusing error message when deploy-mode is cluster

2015-01-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279869#comment-14279869
 ] 

Patrick Wendell edited comment on SPARK-5176 at 1/16/15 6:28 AM:
-

Yes, we should add a check here similar to the existing ones for the 
thriftserver class:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L143

[~tpanning] are you interested in contributing this? If not, someone else will 
pick it up.


was (Author: pwendell):
Yes, we should add a check here similar to the existing ones for the 
thriftserver class:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L143

> Thrift server fails with confusing error message when deploy-mode is cluster
> 
>
> Key: SPARK-5176
> URL: https://issues.apache.org/jira/browse/SPARK-5176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Tom Panning
>  Labels: starter
>
> With Spark 1.2.0, when I try to run
> {noformat}
> $SPARK_HOME/sbin/start-thriftserver.sh --deploy-mode cluster --master 
> spark://xd-spark.xdata.data-tactics-corp.com:7077
> {noformat}
> The log output is
> {noformat}
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> Spark Command: /usr/java/latest/bin/java -cp 
> ::/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar
>  -XX:MaxPermSize=128m -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit 
> --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 
> --deploy-mode cluster --master 
> spark://xd-spark.xdata.data-tactics-corp.com:7077 spark-internal
> 
> Jar url 'spark-internal' is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://host:port/XX.jar, 
> file:///XX.jar)
> Usage: DriverClient [options] launch
> [driver options]
> Usage: DriverClient kill  
> Options:
>-c CORES, --cores CORESNumber of cores to request (default: 1)
>-m MEMORY, --memory MEMORY Megabytes of memory to request (default: 
> 512)
>-s, --superviseWhether to restart the driver on failure
>-v, --verbose  Print more debugging output
>  
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> {noformat}
> I do not get this error if deploy-mode is set to client. The --deploy-mode 
> option is described by the --help output, so I expected it to work. I 
> checked, and this behavior seems to be present in Spark 1.1.0 as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5176) Thrift server fails with confusing error message when deploy-mode is cluster

2015-01-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279869#comment-14279869
 ] 

Patrick Wendell commented on SPARK-5176:


Yes, we should add a check here similar to the existing ones for the 
thriftserver class:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L143
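
For readers following along, here is a hedged sketch of the kind of guard being suggested. 
The method and argument names below are illustrative only and do not reflect the actual 
SparkSubmit code:

{code}
// Illustrative only: fail fast with a clear message instead of handing
// "spark-internal" to the standalone DriverClient.
def validateThriftServerArgs(mainClass: String, deployMode: String): Unit = {
  val isThriftServer =
    mainClass == "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2"
  if (isThriftServer && deployMode == "cluster") {
    sys.error("Cluster deploy mode is not supported for the Thrift server.")
  }
}
{code}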

> Thrift server fails with confusing error message when deploy-mode is cluster
> 
>
> Key: SPARK-5176
> URL: https://issues.apache.org/jira/browse/SPARK-5176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Tom Panning
>  Labels: starter
>
> With Spark 1.2.0, when I try to run
> {noformat}
> $SPARK_HOME/sbin/start-thriftserver.sh --deploy-mode cluster --master 
> spark://xd-spark.xdata.data-tactics-corp.com:7077
> {noformat}
> The log output is
> {noformat}
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> Spark Command: /usr/java/latest/bin/java -cp 
> ::/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar
>  -XX:MaxPermSize=128m -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit 
> --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 
> --deploy-mode cluster --master 
> spark://xd-spark.xdata.data-tactics-corp.com:7077 spark-internal
> 
> Jar url 'spark-internal' is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://host:port/XX.jar, 
> file:///XX.jar)
> Usage: DriverClient [options] launch
> [driver options]
> Usage: DriverClient kill  
> Options:
>-c CORES, --cores CORESNumber of cores to request (default: 1)
>-m MEMORY, --memory MEMORY Megabytes of memory to request (default: 
> 512)
>-s, --superviseWhether to restart the driver on failure
>-v, --verbose  Print more debugging output
>  
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> {noformat}
> I do not get this error if deploy-mode is set to client. The --deploy-mode 
> option is described by the --help output, so I expected it to work. I 
> checked, and this behavior seems to be present in Spark 1.1.0 as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5176) Thrift server fails with confusing error message when deploy-mode is cluster

2015-01-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5176:
---
Labels: starter  (was: )

> Thrift server fails with confusing error message when deploy-mode is cluster
> 
>
> Key: SPARK-5176
> URL: https://issues.apache.org/jira/browse/SPARK-5176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Tom Panning
>  Labels: starter
>
> With Spark 1.2.0, when I try to run
> {noformat}
> $SPARK_HOME/sbin/start-thriftserver.sh --deploy-mode cluster --master 
> spark://xd-spark.xdata.data-tactics-corp.com:7077
> {noformat}
> The log output is
> {noformat}
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> Spark Command: /usr/java/latest/bin/java -cp 
> ::/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar
>  -XX:MaxPermSize=128m -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit 
> --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 
> --deploy-mode cluster --master 
> spark://xd-spark.xdata.data-tactics-corp.com:7077 spark-internal
> 
> Jar url 'spark-internal' is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://host:port/XX.jar, 
> file:///XX.jar)
> Usage: DriverClient [options] launch
> [driver options]
> Usage: DriverClient kill  
> Options:
>-c CORES, --cores CORESNumber of cores to request (default: 1)
>-m MEMORY, --memory MEMORY Megabytes of memory to request (default: 
> 512)
>-s, --superviseWhether to restart the driver on failure
>-v, --verbose  Print more debugging output
>  
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> {noformat}
> I do not get this error if deploy-mode is set to client. The --deploy-mode 
> option is described by the --help output, so I expected it to work. I 
> checked, and this behavior seems to be present in Spark 1.1.0 as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5216) Spark Ui should report estimated time remaining for each stage.

2015-01-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279863#comment-14279863
 ] 

Patrick Wendell commented on SPARK-5216:


This has been proposed before, but in the past we decided not to do it. Trying 
to extrapolate the finish time of a stage accurately is basically impossible 
since in many workloads stragglers dominate the total response time. The 
conclusion was that it was better to give no estimate rather than one which is 
likely to be misleading. 

> Spark Ui should report estimated time remaining for each stage.
> ---
>
> Key: SPARK-5216
> URL: https://issues.apache.org/jira/browse/SPARK-5216
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, Web UI
>Affects Versions: 1.3.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>
> Per-stage feedback on estimated remaining time can help the user get a grasp on 
> how much time the job is going to take. This will only require changes on the 
> UI/JobProgressListener side of the code, since we already have most of the 
> information needed. 
> In the initial cut, the plan is to estimate time based on statistics of the 
> running job, i.e. the average time taken by each task and the number of tasks 
> per stage. This will make sense when jobs are long. If that works out, more 
> heuristics can be added, like the projected time saved if the RDD is cached, 
> and so on. 
> More precise details will come as this evolves. In the meantime, thoughts on 
> alternate ways and suggestions on usefulness are welcome.
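
A hedged sketch of the naive per-stage estimate described above (illustrative only, not 
the proposed listener/UI code):

{code}
// Estimate remaining stage time from the average duration of completed tasks.
// Returns None until at least one task has finished; stragglers will of course
// make this optimistic, which is the concern raised in the comment above.
def estimateRemainingMs(completedDurationsMs: Seq[Long], totalTasks: Int): Option[Long] = {
  val done = completedDurationsMs.size
  if (done == 0) None
  else {
    val avgMs = completedDurationsMs.sum.toDouble / done
    Some((avgMs * (totalTasks - done)).toLong)
  }
}
{code}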



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4955) Dynamic allocation doesn't work in YARN cluster mode

2015-01-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4955:
---
Target Version/s: 1.3.0

> Dynamic allocation doesn't work in YARN cluster mode
> 
>
> Key: SPARK-4955
> URL: https://issues.apache.org/jira/browse/SPARK-4955
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Chengxiang Li
>Assignee: Lianhui Wang
>Priority: Blocker
>
> With executor dynamic scaling enabled in yarn-cluster mode, after a query 
> finishes and the spark.dynamicAllocation.executorIdleTimeout interval elapses, 
> the executor count is not reduced to the configured minimum.
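
For reference, a minimal sketch of the kind of configuration under which this is observed. 
The property names are the documented dynamic-allocation settings; the concrete values are 
assumptions for the example:

{code}
import org.apache.spark.SparkConf

// Dynamic executor scaling in yarn-cluster mode; after the idle timeout,
// idle executors should be released down to the configured minimum.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "50")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60")
  .set("spark.shuffle.service.enabled", "true")
{code}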



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4955) Dynamic allocation doesn't work in YARN cluster mode

2015-01-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4955:
---
Priority: Blocker  (was: Critical)

> Dynamic allocation doesn't work in YARN cluster mode
> 
>
> Key: SPARK-4955
> URL: https://issues.apache.org/jira/browse/SPARK-4955
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Chengxiang Li
>Assignee: Lianhui Wang
>Priority: Blocker
>
> With executor dynamic scaling enabled in yarn-cluster mode, after a query 
> finishes and the spark.dynamicAllocation.executorIdleTimeout interval elapses, 
> the executor count is not reduced to the configured minimum.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2630) Input data size of CoalescedRDD is incorrect

2015-01-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2630.

Resolution: Duplicate

I think this is a dup of SPARK-4092.

> Input data size of CoalescedRDD is incorrect
> 
>
> Key: SPARK-2630
> URL: https://issues.apache.org/jira/browse/SPARK-2630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Davies Liu
>Assignee: Andrew Ash
>Priority: Blocker
> Attachments: overflow.tiff
>
>
> Given one big file, such as text.4.3G, processed in a single task: 
> {code}
> sc.textFile("text.4.3.G").coalesce(1).count()
> {code}
> In the Spark Web UI, you will see that the input size is reported as 5.4M. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4092) Input metrics don't work for coalesce()'d RDD's

2015-01-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4092.

   Resolution: Fixed
Fix Version/s: 1.3.0

> Input metrics don't work for coalesce()'d RDD's
> ---
>
> Key: SPARK-4092
> URL: https://issues.apache.org/jira/browse/SPARK-4092
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Kostas Sakellis
>Priority: Critical
> Fix For: 1.3.0
>
>
> In every case where we set input metrics (from both Hadoop and block storage) 
> we currently assume that exactly one input partition is computed within the 
> task. This is not a correct assumption in the general case. The main example 
> in the current API is coalesce(), but user-defined RDDs could also be 
> affected.
> To deal with the most general case, we would need to support the notion of a 
> single task having multiple input sources. A more surgical and less general 
> fix is to simply go to HadoopRDD and check if there are already inputMetrics 
> defined for the task with the same "type". If there are, then merge in the 
> new data rather than blowing away the old one.
> This wouldn't cover the case where, e.g., a single task has input from both 
> on-disk and in-memory blocks. It _would_ cover the case where someone calls 
> coalesce on a HadoopRDD... which is more common.
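
A hedged sketch of the "merge rather than overwrite" idea. The types and fields below are 
simplified stand-ins, not Spark's actual InputMetrics:

{code}
// Simplified stand-in: if metrics with the same read method already exist for
// the task, accumulate into them instead of replacing them.
case class SimpleInputMetrics(readMethod: String, var bytesRead: Long)

def mergeInputMetrics(existing: Option[SimpleInputMetrics],
                      incoming: SimpleInputMetrics): SimpleInputMetrics = existing match {
  case Some(m) if m.readMethod == incoming.readMethod =>
    m.bytesRead += incoming.bytesRead
    m
  case _ => incoming
}
{code}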



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4857) Add Executor Events to SparkListener

2015-01-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4857.

   Resolution: Fixed
Fix Version/s: 1.3.0

> Add Executor Events to SparkListener
> 
>
> Key: SPARK-4857
> URL: https://issues.apache.org/jira/browse/SPARK-4857
> Project: Spark
>  Issue Type: Improvement
>Reporter: Kostas Sakellis
>Assignee: Kostas Sakellis
> Fix For: 1.3.0
>
>
> We need to add events to the SparkListener to indicate that an executor has been 
> added or removed, along with the corresponding information. 
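
A hedged sketch of how a user-defined listener might consume such events once they exist. 
The event and callback names below assume the SparkListenerExecutorAdded / 
SparkListenerExecutorRemoved naming introduced by this change:

{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// Log executor lifecycle changes; register with sc.addSparkListener(new ExecutorEventLogger).
class ExecutorEventLogger extends SparkListener {
  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit =
    println(s"Executor added: ${event.executorId}")
  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit =
    println(s"Executor removed: ${event.executorId}")
}
{code}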



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Accumulator value in Spark UI

2015-01-14 Thread Patrick Wendell
It should appear in the page for any stage in which accumulators are updated.

On Wed, Jan 14, 2015 at 6:46 PM, Justin Yip  wrote:
> Hello,
>
> The accumulator documentation says that if the accumulator is named, it
> will be displayed in the WebUI. However, I cannot find it anywhere.
>
> Do I need to specify anything in the spark ui config?
>
> Thanks.
>
> Justin

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Fwd: [ NOTICE ] Service Downtime Notification - R/W git repos

2015-01-13 Thread Patrick Wendell
FYI our git repo may be down for a few hours today.
-- Forwarded message --
From: "Tony Stevenson" 
Date: Jan 13, 2015 6:49 AM
Subject: [ NOTICE ] Service Downtime Notification - R/W git repos
To:
Cc:

Folks,

Please note that on Thursday 15th at 20:00 UTC the Infrastructure team
will be taking the read/write git repositories offline.  We expect
this migration to last about 4 hours.

During the outage the service will be migrated from an old host to a
new one.   We intend to keep the URL the same for access to the repos
after the migration, but an alternate name is already in place in case
DNS updates take too long.   Please be aware it might take some hours
after the completion of the downtime for github to update and reflect
any changes.

The Infrastructure team have been trialling the new host for about a
week now, and [touch wood] have not had any problems with it.

The service is currently available by accessing repos via:
https://git-wip-us.apache.org

If you have any questions please address them to infrastruct...@apache.org




--
Cheers,
Tony

On behalf of the Apache Infrastructure Team

--
Tony Stevenson

t...@pc-tony.com
pct...@apache.org

http://www.pc-tony.com

GPG - 1024D/51047D66
--


[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl

2015-01-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274263#comment-14274263
 ] 

Patrick Wendell commented on SPARK-4923:


[~senkwich] definitely prefer github.

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
> Attachments: 
> SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment have been discontinued (see 
> SPARK-3452), but it's in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4923) Maven build should keep publishing spark-repl

2015-01-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274239#comment-14274239
 ] 

Patrick Wendell edited comment on SPARK-4923 at 1/12/15 9:58 PM:
-

Hey All,

Sorry this has caused a disruption. As I said in the earlier comment, if anyone 
on these projects can submit a patch that locks down the visibility in that 
package and opens up the things that are specifically needed, I'm fine to keep 
publishing it (and do so retroactively for 1.2). We just need to look closely 
at what we are exposing, because this package currently violates Spark's API 
policy. Because the Scala repl does not itself offer any kind of API stability, 
it will be hard for Spark to do the same. But I think it's fine to just annotate 
and expose unstable APIs here, provided projects understand the implications 
of depending on them.

[~senkwich] - since you guys are probably the heaviest users, would you be 
willing to take a crack at this? Basically, start by making everything private 
and then go and unlock the things that you need as Developer APIs.

- Patrick


was (Author: pwendell):
Hey All,

Sorry this has caused a disruption. As I said in the earlier comment, if anyone 
on these projects can submit a patch that locks down the visibility in that 
package and opens up the things that are specifically needed, I'm fine to keep 
publishing it (and do so retroactively for 1.2). We just need to look closely 
at what we are exposing, because this package currently violates Spark's API 
policy. Because the Scala repl does not itself offer any kind of API stability, 
it will be hard for Spark to do the same. But I think it's fine to just annotate 
and expose unstable APIs here, provided projects understand the implications 
of depending on them.

Chi - since you guys are probably the heaviest users, would you be willing to 
take a crack at this? Basically, start by making everything private and then go 
and unlock the things that you need as Developer APIs.

- Patrick

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
> Attachments: 
> SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment have been discontinued (see 
> SPARK-3452), but it's in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl

2015-01-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274239#comment-14274239
 ] 

Patrick Wendell commented on SPARK-4923:


Hey All,

Sorry this has caused a disruption. As I said in the earlier comment, if anyone 
on these projects can submit a patch that locks down the visibility in that 
package and opens up the things that are specifically needed, I'm fine to keep 
publishing it (and do so retroactively for 1.2). We just need to look closely 
at what we are exposing, because this package currently violates Spark's API 
policy. Because the Scala repl does not itself offer any kind of API stability, 
it will be hard for Spark to do the same. But I think it's fine to just annotate 
and expose unstable APIs here, provided projects understand the implications 
of depending on them.

Chi - since you guys are probably the heaviest users, would you be willing to 
take a crack at this? Basically, start by making everything private and then go 
and unlock the things that you need as Developer APIs.

- Patrick

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
> Attachments: 
> SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment have been discontinued (see 
> SPARK-3452), but it's in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5172) spark-examples-***.jar shades a wrong Hadoop distribution

2015-01-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5172.

   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Sean Owen

> spark-examples-***.jar shades a wrong Hadoop distribution
> -
>
> Key: SPARK-5172
> URL: https://issues.apache.org/jira/browse/SPARK-5172
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Shixiong Zhu
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.3.0
>
>
> Steps to check it:
> 1. Download  "spark-1.2.0-bin-hadoop2.4.tgz" from 
> http://www.apache.org/dyn/closer.cgi/spark/spark-1.2.0/spark-1.2.0-bin-hadoop2.4.tgz
> 2. unzip `spark-examples-1.2.0-hadoop2.4.0.jar`.
> 3. There is a file called `org/apache/hadoop/package-info.class` in the jar. 
> It doesn't exist in hadoop 2.4. 
> 4. Run "javap -classpath . -private -c -v  org.apache.hadoop.package-info"
> {code}
> Compiled from "package-info.java"
> interface org.apache.hadoop.package-info
>   SourceFile: "package-info.java"
>   RuntimeVisibleAnnotations: length = 0x24
>00 01 00 06 00 06 00 07 73 00 08 00 09 73 00 0A
>00 0B 73 00 0C 00 0D 73 00 0E 00 0F 73 00 10 00
>11 73 00 12 
>   minor version: 0
>   major version: 50
>   Constant pool:
> const #1 = Asciz  org/apache/hadoop/package-info;
> const #2 = class  #1; //  "org/apache/hadoop/package-info"
> const #3 = Asciz  java/lang/Object;
> const #4 = class  #3; //  java/lang/Object
> const #5 = Asciz  package-info.java;
> const #6 = Asciz  Lorg/apache/hadoop/HadoopVersionAnnotation;;
> const #7 = Asciz  version;
> const #8 = Asciz  1.2.1;
> const #9 = Asciz  revision;
> const #10 = Asciz 1503152;
> const #11 = Asciz user;
> const #12 = Asciz mattf;
> const #13 = Asciz date;
> const #14 = Asciz Wed Jul 24 13:39:35 PDT 2013;
> const #15 = Asciz url;
> const #16 = Asciz 
> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2;
> const #17 = Asciz srcChecksum;
> const #18 = Asciz 6923c86528809c4e7e6f493b6b413a9a;
> const #19 = Asciz SourceFile;
> const #20 = Asciz RuntimeVisibleAnnotations;
> {
> }
> {code}
> The version is {{1.2.1}}
> This comes from a wrong HBase version setting in the examples project. Here is 
> part of the dependency tree when running "mvn -Pyarn -Phadoop-2.4 
> -Dhadoop.version=2.4.0 -pl examples dependency:tree":
> {noformat}
> [INFO] +- org.apache.hbase:hbase-testing-util:jar:0.98.7-hadoop1:compile
> [INFO] |  +- 
> org.apache.hbase:hbase-common:test-jar:tests:0.98.7-hadoop1:compile
> [INFO] |  +- 
> org.apache.hbase:hbase-server:test-jar:tests:0.98.7-hadoop1:compile
> [INFO] |  |  +- com.sun.jersey:jersey-core:jar:1.8:compile
> [INFO] |  |  +- com.sun.jersey:jersey-json:jar:1.8:compile
> [INFO] |  |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
> [INFO] |  |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
> [INFO] |  |  |  \- org.codehaus.jackson:jackson-xc:jar:1.7.1:compile
> [INFO] |  |  \- com.sun.jersey:jersey-server:jar:1.8:compile
> [INFO] |  | \- asm:asm:jar:3.3.1:test
> [INFO] |  +- org.apache.hbase:hbase-hadoop1-compat:jar:0.98.7-hadoop1:compile
> [INFO] |  +- 
> org.apache.hbase:hbase-hadoop1-compat:test-jar:tests:0.98.7-hadoop1:compile
> [INFO] |  +- org.apache.hadoop:hadoop-core:jar:1.2.1:compile
> [INFO] |  |  +- xmlenc:xmlenc:jar:0.52:compile
> [INFO] |  |  +- commons-configuration:commons-configuration:jar:1.6:compile
> [INFO] |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
> [INFO] |  |  |  |  \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
> [INFO] |  |  |  \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
> [INFO] |  |  \- commons-el:commons-el:jar:1.0:compile
> [INFO] |  +- org.apache.hadoop:hadoop-test:jar:1.2.1:compile
> [INFO] |  |  +- org.apache.ftpserver:ftplet-api:jar:1.0.0:compile
> [INFO] |  |  +- org.apache.mina:mina-core:jar:2.0.0-M5:compile
> [INFO] |  |  +- org.apache.ftpserver:ftpserver-core:jar:1.0.0:compile
> [INFO] |  |  \- org.apache.ftpserver:ftpserver-deprecated:jar:1.0.0-M2:compile
> [INFO] |  +- 
> com.github.stephenc.findbugs:findbugs-annotations:jar:1.3.9-1:compile
> [INFO] |  \- junit:junit:jar:4.10:test
> [INFO] | \- org.hamcrest:hamcrest-core:jar:1.1:test
> {noformat}
> If I run `mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -pl examples -am 
> dependency:tree -Dhbase.profile=hadoop2`, the dependency tree is correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5078) Allow setting Akka host name from env vars

2015-01-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5078.

   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0

> Allow setting Akka host name from env vars
> --
>
> Key: SPARK-5078
> URL: https://issues.apache.org/jira/browse/SPARK-5078
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 1.3.0, 1.2.1
>
>
> Currently Spark lets you set the IP address using SPARK_LOCAL_IP, but this 
> is then given to Akka after doing a reverse DNS lookup.  This makes it difficult 
> to run Spark in Docker.  You can already change the hostname that is used 
> programmatically, but it would be nice to be able to do this with an 
> environment variable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5102) CompressedMapStatus needs to be registered with Kryo

2015-01-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5102.

   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0

Fixed by: https://github.com/apache/spark/pull/4007

> CompressedMapStatus needs to be registered with Kryo
> 
>
> Key: SPARK-5102
> URL: https://issues.apache.org/jira/browse/SPARK-5102
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Daniel Darabos
>Assignee: Lianhui Wang
>Priority: Minor
> Fix For: 1.3.0, 1.2.1
>
>
> After upgrading from Spark 1.1.0 to 1.2.0 I got this exception:
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Class is 
> not registered: org.apache.spark.scheduler.CompressedMapStatus
> Note: To register this class use: 
> kryo.register(org.apache.spark.scheduler.CompressedMapStatus.class);
>   at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
>   at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:165)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I had to register {{org.apache.spark.scheduler.CompressedMapStatus}} with 
> Kryo. I think this should be done in 
> {{spark/serializer/KryoSerializer.scala}}, unless instances of this class are 
> not expected to be sent over the wire. (Maybe I'm doing something wrong?)
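
Until the fix ships, a hedged sketch of the user-side workaround described above is to 
register the class yourself via a custom KryoRegistrator (Class.forName is used because 
CompressedMapStatus is private to Spark):

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Register the scheduler's map-status class with Kryo explicitly.
class MapStatusRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(Class.forName("org.apache.spark.scheduler.CompressedMapStatus"))
  }
}
{code}

The registrator is then wired in with spark.serializer set to the Kryo serializer and 
spark.kryo.registrator set to the registrator's fully qualified class name.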



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5102) CompressedMapStatus needs to be registered with Kryo

2015-01-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5102:
---
Target Version/s: 1.2.1
Assignee: Lianhui Wang

> CompressedMapStatus needs to be registered with Kryo
> 
>
> Key: SPARK-5102
> URL: https://issues.apache.org/jira/browse/SPARK-5102
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Daniel Darabos
>Assignee: Lianhui Wang
>Priority: Minor
>
> After upgrading from Spark 1.1.0 to 1.2.0 I got this exception:
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Class is 
> not registered: org.apache.spark.scheduler.CompressedMapStatus
> Note: To register this class use: 
> kryo.register(org.apache.spark.scheduler.CompressedMapStatus.class);
>   at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
>   at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:165)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I had to register {{org.apache.spark.scheduler.CompressedMapStatus}} with 
> Kryo. I think this should be done in 
> {{spark/serializer/KryoSerializer.scala}}, unless instances of this class are 
> not expected to be sent over the wire. (Maybe I'm doing something wrong?)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark

2015-01-11 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273225#comment-14273225
 ] 

Patrick Wendell commented on SPARK-3561:


So if the question is: "Is Spark only an API, or is it an integrated API/execution 
engine?"... we've taken a fairly clear stance over the history of the project 
that it's an integrated engine. I.e. Spark is not something like Pig, where it's 
intended primarily as a user API and we expect there to be different physical 
execution engines plugged in underneath.

In the past we haven't found this prevents Spark from working well in different 
environments. For instance, with Mesos, on YARN, etc. And for this we've 
integrated at different layers such as the storage layer and the scheduling 
layer, where there were well-defined APIs and integration points in the 
broader ecosystem. Compared with alternatives, Spark is far more flexible in 
terms of runtime environments. The RDD API is so generic that it's very easy to 
customize and integrate.

For this reason, my feeling with decoupling execution from the rest of Spark is 
that it would tie our hands architecturally and not add much benefit. I don't 
see a good reason to make this broader change in the strategy of the project.

If there are specific improvements you see for making Spark work well on YARN, 
then we can definitely look at them.

> Allow for pluggable execution contexts in Spark
> ---
>
> Key: SPARK-3561
> URL: https://issues.apache.org/jira/browse/SPARK-3561
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Oleg Zhurakousky
>  Labels: features
> Attachments: SPARK-3561.pdf
>
>
> Currently Spark provides integration with external resource-managers such as 
> Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
> current architecture of Spark-on-YARN can be enhanced to provide 
> significantly better utilization of cluster resources for large scale, batch 
> and/or ETL applications when run alongside other applications (Spark and 
> others) and services in YARN. 
> Proposal: 
> The proposed approach would introduce a pluggable JobExecutionContext (trait) 
> - a gateway and a delegate to the Hadoop execution environment - as a non-public 
> API (@Experimental) not exposed to end users of Spark. 
> The trait will define 6 operations: 
> * hadoopFile 
> * newAPIHadoopFile 
> * broadcast 
> * runJob 
> * persist
> * unpersist
> Each method directly maps to the corresponding methods in current version of 
> SparkContext. JobExecutionContext implementation will be accessed by 
> SparkContext via master URL as 
> "execution-context:foo.bar.MyJobExecutionContext" with default implementation 
> containing the existing code from SparkContext, thus allowing current 
> (corresponding) methods of SparkContext to delegate to such implementation. 
> An integrator will now have the option to provide a custom implementation of 
> DefaultExecutionContext by either implementing it from scratch or extending 
> from DefaultExecutionContext. 
> Please see the attached design doc for more details. 
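
A hedged sketch of the shape of the proposed trait. The signatures below are deliberately 
simplified placeholders; the real signatures mirror the corresponding SparkContext methods 
and are spelled out in the attached design doc:

{code}
import scala.reflect.ClassTag
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Placeholder signatures only.
trait JobExecutionContext {
  def hadoopFile[K, V](path: String, minPartitions: Int): RDD[(K, V)]
  def newAPIHadoopFile[K, V](path: String): RDD[(K, V)]
  def broadcast[T: ClassTag](value: T): Broadcast[T]
  def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U]
  def persist[T](rdd: RDD[T], level: StorageLevel): RDD[T]
  def unpersist[T](rdd: RDD[T], blocking: Boolean): RDD[T]
}
{code}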



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5166) Stabilize Spark SQL APIs

2015-01-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5166:
---
Priority: Blocker  (was: Critical)

> Stabilize Spark SQL APIs
> 
>
> Key: SPARK-5166
> URL: https://issues.apache.org/jira/browse/SPARK-5166
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
>
> Before we take Spark SQL out of alpha, we need to audit the APIs and 
> stabilize them. 
> As a general rule, everything under org.apache.spark.sql.catalyst should not 
> be exposed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3340) Deprecate ADD_JARS and ADD_FILES

2015-01-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3340:
---
Labels: starter  (was: )

> Deprecate ADD_JARS and ADD_FILES
> 
>
> Key: SPARK-3340
> URL: https://issues.apache.org/jira/browse/SPARK-3340
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>  Labels: starter
>
> These were introduced before Spark submit even existed. Now that there are 
> many better ways of setting jars and python files through Spark submit, we 
> should deprecate these environment variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3450) Enable specifying the --jars CLI option multiple times

2015-01-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3450.

Resolution: Won't Fix

I'd prefer not to do this one; it complicates our parsing substantially. It's 
possible to just write a bash loop that creates a single long list of jars.

> Enable specifying the --jars CLI option multiple times
> ---
>
> Key: SPARK-3450
> URL: https://issues.apache.org/jira/browse/SPARK-3450
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.0.2
>Reporter: wolfgang hoschek
>
> spark-submit should support specifying the --jars option multiple times, e.g. 
> --jars foo.jar,bar.jar --jars baz.jar,oops.jar should be equivalent to --jars 
> foo.jar,bar.jar,baz.jar,oops.jar.
> This would allow using wrapper scripts that simplify usage for enterprise 
> customers along the following lines:
> {code}
> my-spark-submit.sh:
> jars=
> for i in /opt/myapp/*.jar; do
>   # append a comma separator before every jar except the first
>   if [ -n "$jars" ]
>   then
>     jars="$jars,"
>   fi
>   jars="$jars$i"
> done
> spark-submit --jars "$jars" "$@"
> {code}
> Example usage:
> {code}
> my-spark-submit.sh --jars myUserDefinedFunction.jar 
> {code}
> The relevant enhancement code might go into SparkSubmitArguments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Job priority

2015-01-11 Thread Patrick Wendell
Priority scheduling isn't something we've supported in Spark and we've
opted to support FIFO and Fair scheduling and asked users to try and
fit these to the needs of their applications.

In practice, from what I've seen of priority schedulers such as the
Linux CPU scheduler, strict priority scheduling is never used
because of priority starvation and other issues. So you end up with a
second tier of heuristics to deal with issues
like starvation, priority inversion, etc., and these become very
complex over time.

That said, I looked at this a bit with @kayousterhout and I don't think
it would be very hard to implement a simple priority scheduler in the
current architecture. My main concern would be additional complexity
that would develop over time, based on looking at previous
implementations in the wild.

Alessandro, would you be able to open a JIRA and list some of your
requirements there? That way we could hear whether other people have
similar needs.

- Patrick
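
For readers who want to try the pool-based approximation discussed in the quoted thread
below, here is a minimal hedged sketch. The pool names are assumptions and must match
pools declared (with very different weights) in the fairscheduler.xml referenced by
spark.scheduler.allocation.file:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Approximate priorities with fair-scheduler pools of very different weights.
// In practice the two jobs would be submitted concurrently from separate threads;
// they are shown sequentially here only to keep the sketch short.
val conf = new SparkConf()
  .setAppName("pool-priority-sketch")
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

sc.setLocalProperty("spark.scheduler.pool", "highPriority")
sc.parallelize(1 to 1000000).count()   // runs in the heavily weighted pool

sc.setLocalProperty("spark.scheduler.pool", "lowPriority")
sc.parallelize(1 to 1000000).count()   // runs in the lightly weighted pool
{code}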

On Sun, Jan 11, 2015 at 10:07 AM, Mark Hamstra  wrote:
> Yes, if you are asking about developing a new priority queue job scheduling
> feature and not just about how job scheduling currently works in Spark, the
> that's a dev list issue.  The current job scheduling priority is at the
> granularity of pools containing jobs, not the jobs themselves; so if you
> require strictly job-level priority queuing, that would require a new
> development effort -- and one that I expect will involve a lot of tricky
> corner cases.
>
> Sorry for misreading the nature of your initial inquiry.
>
> On Sun, Jan 11, 2015 at 7:36 AM, Alessandro Baretta 
> wrote:
>
>> Cody,
>>
>> While I might be able to improve the scheduling of my jobs by using a few
>> different pools with weights equal to, say, 1, 1e3 and 1e6, effectively
>> getting a small handful of priority classes. Still, this is really not
>> quite what I am describing. This is why my original post was on the dev
>> list. Let me then ask if there is any interest in having priority queue job
>> scheduling in Spark. This is something I might be able to pull off.
>>
>> Alex
>>
>> On Sun, Jan 11, 2015 at 6:21 AM, Cody Koeninger 
>> wrote:
>>
>>> If you set up a number of pools equal to the number of different priority
>>> levels you want, make the relative weights of those pools very different,
>>> and submit a job to the pool representing its priority, I think youll get
>>> behavior equivalent to a priority queue. Try it and see.
>>>
>>> If I'm misunderstandng what youre trying to do, then I don't know.
>>>
>>>
>>> On Sunday, January 11, 2015, Alessandro Baretta 
>>> wrote:
>>>
 Cody,

 Maybe I'm not getting this, but it doesn't look like this page is
 describing a priority queue scheduling policy. What this section discusses
 is how resources are shared between queues. A weight-1000 pool will get
 1000 times more resources allocated to it than a priority 1 queue. Great,
 but not what I want. I want to be able to define an Ordering on my
 tasks representing their priority, and have Spark allocate all resources to
 the job that has the highest priority.

 Alex

 On Sat, Jan 10, 2015 at 10:11 PM, Cody Koeninger 
 wrote:

>
> http://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties
>
> "Setting a high weight such as 1000 also makes it possible to
> implement *priority* between pools--in essence, the weight-1000 pool
> will always get to launch tasks first whenever it has jobs active."
>
> On Sat, Jan 10, 2015 at 11:57 PM, Alessandro Baretta <
> alexbare...@gmail.com> wrote:
>
>> Mark,
>>
>> Thanks, but I don't see how this documentation solves my problem. You
>> are referring me to documentation of fair scheduling; whereas, I am 
>> asking
>> about as unfair a scheduling policy as can be: a priority queue.
>>
>> Alex
>>
>> On Sat, Jan 10, 2015 at 5:00 PM, Mark Hamstra > > wrote:
>>
>>> -dev, +user
>>>
>>> http://spark.apache.org/docs/latest/job-scheduling.html
>>>
>>>
>>> On Sat, Jan 10, 2015 at 4:40 PM, Alessandro Baretta <
>>> alexbare...@gmail.com> wrote:
>>>
 Is it possible to specify a priority level for a job, such that the
 active
 jobs might be scheduled in order of priority?

 Alex

>>>
>>>
>>
>

>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Resolved] (SPARK-4399) Support multiple cloud providers

2015-01-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4399.

Resolution: Won't Fix

We'll let the community take this one on.

> Support multiple cloud providers
> 
>
> Key: SPARK-4399
> URL: https://issues.apache.org/jira/browse/SPARK-4399
> Project: Spark
>  Issue Type: New Feature
>  Components: EC2
>Affects Versions: 1.2.0
>Reporter: Andrew Ash
>
> We currently have Spark startup scripts for Amazon EC2 but not for various 
> other cloud providers.  This ticket is an umbrella to support multiple cloud 
> providers in the bundled scripts, not just Amazon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1422) Add scripts for launching Spark on Google Compute Engine

2015-01-11 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273178#comment-14273178
 ] 

Patrick Wendell commented on SPARK-1422:


Good call, Nick - yeah, let's close this as out of scope since it's being 
maintained elsewhere.

> Add scripts for launching Spark on Google Compute Engine
> 
>
> Key: SPARK-1422
> URL: https://issues.apache.org/jira/browse/SPARK-1422
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Matei Zaharia
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1422) Add scripts for launching Spark on Google Compute Engine

2015-01-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1422.

Resolution: Won't Fix

> Add scripts for launching Spark on Google Compute Engine
> 
>
> Key: SPARK-1422
> URL: https://issues.apache.org/jira/browse/SPARK-1422
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Matei Zaharia
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5032) MimaExcludes should not exclude GraphX

2015-01-10 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5032.

   Resolution: Fixed
Fix Version/s: 1.3.0

> MimaExcludes should not exclude GraphX
> --
>
> Key: SPARK-5032
> URL: https://issues.apache.org/jira/browse/SPARK-5032
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 1.3.0
>
>
> Since GraphX is no longer alpha as of 1.2, MimaExcludes should not include 
> this line for 1.3:
> {code}
> MimaBuild.excludeSparkPackage("graphx"),
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4737) Prevent serialization errors from ever crashing the DAG scheduler

2015-01-09 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272214#comment-14272214
 ] 

Patrick Wendell commented on SPARK-4737:


It's great to see this go in. Thanks [~mcheah]!

> Prevent serialization errors from ever crashing the DAG scheduler
> -
>
> Key: SPARK-4737
> URL: https://issues.apache.org/jira/browse/SPARK-4737
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: Patrick Wendell
>Assignee: Matthew Cheah
>Priority: Blocker
> Fix For: 1.3.0
>
>
> Currently in Spark we assume that when tasks are serialized in the 
> TaskSetManager that the serialization cannot fail. We assume this because 
> upstream in the DAGScheduler we attempt to catch any serialization errors by 
> serializing a single partition. However, in some cases this upstream test is 
> not accurate - i.e. an RDD can have one partition that can serialize cleanly 
> but not others.
> To do this the proper way, we need to catch and propagate the exception at 
> the time of serialization. The tricky bit is making sure it gets propagated 
> in the right way.
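
A hedged sketch of catching the failure at serialization time. The helper below is 
illustrative only and is not the actual TaskSetManager change:

{code}
import java.nio.ByteBuffer
import org.apache.spark.serializer.SerializerInstance

// Surface a serialization failure as a value the caller can use to abort the
// task set cleanly, instead of letting the exception escape into the scheduler.
def serializeTaskSafely(ser: SerializerInstance, task: AnyRef): Either[Throwable, ByteBuffer] =
  try {
    Right(ser.serialize(task))
  } catch {
    case e: java.io.NotSerializableException => Left(e)
  }
{code}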



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5073) "spark.storage.memoryMapThreshold" has two default values

2015-01-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5073:
---
Summary: "spark.storage.memoryMapThreshold" has two default values  (was: 
"spark.storage.memoryMapThreshold" has two default value)

> "spark.storage.memoryMapThreshold" has two default values
> -
>
> Key: SPARK-5073
> URL: https://issues.apache.org/jira/browse/SPARK-5073
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Jianhui Yuan
>Priority: Minor
>
> In org.apache.spark.storage.DiskStore:
>  val minMemoryMapBytes = 
> blockManager.conf.getLong("spark.storage.memoryMapThreshold", 2 * 4096L)
> In org.apache.spark.network.util.TransportConf:
>  public int memoryMapBytes() {
>  return conf.getInt("spark.storage.memoryMapThreshold", 2 * 1024 * 
> 1024);
>  }
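
For reference, the two defaults differ by a factor of 256: 2 * 4096 bytes = 8 KB in 
DiskStore versus 2 * 1024 * 1024 bytes = 2 MB in TransportConf.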



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5163) Load properties from configuration file for example spark-defaults.conf when creating SparkConf object

2015-01-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5163.

Resolution: Won't Fix

I'd prefer not to accept this patch for now - the spark-defaults.conf concept 
was isolated to Spark submit intentionally to keep things simple. It's easy for 
users to just implement the logic here on their own if they want to do a more 
customized type of job submission.
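
A hedged sketch of that do-it-yourself approach. The conf-file path and the choice to 
apply only spark.* keys are assumptions; java.util.Properties is used because it accepts 
the whitespace-separated "key value" lines found in spark-defaults.conf:

{code}
import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.spark.SparkConf

// Load spark-defaults.conf manually and apply its spark.* entries to a SparkConf.
def confWithDefaults(path: String =
    sys.env.getOrElse("SPARK_HOME", ".") + "/conf/spark-defaults.conf"): SparkConf = {
  val props = new Properties()
  val in = new FileInputStream(path)
  try props.load(in) finally in.close()
  val conf = new SparkConf()
  props.asScala.foreach { case (k, v) =>
    if (k.startsWith("spark.")) conf.set(k, v.trim)
  }
  conf
}
{code}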

> Load properties from configuration file for example spark-defaults.conf when 
> creating SparkConf object
> --
>
> Key: SPARK-5163
> URL: https://issues.apache.org/jira/browse/SPARK-5163
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: YanTang Zhai
>Priority: Minor
>
> I create and run a Spark program that does not use SparkSubmit.
> When I create a SparkConf object with `new SparkConf()`, it does not 
> automatically load properties from a configuration file such as 
> spark-defaults.conf.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1143) ClusterSchedulerSuite (soon to be TaskSchedulerImplSuite) does not actually test the ClusterScheduler/TaskSchedulerImpl

2015-01-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1143.

Resolution: Fixed
  Assignee: Kay Ousterhout  (was: Nan Zhu)

> ClusterSchedulerSuite (soon to be TaskSchedulerImplSuite) does not actually 
> test the ClusterScheduler/TaskSchedulerImpl
> ---
>
> Key: SPARK-1143
> URL: https://issues.apache.org/jira/browse/SPARK-1143
> Project: Spark
>  Issue Type: Bug
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> This test should probably be both refactored and renamed -- it really tests 
> the Pool / fair scheduling mechanisms and completely bypasses the scheduling 
> code in TaskSchedulerImpl and TaskSetManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5073) "spark.storage.memoryMapThreshold" has two default value

2015-01-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5073:
---
Summary: "spark.storage.memoryMapThreshold" has two default value  (was: 
"spark.storage.memoryMapThreshold" have two default value)

> "spark.storage.memoryMapThreshold" has two default value
> 
>
> Key: SPARK-5073
> URL: https://issues.apache.org/jira/browse/SPARK-5073
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Jianhui Yuan
>Priority: Minor
>
> In org.apache.spark.storage.DiskStore:
>  val minMemoryMapBytes = 
> blockManager.conf.getLong("spark.storage.memoryMapThreshold", 2 * 4096L)
> In org.apache.spark.network.util.TransportConf:
>  public int memoryMapBytes() {
>  return conf.getInt("spark.storage.memoryMapThreshold", 2 * 1024 * 
> 1024);
>  }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5136) Improve documentation around setting up Spark IntelliJ project

2015-01-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5136.

   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0

> Improve documentation around setting up Spark IntelliJ project
> --
>
> Key: SPARK-5136
> URL: https://issues.apache.org/jira/browse/SPARK-5136
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.3.0, 1.2.1
>
>
> [The documentation about setting up a Spark project in 
> Intellij|http://spark.apache.org/docs/latest/building-spark.html#using-with-intellij-idea]
>  is somewhat short/cryptic and targets [an IntelliJ version released in 
> 2012|https://www.jetbrains.com/company/history.jsp]. A refresh / upgrade is 
> probably warranted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5136) Improve documentation around setting up Spark IntelliJ project

2015-01-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5136:
---
Assignee: Sean Owen

> Improve documentation around setting up Spark IntelliJ project
> --
>
> Key: SPARK-5136
> URL: https://issues.apache.org/jira/browse/SPARK-5136
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.3.0, 1.2.1
>
>
> [The documentation about setting up a Spark project in 
> Intellij|http://spark.apache.org/docs/latest/building-spark.html#using-with-intellij-idea]
>  is somewhat short/cryptic and targets [an IntelliJ version released in 
> 2012|https://www.jetbrains.com/company/history.jsp]. A refresh / upgrade is 
> probably warranted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Spark development with IntelliJ

2015-01-08 Thread Patrick Wendell
Actually I went ahead and did it.

On Thu, Jan 8, 2015 at 10:25 PM, Patrick Wendell  wrote:
> Nick - yes. Do you mind moving it? I should have put it in the
> "Contributing to Spark" page.
>
> On Thu, Jan 8, 2015 at 3:22 PM, Nicholas Chammas
>  wrote:
>> Side question: Should this section
>> <https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-IDESetup>
>> in
>> the wiki link to Useful Developer Tools
>> <https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools>?
>>
>> On Thu Jan 08 2015 at 6:19:55 PM Sean Owen  wrote:
>>
>>> I remember seeing this too, but it seemed to be transient. Try
>>> compiling again. In my case I recall that IJ was still reimporting
>>> some modules when I tried to build. I don't see this error in general.
>>>
>>> On Thu, Jan 8, 2015 at 10:38 PM, Bill Bejeck  wrote:
>>> > I was having the same issue and that helped.  But now I get the following
>>> > compilation error when trying to run a test from within Intellij (v 14)
>>> >
>>> > /Users/bbejeck/dev/github_clones/bbejeck-spark/sql/
>>> catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
>>> > Error:(308, 109) polymorphic expression cannot be instantiated to
>>> expected
>>> > type;
>>> >  found   : [T(in method
>>> > apply)]org.apache.spark.sql.catalyst.dsl.ScalaUdfBuilder[T(in method
>>> apply)]
>>> >  required: org.apache.spark.sql.catalyst.dsl.package.ScalaUdfBuilder[T(
>>> in
>>> > method functionToUdfBuilder)]
>>> >   implicit def functionToUdfBuilder[T: TypeTag](func: Function1[_, T]):
>>> > ScalaUdfBuilder[T] = ScalaUdfBuilder(func)
>>> >
>>> > Any thoughts?
>>> >
>>> > ^
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Commented] (SPARK-5136) Improve documentation around setting up Spark IntelliJ project

2015-01-08 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270626#comment-14270626
 ] 

Patrick Wendell commented on SPARK-5136:


I've updated it to be in the new location.

> Improve documentation around setting up Spark IntelliJ project
> --
>
> Key: SPARK-5136
> URL: https://issues.apache.org/jira/browse/SPARK-5136
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> [The documentation about setting up a Spark project in 
> Intellij|http://spark.apache.org/docs/latest/building-spark.html#using-with-intellij-idea]
>  is somewhat short/cryptic and targets [an IntelliJ version released in 
> 2012|https://www.jetbrains.com/company/history.jsp]. A refresh / upgrade is 
> probably warranted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5136) Improve documentation around setting up Spark IntelliJ project

2015-01-08 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270624#comment-14270624
 ] 

Patrick Wendell commented on SPARK-5136:


Hey Guys,

I wrote that on the wiki quite recently, but yes I think the YARN stuff has 
changed. Also, I should have put that on the more visible wiki page in the 
first place:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-IntelliJ

Maybe we should link to that section instead and we can update the wiki also.

> Improve documentation around setting up Spark IntelliJ project
> --
>
> Key: SPARK-5136
> URL: https://issues.apache.org/jira/browse/SPARK-5136
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> [The documentation about setting up a Spark project in 
> Intellij|http://spark.apache.org/docs/latest/building-spark.html#using-with-intellij-idea]
>  is somewhat short/cryptic and targets [an IntelliJ version released in 
> 2012|https://www.jetbrains.com/company/history.jsp]. A refresh / upgrade is 
> probably warranted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Spark development with IntelliJ

2015-01-08 Thread Patrick Wendell
Nick - yes. Do you mind moving it? I should have put it in the
"Contributing to Spark" page.

On Thu, Jan 8, 2015 at 3:22 PM, Nicholas Chammas
 wrote:
> Side question: Should this section
> 
> in
> the wiki link to Useful Developer Tools
> ?
>
> On Thu Jan 08 2015 at 6:19:55 PM Sean Owen  wrote:
>
>> I remember seeing this too, but it seemed to be transient. Try
>> compiling again. In my case I recall that IJ was still reimporting
>> some modules when I tried to build. I don't see this error in general.
>>
>> On Thu, Jan 8, 2015 at 10:38 PM, Bill Bejeck  wrote:
>> > I was having the same issue and that helped.  But now I get the following
>> > compilation error when trying to run a test from within Intellij (v 14)
>> >
>> > /Users/bbejeck/dev/github_clones/bbejeck-spark/sql/
>> catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
>> > Error:(308, 109) polymorphic expression cannot be instantiated to
>> expected
>> > type;
>> >  found   : [T(in method
>> > apply)]org.apache.spark.sql.catalyst.dsl.ScalaUdfBuilder[T(in method
>> apply)]
>> >  required: org.apache.spark.sql.catalyst.dsl.package.ScalaUdfBuilder[T(
>> in
>> > method functionToUdfBuilder)]
>> >   implicit def functionToUdfBuilder[T: TypeTag](func: Function1[_, T]):
>> > ScalaUdfBuilder[T] = ScalaUdfBuilder(func)
>> >
>> > Any thoughts?
>> >
>> > ^
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5152) Let metrics.properties file take an hdfs:// path

2015-01-08 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270616#comment-14270616
 ] 

Patrick Wendell edited comment on SPARK-5152 at 1/9/15 6:19 AM:


Should we be loading the metrics properties on executors in the first place? 
Maybe that's the issue. I haven't looked at the code in a while but I'm not 
sure people use this in a way where they expect to be able to query executors 
for metrics.


was (Author: pwendell):
Should we be loading the metrics properties on executors in the first place? 
Maybe that's the issue. Since executors are ephemeral you can't query them for 
any metrics anyways, right?

> Let metrics.properties file take an hdfs:// path
> 
>
> Key: SPARK-5152
> URL: https://issues.apache.org/jira/browse/SPARK-5152
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>
> From my reading of [the 
> code|https://github.com/apache/spark/blob/06dc4b5206a578065ebbb6bb8d54246ca007397f/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L53],
>  the {{spark.metrics.conf}} property must be a path that is resolvable on the 
> local filesystem of each executor.
> Running a Spark job with {{--conf 
> spark.metrics.conf=hdfs://host1.domain.com/path/metrics.properties}} logs 
> many errors (~1 per executor, presumably?) like:
> {code}
> 15/01/08 13:20:57 ERROR metrics.MetricsConfig: Error loading configure file
> java.io.FileNotFoundException: hdfs:/host1.domain.com/path/metrics.properties 
> (No such file or directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.<init>(FileInputStream.java:146)
> at java.io.FileInputStream.<init>(FileInputStream.java:101)
> at 
> org.apache.spark.metrics.MetricsConfig.initialize(MetricsConfig.scala:53)
> at 
> org.apache.spark.metrics.MetricsSystem.<init>(MetricsSystem.scala:92)
> at 
> org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:218)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:329)
> at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:181)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:131)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:60)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> {code}
> which seems consistent with the idea that it's looking on the local 
> filesystem and not parsing the "scheme" portion of the URL.
> Letting all executors get their {{metrics.properties}} files from one 
> location on HDFS would be an improvement, right?
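
A minimal sketch of the improvement being asked for, assuming the path is resolved through the Hadoop FileSystem API instead of java.io.FileInputStream; the method name is an illustrative assumption and this is not the actual MetricsConfig code:

{code}
import java.util.Properties

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def loadMetricsProperties(uri: String): Properties = {
  val props = new Properties()
  val path = new Path(uri)
  // Picks the filesystem from the URI scheme, so hdfs://, file://, etc. all work.
  val fs = path.getFileSystem(new Configuration())
  val in = fs.open(path)
  try props.load(in) finally in.close()
  props
}
{code}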



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5152) Let metrics.properties file take an hdfs:// path

2015-01-08 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270616#comment-14270616
 ] 

Patrick Wendell commented on SPARK-5152:


Should we be loading the metrics properties on executors in the first place? 
Maybe that's the issue. Since executors are ephemeral you can't query them for 
any metrics anyways, right?

> Let metrics.properties file take an hdfs:// path
> 
>
> Key: SPARK-5152
> URL: https://issues.apache.org/jira/browse/SPARK-5152
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>
> From my reading of [the 
> code|https://github.com/apache/spark/blob/06dc4b5206a578065ebbb6bb8d54246ca007397f/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L53],
>  the {{spark.metrics.conf}} property must be a path that is resolvable on the 
> local filesystem of each executor.
> Running a Spark job with {{--conf 
> spark.metrics.conf=hdfs://host1.domain.com/path/metrics.properties}} logs 
> many errors (~1 per executor, presumably?) like:
> {code}
> 15/01/08 13:20:57 ERROR metrics.MetricsConfig: Error loading configure file
> java.io.FileNotFoundException: hdfs:/host1.domain.com/path/metrics.properties 
> (No such file or directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.<init>(FileInputStream.java:146)
> at java.io.FileInputStream.<init>(FileInputStream.java:101)
> at 
> org.apache.spark.metrics.MetricsConfig.initialize(MetricsConfig.scala:53)
> at 
> org.apache.spark.metrics.MetricsSystem.<init>(MetricsSystem.scala:92)
> at 
> org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:218)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:329)
> at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:181)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:131)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:60)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> {code}
> which seems consistent with the idea that it's looking on the local 
> filesystem and not parsing the "scheme" portion of the URL.
> Letting all executors get their {{metrics.properties}} files from one 
> location on HDFS would be an improvement, right?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2620) case class cannot be used as key for reduce

2015-01-08 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2620:
---
Assignee: Tobias Schlatter

> case class cannot be used as key for reduce
> ---
>
> Key: SPARK-2620
> URL: https://issues.apache.org/jira/browse/SPARK-2620
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.0.0, 1.1.0
> Environment: reproduced on spark-shell local[4]
>Reporter: Gerard Maas
>Assignee: Tobias Schlatter
>Priority: Critical
>  Labels: case-class, core
>
> Using a case class as a key doesn't seem to work properly on Spark 1.0.0
> A minimal example:
> case class P(name:String)
> val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
> sc.parallelize(ps).map(x=> (x,1)).reduceByKey((x,y) => x+y).collect
> [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), 
> (P(bob),1), (P(alice),1), (P(charly),1))
> This contrasts with the expected behavior, which should be equivalent to:
> sc.parallelize(ps).map(x=> (x.name,1)).reduceByKey((x,y) => x+y).collect
> Array[(String, Int)] = Array((charly,1), (alice,1), (bob,2))
> groupByKey and distinct also present the same behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4048) Enhance and extend hadoop-provided profile

2015-01-08 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4048.

   Resolution: Fixed
Fix Version/s: 1.3.0

> Enhance and extend hadoop-provided profile
> --
>
> Key: SPARK-4048
> URL: https://issues.apache.org/jira/browse/SPARK-4048
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.3.0
>
>
> The hadoop-provided profile is used to not package Hadoop dependencies inside 
> the Spark assembly. It works, sort of, but it could use some enhancements. A 
> quick list:
> - It doesn't include all things that could be removed from the assembly
> - It doesn't work well when you're publishing artifacts based on it 
> (SPARK-3812 fixes this)
> - There are other dependencies that could use similar treatment: Hive, HBase 
> (for the examples), Flume, Parquet, maybe others I'm missing at the moment.
> - Unit tests, more specifically, those that use local-cluster mode, do not 
> work when the assembly is built with this profile enabled.
> - The scripts to launch Spark jobs do not add needed "provided" jars to the 
> classpath when this profile is enabled, leaving it for people to figure that 
> out for themselves.
> - The examples assembly duplicates a lot of things in the main assembly.
> Part of this task is selfish since we build internally with this profile and 
> we'd like to make it easier for us to merge changes without having to keep 
> too many patches on top of upstream. But those feel like good improvements to 
> me, regardless.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode

2015-01-08 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5158:
---
Description: 
There have been a handful of patches for allowing access to Kerberized HDFS 
clusters in standalone mode. The main reason we haven't accepted these patches 
has been that they rely on insecure distribution of token files from the 
driver to the other components.

As a simpler solution, I wonder if we should just provide a way to have the 
Spark driver and executors independently log in and acquire credentials using a 
keytab. This would work for users who have dedicated, single-tenant Spark 
clusters (i.e. they are willing to have a keytab on every machine running Spark 
for their application). It wouldn't address all possible deployment scenarios, 
but if it's simple I think it's worth considering.

This would also work for Spark streaming jobs, which often run on dedicated 
hardware since they are long-running services.

  was:
There have been a handful of patches for allowing access to Kerberized HDFS 
clusters in standalone mode. The main reason we haven't accepted these patches 
have been that they rely on insecure distribution of token files from the 
driver to the other components.

As a simpler solution, I wonder if we should just provide a way to have the 
Spark driver and executors independently log in and acquire credentials using a 
keytab. This would work for users who are build dedicated, single-tenant, Spark 
clusters (i.e. they are willing to have a keytab on every machine running Spark 
for their application). It wouldn't address all possible deployment scenarios, 
but if it's simple I think it's worth considering.

This would also work for Spark streaming jobs, which often run on dedicated 
hardware since they are long-running services.


> Allow for keytab-based HDFS security in Standalone mode
> ---
>
> Key: SPARK-5158
> URL: https://issues.apache.org/jira/browse/SPARK-5158
> Project: Spark
>  Issue Type: New Feature
>      Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Matthew Cheah
>Priority: Critical
>
> There have been a handful of patches for allowing access to Kerberized HDFS 
> clusters in standalone mode. The main reason we haven't accepted these 
> patches has been that they rely on insecure distribution of token files from 
> the driver to the other components.
> As a simpler solution, I wonder if we should just provide a way to have the 
> Spark driver and executors independently log in and acquire credentials using 
> a keytab. This would work for users who have dedicated, single-tenant 
> Spark clusters (i.e. they are willing to have a keytab on every machine 
> running Spark for their application). It wouldn't address all possible 
> deployment scenarios, but if it's simple I think it's worth considering.
> This would also work for Spark streaming jobs, which often run on dedicated 
> hardware since they are long-running services.
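
A minimal sketch of what the independent login could look like, assuming Hadoop's UserGroupInformation API; the principal and keytab path are placeholders, and this is not an existing Spark feature:

{code}
import org.apache.hadoop.security.UserGroupInformation

// Each driver/executor process logs in on its own at startup, provided the
// keytab file is present on that machine.
def loginFromKeytab(principal: String, keytabPath: String): Unit = {
  UserGroupInformation.loginUserFromKeytab(principal, keytabPath)
}

// e.g. loginFromKeytab("spark/host1.domain@EXAMPLE.COM", "/etc/security/keytabs/spark.keytab")
{code}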



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode

2015-01-08 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-5158:
--

 Summary: Allow for keytab-based HDFS security in Standalone mode
 Key: SPARK-5158
 URL: https://issues.apache.org/jira/browse/SPARK-5158
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Matthew Cheah
Priority: Critical


There have been a handful of patches for allowing access to Kerberized HDFS 
clusters in standalone mode. The main reason we haven't accepted these patches 
has been that they rely on insecure distribution of token files from the 
driver to the other components.

As a simpler solution, I wonder if we should just provide a way to have the 
Spark driver and executors independently log in and acquire credentials using a 
keytab. This would work for users who are building dedicated, single-tenant Spark 
clusters (i.e. they are willing to have a keytab on every machine running Spark 
for their application). It wouldn't address all possible deployment scenarios, 
but if it's simple I think it's worth considering.

This would also work for Spark streaming jobs, which often run on dedicated 
hardware since they are long-running services.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: When will spark support "push" style shuffle?

2015-01-07 Thread Patrick Wendell
This question is conflating a few different concepts. I think the main
question is whether Spark will have a shuffle implementation that
streams data rather than persisting it to disk/cache as a buffer.
Spark currently decouples the shuffle write from the read using
disk/OS cache as a buffer. The two benefits of this approach are
that it allows intra-query fault tolerance and it makes it easier to
elastically scale and reschedule work within a job. We consider these
to be design requirements (think about jobs that run for several hours
on hundreds of machines). Impala, and similar systems like dremel and
f1, do not offer fault tolerance within a query at present. They also
require gang scheduling the entire set of resources that will exist
for the duration of a query.

A secondary question is whether our shuffle should have a barrier or
not. Spark's shuffle currently has a hard barrier between map and
reduce stages. We haven't seen really strong evidence that removing
the barrier is a net win. It can help the performance of a single job
(modestly), but in a multi-tenant workload, it leads to poor
utilization since you have a lot of reduce tasks that are taking up
slots waiting for mappers to finish. Many large scale users of
Map/Reduce disable this feature in production clusters for that
reason. Thus, we haven't seen compelling evidence for removing the
barrier at this point, given the complexity of doing so.

It is possible that future versions of Spark will support push-based
shuffles, potentially in a mode that removes some of Spark's fault
tolerance properties. But there are many other things we can still
optimize about the shuffle that would likely come before this.

- Patrick

On Wed, Jan 7, 2015 at 6:01 PM, 曹雪林  wrote:
> Hi,
>
>   I've heard a lot of complaints about Spark's "pull" style shuffle. Is
> there any plan to support "push" style shuffle in the near future?
>
>   Currently, the shuffle phase must be completed before the next stage
> starts. In Impala, it is said, the shuffled data is "streamed" to
> the next stage handler, which greatly saves time. Will Spark support this
> mechanism one day?
>
> Thanks

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-07 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267424#comment-14267424
 ] 

Patrick Wendell commented on SPARK-1529:


BTW - I think if MapR wants to have a customized shuffle, the direction 
proposed in this patch is probably not the best way to do it. It would make 
more sense to implement a DFS-based shuffle using the new pluggable shuffle 
API. I.e. a shuffle that communicates through the filesystem rather than doing 
transfers through Spark.
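
A minimal sketch of how such a shuffle would be selected, assuming spark.shuffle.manager accepts a fully qualified class name; org.example.DfsShuffleManager is a hypothetical implementation used only for illustration:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("dfs-shuffle-example")
  // Hypothetical DFS-backed ShuffleManager implementation plugged in by class name.
  .set("spark.shuffle.manager", "org.example.DfsShuffleManager")
{code}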

> Support setting spark.local.dirs to a hadoop FileSystem 
> 
>
> Key: SPARK-1529
> URL: https://issues.apache.org/jira/browse/SPARK-1529
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>    Reporter: Patrick Wendell
>Assignee: Cheng Lian
>
> In some environments, like with MapR, local volumes are accessed through the 
> Hadoop filesystem interface. We should allow setting spark.local.dir to a 
> Hadoop filesystem location. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-07 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267419#comment-14267419
 ] 

Patrick Wendell commented on SPARK-1529:


Hey Sean,

From what I remember of this, the issue is that MapR clusters are not 
typically provisioned with much local disk space available, because the MapRFS 
supports accessing "local" volumes in its API, unlike the HDFS API. So in 
general the expectation is that large amounts of local data should be written 
through MapR's API to its local filesystem. They have an NFS mount you can use 
as a workaround to provide POSIX APIs, and I think most MapR users set this 
mount up and then have Spark write shuffle data there.

Option 2 which [~rkannan82] mentions is not actually feasible in Spark right 
now. We don't support writing shuffle data through the Hadoop API's right now 
and I think Cheng's patch was only a prototype of how we might do that...
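
A minimal sketch of the NFS-mount workaround mentioned above, assuming the MapR volume is exposed as a local POSIX path; the mount point shown is a placeholder:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Shuffle and spill files are written to the NFS-mounted MapR volume (placeholder path).
  .set("spark.local.dir", "/mapr/my.cluster.com/user/spark/tmp")
{code}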

> Support setting spark.local.dirs to a hadoop FileSystem 
> 
>
> Key: SPARK-1529
> URL: https://issues.apache.org/jira/browse/SPARK-1529
> Project: Spark
>      Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Cheng Lian
>
> In some environments, like with MapR, local volumes are accessed through the 
> Hadoop filesystem interface. We should allow setting spark.local.dir to a 
> Hadoop filesystem location. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Hang on Executor classloader lookup for the remote REPL URL classloader

2015-01-07 Thread Patrick Wendell
Hey Andrew,

So the executors in Spark will fetch classes defined in the repl from an
HTTP server on the driver node. Is this
happening in the context of a repl session? Also, is it deterministic
or does it happen only periodically?

The reason all of the other threads are hanging is that there is a
global lock around classloading, so they all queue up.

Could you attach the full stack trace from the driver? Is it possible
that something in the network is blocking the transfer of bytes
between these two processes? Based on the stack trace it looks like it
sent an HTTP request and is waiting on the result back from the
driver.

One thing to check is to verify that the TCP connection between them
used for the repl class server is still alive from the vantage point
of both the executor and driver nodes. Another thing to try would be
to temporarily open up any firewalls that are on the nodes or in the
network and see if this makes the problem go away (to isolate it to an
exogenous-to-Spark network issue).

- Patrick

On Wed, Aug 20, 2014 at 11:35 PM, Andrew Ash  wrote:
> Hi Spark devs,
>
> I'm seeing a stacktrace where the classloader that reads from the REPL is
> hung, and blocking all progress on that executor.  Below is that hung
> thread's stacktrace, and also the stacktrace of another hung thread.
>
> I thought maybe there was an issue with the REPL's JVM on the other side,
> but didn't see anything useful in that stacktrace either.
>
> Any ideas what I should be looking for?
>
> Thanks!
> Andrew
>
>
> "Executor task launch worker-0" daemon prio=10 tid=0x7f780c208000
> nid=0x6ae9 runnable [0x7f78c2eeb000]
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:152)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> - locked <0x7f7e13ea9560> (a java.io.BufferedInputStream)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
> at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
> - locked <0x7f7e13e9eeb0> (a
> sun.net.www.protocol.http.HttpURLConnection)
> at java.net.URL.openStream(URL.java:1037)
> at
> org.apache.spark.repl.ExecutorClassLoader.findClassLocally(ExecutorClassLoader.scala:86)
> at
> org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:63)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> - locked <0x7f7fc9018980> (a
> org.apache.spark.repl.ExecutorClassLoader)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:270)
> at org.apache.avro.util.ClassUtils.forName(ClassUtils.java:102)
> at org.apache.avro.util.ClassUtils.forName(ClassUtils.java:82)
> at
> org.apache.avro.specific.SpecificData.getClass(SpecificData.java:132)
> at
> org.apache.avro.specific.SpecificDatumReader.setSchema(SpecificDatumReader.java:69)
> at
> org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:126)
> at
> org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
> at
> org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:59)
> at
> org.apache.avro.mapred.AvroRecordReader.<init>(AvroRecordReader.java:41)
> at
> org.apache.avro.mapred.AvroInputFormat.getRecordReader(AvroInputFormat.java:71)
> at
> org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:193)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:184)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:93)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
>
> And the other threads are stuck on the Class.forName0() method too:
>
> "Executor task launch worker-4" daemon prio=10 tid=0x7f780c20f000
> nid=0x6aed waiting for monitor entry [0x7f78c2ae8000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:270)
>

[jira] [Updated] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-07 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5097:
---
Priority: Critical  (was: Major)

> Adding data frame APIs to SchemaRDD
> ---
>
> Key: SPARK-5097
> URL: https://issues.apache.org/jira/browse/SPARK-5097
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
> Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf
>
>
> SchemaRDD, through its DSL, already provides common data frame 
> functionalities. However, the DSL was originally created for constructing 
> test cases without much end-user usability and API stability consideration. 
> This design doc proposes a set of API changes for Scala and Python to make 
> the SchemaRDD DSL API more usable and stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark

2015-01-06 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5113:
---
Description: 
Spark has multiple network components that start servers and advertise their 
network addresses to other processes.

We should go through each of these components and make sure they have 
consistent and/or documented behavior wrt (a) what interface(s) they bind to 
and (b) what hostname they use to advertise themselves to other processes. We 
should document this clearly and explain to people what to do in different 
cases (e.g. EC2, dockerized containers, etc).

When Spark initializes, it will search for a network interface until it finds 
one that is not a loopback address. Then it will do a reverse DNS lookup for a 
hostname associated with that interface. Then the network components will use 
that hostname to advertise the component to other processes. That hostname is 
also the one used for the akka system identifier (akka supports only supplying 
a single name which it uses both as the bind interface and as the actor 
identifier). In some cases, that hostname is used as the bind hostname also 
(e.g. I think this happens in the connection manager and possibly akka) - which 
will likely internally result in a re-resolution of this to an IP address. In 
other cases (the web UI and netty shuffle) we seem to bind to all interfaces.

The best outcome would be to have three configs that can be set on each machine:

{code}
SPARK_LOCAL_IP # Ip address we bind to for all services
SPARK_INTERNAL_HOSTNAME # Hostname we advertise to remote processes within the 
cluster
SPARK_EXTERNAL_HOSTNAME # Hostname we advertise to processes outside the 
cluster (e.g. the UI)
{code}

It's not clear how easily we can support that scheme while providing backwards 
compatibility. The last one (SPARK_EXTERNAL_HOSTNAME) is easy - it's just an 
alias for what is now SPARK_PUBLIC_DNS.

  was:
Spark has multiple network components that start servers and advertise their 
network addresses to other processes.

We should go through each of these components and make sure they have 
consistent and/or documented behavior wrt (a) what interface(s) they bind to 
and (b) what hostname they use to advertise themselves to other processes. We 
should document this clearly and explain to people what to do in different 
cases (e.g. EC2, dockerized containers, etc).

When Spark initializes, it will search for a network interface until it finds 
one that is not a loopback address. Then it will do a reverse DNS lookup for a 
hostname associated with that interface. Then the network components will use 
that hostname to advertise the component to other processes. That hostname is 
also the one used for the akka system identifier (akka supports only supplying 
a single name which it uses both as the bind interface and as the actor 
identifier). In some cases, that hostname is used as the bind hostname also 
(e.g. I think this happens in the connection manager and possibly akka) - which 
will likely internally result in a re-resolution of this to an IP address. In 
other cases (the web UI and netty shuffle) we seem to bind to all interfaces.


> Audit and document use of hostnames and IP addresses in Spark
> -
>
> Key: SPARK-5113
> URL: https://issues.apache.org/jira/browse/SPARK-5113
> Project: Spark
>  Issue Type: Bug
>    Reporter: Patrick Wendell
>Priority: Critical
>
> Spark has multiple network components that start servers and advertise their 
> network addresses to other processes.
> We should go through each of these components and make sure they have 
> consistent and/or documented behavior wrt (a) what interface(s) they bind to 
> and (b) what hostname they use to advertise themselves to other processes. We 
> should document this clearly and explain to people what to do in different 
> cases (e.g. EC2, dockerized containers, etc).
> When Spark initializes, it will search for a network interface until it finds 
> one that is not a loopback address. Then it will do a reverse DNS lookup for 
> a hostname associated with that interface. Then the network components will 
> use that hostname to advertise the component to other processes. That 
> hostname is also the one used for the akka system identifier (akka supports 
> only supplying a single name which it uses both as the bind interface and as 
> the actor identifier). In some cases, that hostname is used as the bind 
> hostname also (e.g. I think this happens in the connection manager and 
> possibly akka) - which will likely internally result in a re-resolution of 
> this to an IP address. In other cases (the web UI and netty shuffle) we seem 
> to bind to all interfaces.

[jira] [Updated] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark

2015-01-06 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5113:
---
Description: 
Spark has multiple network components that start servers and advertise their 
network addresses to other processes.

We should go through each of these components and make sure they have 
consistent and/or documented behavior wrt (a) what interface(s) they bind to 
and (b) what hostname they use to advertise themselves to other processes. We 
should document this clearly and explain to people what to do in different 
cases (e.g. EC2, dockerized containers, etc).

When Spark initializes, it will search for a network interface until it finds 
one that is not a loopback address. Then it will do a reverse DNS lookup for a 
hostname associated with that interface. Then the network components will use 
that hostname to advertise the component to other processes. That hostname is 
also the one used for the akka system identifier (akka supports only supplying 
a single name which it uses both as the bind interface and as the actor 
identifier). In some cases, that hostname is used as the bind hostname also 
(e.g. I think this happens in the connection manager and possibly akka) - which 
will likely internally result in a re-resolution of this to an IP address. In 
other cases (the web UI and netty shuffle) we seem to bind to all interfaces.

  was:
Spark has multiple network components that start servers and advertise their 
network addresses to other processes.

We should go through each of these components and make sure they have 
consistent and/or documented behavior wrt (a) what interface(s) they bind to 
and (b) what hostname they use to advertise themselves to other processes. We 
should document this clearly and explain to people what to do in different 
cases (e.g. EC2, dockerized containers, etc).

When Spark initializes, it will search for a network interface until it finds 
one that is not a loopback address. Then it will do a reverse DNS lookup for a 
hostname associated with that interface. Then the network components will use 
that hostname to advertise the component to other processes. That hostname is 
also the one used for the akka system identifier. In some cases, that hostname 
is used as the bind hostname also (e.g. I think this happens in the connection 
manager and possibly akka) - which will likely internally result in a 
re-resolution of this to an IP address. In other cases (the web UI and netty 
shuffle) we seem to bind to all interfaces.


> Audit and document use of hostnames and IP addresses in Spark
> -
>
> Key: SPARK-5113
> URL: https://issues.apache.org/jira/browse/SPARK-5113
> Project: Spark
>  Issue Type: Bug
>    Reporter: Patrick Wendell
>Priority: Critical
>
> Spark has multiple network components that start servers and advertise their 
> network addresses to other processes.
> We should go through each of these components and make sure they have 
> consistent and/or documented behavior wrt (a) what interface(s) they bind to 
> and (b) what hostname they use to advertise themselves to other processes. We 
> should document this clearly and explain to people what to do in different 
> cases (e.g. EC2, dockerized containers, etc).
> When Spark initializes, it will search for a network interface until it finds 
> one that is not a loopback address. Then it will do a reverse DNS lookup for 
> a hostname associated with that interface. Then the network components will 
> use that hostname to advertise the component to other processes. That 
> hostname is also the one used for the akka system identifier (akka supports 
> only supplying a single name which it uses both as the bind interface and as 
> the actor identifier). In some cases, that hostname is used as the bind 
> hostname also (e.g. I think this happens in the connection manager and 
> possibly akka) - which will likely internally result in a re-resolution of 
> this to an IP address. In other cases (the web UI and netty shuffle) we seem 
> to bind to all interfaces.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark

2015-01-06 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5113:
---
Description: 
Spark has multiple network components that start servers and advertise their 
network addresses to other processes.

We should go through each of these components and make sure they have 
consistent and/or documented behavior wrt (a) what interface(s) they bind to 
and (b) what hostname they use to advertise themselves to other processes. We 
should document this clearly and explain to people what to do in different 
cases (e.g. EC2, dockerized containers, etc).

When Spark initializes, it will search for a network interface until it finds 
one that is not a loopback address. Then it will do a reverse DNS lookup for a 
hostname associated with that interface. Then the network components will use 
that hostname to advertise the component to other processes. That hostname is 
also the one used for the akka system identifier. In some cases, that hostname 
is used as the bind hostname also (e.g. I think this happens in the connection 
manager and possibly akka) - which will likely internally result in a 
re-resolution of this to an IP address. In other cases (the web UI and netty 
shuffle) we seem to bind to all interfaces.

  was:
Spark has multiple network components that start servers and advertise their 
network addresses to other processes.

We should go through each of these components and make sure they have 
consistent and/or documented behavior wrt (a) what interface(s) they bind to 
and (b) what hostname they use to advertise themselves to other processes. We 
should document this clearly and explain to people what to do in different 
cases (e.g. EC2, dockerized containers, etc).

When Spark initializes, it will search for a network interface until it finds 
one that is not a loopback address. Then it will do a reverse DNS lookup for a 
hostname associated with that interface. Then the network components will use 
that hostname to advertise the component to other processes. In some cases, 
that hostname is used as the bind hostname also (e.g. I think this happens in 
the connection manager and possibly akka) - which will likely internally result 
in a re-resolution of this to an IP address. In other cases (the web UI and 
netty shuffle) we seem to bind to all interfaces.


> Audit and document use of hostnames and IP addresses in Spark
> -
>
> Key: SPARK-5113
> URL: https://issues.apache.org/jira/browse/SPARK-5113
> Project: Spark
>  Issue Type: Bug
>    Reporter: Patrick Wendell
>Priority: Critical
>
> Spark has multiple network components that start servers and advertise their 
> network addresses to other processes.
> We should go through each of these components and make sure they have 
> consistent and/or documented behavior wrt (a) what interface(s) they bind to 
> and (b) what hostname they use to advertise themselves to other processes. We 
> should document this clearly and explain to people what to do in different 
> cases (e.g. EC2, dockerized containers, etc).
> When Spark initializes, it will search for a network interface until it finds 
> one that is not a loopback address. Then it will do a reverse DNS lookup for 
> a hostname associated with that interface. Then the network components will 
> use that hostname to advertise the component to other processes. That 
> hostname is also the one used for the akka system identifier. In some cases, 
> that hostname is used as the bind hostname also (e.g. I think this happens in 
> the connection manager and possibly akka) - which will likely internally 
> result in a re-resolution of this to an IP address. In other cases (the web 
> UI and netty shuffle) we seem to bind to all interfaces.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark

2015-01-06 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5113:
---
Description: 
Spark has multiple network components that start servers and advertise their 
network addresses to other processes.

We should go through each of these components and make sure they have 
consistent and/or documented behavior wrt (a) what interface(s) they bind to 
and (b) what hostname they use to advertise themselves to other processes. We 
should document this clearly and explain to people what to do in different 
cases (e.g. EC2, dockerized containers, etc).

When Spark initializes, it will search for a network interface until it finds 
one that is not a loopback address. Then it will do a reverse DNS lookup for a 
hostname associated with that interface. Then the network components will use 
that hostname to advertise the component to other processes. In some cases, 
that hostname is used as the bind hostname also (e.g. I think this happens in 
the connection manager and possibly akka) - which will likely internally result 
in a re-resolution of this to an IP address. In other cases (the web UI and 
netty shuffle) we seem to bind to all interfaces.

  was:
Spark has multiple network components that start servers and advertise their 
network addresses to other processes.

We should go through each of these components and make sure they have 
consistent and/or documented behavior wrt (a) what interface(s) they bind to 
and (b) what hostname they use to advertise themselves to other processes. We 
should document this clearly and explain to people what to do in different 
cases (e.g. EC2, dockerized containers, etc).

When Spark initializes, it will search for a network interface until it finds 
one that is not a loopback address. Then it will do a reverse DNS lookup for a 
hostname associated with that interface. Then the network components will use 
that hostname to advertise the component to other processes. In some cases, 
that hostname is used as the bind interface also (e.g. I think this happens in 
the connection manager and possibly akka). In other cases (the web UI and netty 
shuffle) we seem to bind to all interfaces.


> Audit and document use of hostnames and IP addresses in Spark
> -
>
> Key: SPARK-5113
> URL: https://issues.apache.org/jira/browse/SPARK-5113
> Project: Spark
>  Issue Type: Bug
>    Reporter: Patrick Wendell
>Priority: Critical
>
> Spark has multiple network components that start servers and advertise their 
> network addresses to other processes.
> We should go through each of these components and make sure they have 
> consistent and/or documented behavior wrt (a) what interface(s) they bind to 
> and (b) what hostname they use to advertise themselves to other processes. We 
> should document this clearly and explain to people what to do in different 
> cases (e.g. EC2, dockerized containers, etc).
> When Spark initializes, it will search for a network interface until it finds 
> one that is not a loopback address. Then it will do a reverse DNS lookup for 
> a hostname associated with that interface. Then the network components will 
> use that hostname to advertise the component to other processes. In some 
> cases, that hostname is used as the bind hostname also (e.g. I think this 
> happens in the connection manager and possibly akka) - which will likely 
> internally result in a re-resolution of this to an IP address. In other cases 
> (the web UI and netty shuffle) we seem to bind to all interfaces.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark

2015-01-06 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-5113:
--

 Summary: Audit and document use of hostnames and IP addresses in 
Spark
 Key: SPARK-5113
 URL: https://issues.apache.org/jira/browse/SPARK-5113
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Priority: Critical


Spark has multiple network components that start servers and advertise their 
network addresses to other processes.

We should go through each of these components and make sure they have 
consistent and/or documented behavior wrt (a) what interface(s) they bind to 
and (b) what hostname they use to advertise themselves to other processes. We 
should document this clearly and explain to people what to do in different 
cases (e.g. EC2, dockerized containers, etc).

When Spark initializes, it will search for a network interface until it finds 
one that is not a loopback address. Then it will do a reverse DNS lookup for a 
hostname associated with that interface. Then the network components will use 
that hostname to advertise the component to other processes. In some cases, 
that hostname is used as the bind interface also (e.g. I think this happens in 
the connection manager and possibly akka). In other cases (the web UI and netty 
shuffle) we seem to bind to all interfaces.
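
A minimal sketch of the lookup behavior described above, simplified for illustration; this is not the actual Spark code, which handles more cases:

{code}
import java.net.{InetAddress, NetworkInterface}
import scala.collection.JavaConverters._

// Walk the interfaces, skip loopback addresses, then do a reverse DNS lookup
// on the first usable address to get the hostname that would be advertised.
val nonLoopback: Option[InetAddress] =
  NetworkInterface.getNetworkInterfaces.asScala
    .flatMap(_.getInetAddresses.asScala)
    .find(addr => !addr.isLoopbackAddress)

val advertisedHostname: Option[String] = nonLoopback.map(_.getCanonicalHostName)
{code}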



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4687) SparkContext#addFile doesn't keep file folder information

2015-01-05 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265416#comment-14265416
 ] 

Patrick Wendell commented on SPARK-4687:


I spent some more time looking at this and talking with [~sandyr] and 
[~joshrosen]. I think having some limited version of this is fine given that, 
from what I can tell, this is pretty difficult to implement outside of Spark. I 
am going to post further comments on the JIRA.

> SparkContext#addFile doesn't keep file folder information
> -
>
> Key: SPARK-4687
> URL: https://issues.apache.org/jira/browse/SPARK-4687
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Jimmy Xiang
>
> Files added with SparkContext#addFile are loaded with Utils#fetchFile before 
> a task starts. However, Utils#fetchFile puts all files under the Spark root 
> on the worker node. We should have an option to keep the folder information. 
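
A minimal sketch of the behavior being described, with illustrative paths and assuming a SparkContext sc is in scope (as in the shell): the file is distributed, but executors resolve it by bare file name, so the folder structure is lost.

{code}
import org.apache.spark.SparkFiles

// Driver side: ship a file that lives inside a nested directory.
sc.addFile("/home/user/conf/nested/app.properties")

// Executor side: the file is looked up by name only, directly under the
// application's Spark work directory -- the nested/ prefix is gone.
val localPath: String = SparkFiles.get("app.properties")
{code}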



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4737) Prevent serialization errors from ever crashing the DAG scheduler

2015-01-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4737:
---
Affects Version/s: 1.0.2
   1.1.1

> Prevent serialization errors from ever crashing the DAG scheduler
> -
>
> Key: SPARK-4737
> URL: https://issues.apache.org/jira/browse/SPARK-4737
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>    Reporter: Patrick Wendell
>Assignee: Matthew Cheah
>Priority: Blocker
>
> Currently in Spark we assume that when tasks are serialized in the 
> TaskSetManager that the serialization cannot fail. We assume this because 
> upstream in the DAGScheduler we attempt to catch any serialization errors by 
> serializing a single partition. However, in some cases this upstream test is 
> not accurate - i.e. an RDD can have one partition that can serialize cleanly 
> but not others.
> To do this in the proper way, we need to catch and propagate the exception at 
> the time of serialization. The tricky bit is making sure it gets propagated 
> in the right way.
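
A minimal sketch of the general idea, assuming plain Java serialization; this is not the actual TaskSetManager change, it only shows surfacing the failure as a value so the caller can fail the task set instead of letting the scheduler thread crash:

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.util.{Failure, Success, Try}

// Returns the serialized bytes, or the exception (e.g. NotSerializableException)
// for the caller to handle by aborting the stage rather than throwing.
def trySerialize(task: AnyRef): Either[Throwable, Array[Byte]] =
  Try {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(task) finally out.close()
    buffer.toByteArray
  } match {
    case Success(bytes) => Right(bytes)
    case Failure(error) => Left(error)
  }
{code}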



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Spark UI history job duration is wrong

2015-01-05 Thread Patrick Wendell
Thanks for reporting this - it definitely sounds like a bug. Please
open a JIRA for it. My guess is that we define the start or end time
of the job based on the current time instead of looking at data
encoded in the underlying event stream. That would cause it to not
work properly when loading from historical data.

- Patrick

On Mon, Jan 5, 2015 at 12:25 PM, Olivier Toupin
 wrote:
> Hello,
>
> I'm using Spark 1.2.0 and when running an application, if I go into the UI
> and then in the job tab ("/jobs/"), the job durations are accurate and the
> posted durations look ok.
>
> However when I open the history ("history/app-/jobs/") for that job,
> the durations are wrong, showing milliseconds instead of the actual job
> time. The submitted time for each job (except maybe the first) is also
> different.
>
> The stage tab is unaffected and shows the correct duration for each stage in
> both modes.
>
> Should I open a bug?
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-UI-history-job-duration-is-wrong-tp10010.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark driver main thread hanging after SQL insert

2015-01-02 Thread Patrick Wendell
Hi Alessandro,

Can you create a JIRA for this rather than reporting it on the dev
list? That's where we track issues like this. Thanks!

- Patrick

On Wed, Dec 31, 2014 at 8:48 PM, Alessandro Baretta
 wrote:
> Here's what the console shows:
>
> 15/01/01 01:12:29 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 58.0,
> whose tasks have all completed, from pool
> 15/01/01 01:12:29 INFO scheduler.DAGScheduler: Stage 58 (runJob at
> ParquetTableOperations.scala:326) finished in 5493.549 s
> 15/01/01 01:12:29 INFO scheduler.DAGScheduler: Job 41 finished: runJob at
> ParquetTableOperations.scala:326, took 5493.747061 s
>
> It is now 01:40:03, so the driver has been hanging for the last 28 minutes.
> The web UI on the other hand shows that all tasks completed successfully,
> and the output directory has been populated--although the _SUCCESS file is
> missing.
>
> It is worth noting that my code started this job as its own thread. The
> actual code looks like the following snippet, modulo some simplifications.
>
>   def save_to_parquet(allowExisting: Boolean = false) = {
>     val threads = tables.map(table => {
>       val thread = new Thread {
>         override def run {
>           table.insertInto(table.table_name)
>         }
>       }
>       thread.start
>       thread
>     })
>     threads.foreach(_.join)
>   }
>
> As far as I can see the insertInto call never returns. Any idea why?
>
> Alex

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Created] (SPARK-5025) Write a guide for creating well-formed packages for Spark

2014-12-30 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-5025:
--

 Summary: Write a guide for creating well-formed packages for Spark
 Key: SPARK-5025
 URL: https://issues.apache.org/jira/browse/SPARK-5025
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Patrick Wendell


There is an increasing number of OSS projects providing utilities and 
extensions to Spark. We should write a guide in the Spark docs that explains 
how to create, package, and publish a third-party Spark library. There are a 
few issues here, such as how to declare your dependency on Spark and how to deal 
with your own third-party dependencies. We should also cover how to do this for 
Python libraries.

In general, we should make it easy to build extension points against any of 
Spark's APIs (e.g. for new data sources, streaming receivers, ML algorithms, etc.) 
and to self-publish libraries.
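
As one concrete example of the dependency question, a library built with sbt might declare Spark as a "provided" dependency so it compiles against Spark without bundling it, since the cluster supplies Spark at runtime (the project name and versions below are illustrative):

{code}
// build.sbt -- minimal sketch for a hypothetical third-party Spark library
name := "my-spark-extension"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // compile against Spark, but let the cluster provide it at runtime
  "org.apache.spark" %% "spark-core" % "1.2.0" % "provided",
  // the library's own third-party dependencies are declared normally
  "org.scalatest"    %% "scalatest"  % "2.2.1" % "test"
)
{code}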



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5008) Persistent HDFS does not recognize EBS Volumes

2014-12-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5008:
---
Labels:   (was: amazon aws ec2 hdfs persistent)

> Persistent HDFS does not recognize EBS Volumes
> --
>
> Key: SPARK-5008
> URL: https://issues.apache.org/jira/browse/SPARK-5008
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.2.0
> Environment: 8 Node Cluster Generated from 1.2.0 spark-ec2 script.
> -m c3.2xlarge -t c3.8xlarge --ebs-vol-size 300 --ebs-vol-type gp2 
> --ebs-vol-num 1
>Reporter: Brad Willard
>
> The cluster is built with correctly sized EBS volumes. It creates the volume at 
> /dev/xvds and mounts it to /vol0. However, when you start persistent HDFS 
> with the start-all script, it starts but isn't correctly configured to use the 
> EBS volume.
> I'm assuming some symlinks or expected mounts are not correctly configured.
> This has worked flawlessly on all previous versions of Spark.
> I have a stupid workaround: installing pssh and mounting the volume 
> to /vol, which worked; however, it doesn't survive restarts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5008) Persistent HDFS does not recognize EBS Volumes

2014-12-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5008:
---
Component/s: EC2

> Persistent HDFS does not recognize EBS Volumes
> --
>
> Key: SPARK-5008
> URL: https://issues.apache.org/jira/browse/SPARK-5008
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.2.0
> Environment: 8 Node Cluster Generated from 1.2.0 spark-ec2 script.
> -m c3.2xlarge -t c3.8xlarge --ebs-vol-size 300 --ebs-vol-type gp2 
> --ebs-vol-num 1
>Reporter: Brad Willard
>
> The cluster is built with correctly sized EBS volumes. It creates the volume at 
> /dev/xvds and mounts it to /vol0. However, when you start persistent HDFS 
> with the start-all script, it starts but isn't correctly configured to use the 
> EBS volume.
> I'm assuming some symlinks or expected mounts are not correctly configured.
> This has worked flawlessly on all previous versions of Spark.
> I have a stupid workaround: installing pssh and mounting the volume 
> to /vol, which worked; however, it doesn't survive restarts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4908) Spark SQL built for Hive 13 fails under concurrent metadata queries

2014-12-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4908:
---
Target Version/s: 1.2.1

> Spark SQL built for Hive 13 fails under concurrent metadata queries
> ---
>
> Key: SPARK-4908
> URL: https://issues.apache.org/jira/browse/SPARK-4908
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: David Ross
>Priority: Critical
>
> We are on trunk: {{1.3.0-SNAPSHOT}}, as of this commit: 
> https://github.com/apache/spark/commit/3d0c37b8118f6057a663f959321a79b8061132b6
> We are using Spark built for Hive 13, using this option:
> {{-Phive-0.13.1}}
> In single-threaded mode, normal operations look fine. However, under 
> concurrency, with at least 2 concurrent connections, metadata queries fail.
> For example, {{USE some_db}}, {{SHOW TABLES}}, and the implicit {{USE}} 
> statement when you pass a default schema in the JDBC URL, all fail.
> {{SELECT}} queries like {{SELECT * FROM some_table}} do not have this issue.
> Here is some example code:
> {code}
> object main extends App {
>   import java.sql._
>   import scala.concurrent._
>   import scala.concurrent.duration._
>   import scala.concurrent.ExecutionContext.Implicits.global
>   Class.forName("org.apache.hive.jdbc.HiveDriver")
>   val host = "localhost" // update this
>   val url = s"jdbc:hive2://${host}:10511/some_db" // update this
>   val future = Future.traverse(1 to 3) { i =>
> Future {
>   println("Starting: " + i)
>   try {
> val conn = DriverManager.getConnection(url)
>   } catch {
> case e: Throwable => e.printStackTrace()
> println("Failed: " + i)
>   }
>   println("Finishing: " + i)
> }
>   }
>   Await.result(future, 2.minutes)
>   println("done!")
> }
> {code}
> Here is the output:
> {code}
> Starting: 1
> Starting: 3
> Starting: 2
> java.sql.SQLException: 
> org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation 
> cancelled
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121)
>   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109)
>   at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231)
>   at 
> org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451)
>   at org.apache.hive.jdbc.HiveConnection.(HiveConnection.java:195)
>   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
>   at java.sql.DriverManager.getConnection(DriverManager.java:664)
>   at java.sql.DriverManager.getConnection(DriverManager.java:270)
>   at 
> com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896)
>   at 
> com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
>   at 
> com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Failed: 3
> Finishing: 3
> java.sql.SQLException: 
> org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation 
> cancelled
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121)
>   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109)
>   at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231)
>   at 
> org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451)
>   at org.apache.hive.jdbc.HiveConnection.(HiveConnection.java:195)
>   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
>   at java.sql.DriverManager.getConnection(DriverManager.java:664)
>   at java.sql.DriverManager.getConnection(DriverManager.java:270)
>   at 
> com.atscale.engine.

Re: action progress in ipython notebook?

2014-12-28 Thread Patrick Wendell
Hey Eric,

I'm just curious - which specific features in 1.2 do you find most
help with usability? This is a theme we're focusing on for 1.3 as
well, so it's helpful to hear what makes a difference.

- Patrick

On Sun, Dec 28, 2014 at 1:36 AM, Eric Friedman
 wrote:
> Hi Josh,
>
> Thanks for the informative answer. Sounds like one should await your changes
> in 1.3. As information, I found the following set of options for doing the
> visual in a notebook.
>
> http://nbviewer.ipython.org/github/ipython/ipython/blob/3607712653c66d63e0d7f13f073bde8c0f209ba8/docs/examples/notebooks/Animations_and_Progress.ipynb
>
>
> On Dec 27, 2014, at 4:07 PM, Josh Rosen  wrote:
>
> The console progress bars are implemented on top of a new stable "status
> API" that was added in Spark 1.2.  It's possible to query job progress using
> this interface (in older versions of Spark, you could implement a custom
> SparkListener and maintain the counts of completed / running / failed tasks
> / stages yourself).
>
> There are actually several subtleties involved in implementing "job-level"
> progress bars which behave in an intuitive way; there's a pretty extensive
> discussion of the challenges at https://github.com/apache/spark/pull/3009.
> Also, check out the pull request for the console progress bars for an
> interesting design discussion around how they handle parallel stages:
> https://github.com/apache/spark/pull/3029.
>
> I'm not sure about the plumbing that would be necessary to display live
> progress updates in the IPython notebook UI, though.  The general pattern
> would probably involve a mapping to relate notebook cells to Spark jobs (you
> can do this with job groups, I think), plus some periodic timer that polls
> the driver for the status of the current job in order to update the progress
> bar.
>
> For Spark 1.3, I'm working on designing a REST interface to access this
> type of job / stage / task progress information, as well as expanding the
> types of information exposed through the stable status API interface.
>
> - Josh
>
> On Thu, Dec 25, 2014 at 10:01 AM, Eric Friedman 
> wrote:
>>
>> Spark 1.2.0 is SO much more usable than previous releases -- many thanks
>> to the team for this release.
>>
>> A question about progress of actions.  I can see how things are
>> progressing using the Spark UI.  I can also see the nice ASCII art animation
>> on the spark driver console.
>>
>> Has anyone come up with a way to accomplish something similar in an
>> iPython notebook using pyspark?
>>
>> Thanks
>> Eric
>
>
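
Building on the description above, a rough sketch of polling the stable status API from a background thread (the local master, job group name, and sleep interval are illustrative; the notebook plumbing is omitted):

import org.apache.spark.{SparkConf, SparkContext}

object ProgressPoller extends App {
  val conf = new SparkConf().setAppName("progress-poll").setMaster("local[2]")
  val sc = new SparkContext(conf)
  sc.setJobGroup("notebook-cell-1", "demo job")   // ties the next job to a known group

  val poller = new Thread {
    override def run(): Unit = while (true) {
      val tracker = sc.statusTracker
      for {
        jobId   <- tracker.getJobIdsForGroup("notebook-cell-1")
        jobInfo <- tracker.getJobInfo(jobId)
        stageId <- jobInfo.stageIds()
        stage   <- tracker.getStageInfo(stageId)
      } println(s"job $jobId stage $stageId: " +
        s"${stage.numCompletedTasks()}/${stage.numTasks()} tasks done")
      Thread.sleep(500)
    }
  }
  poller.setDaemon(true)   // don't keep the JVM alive once the real work ends
  poller.start()

  // A job slow enough to watch: 20 tasks, each sleeping briefly per element.
  sc.parallelize(1 to 1000, 20).map { x => Thread.sleep(10); x }.count()
  sc.stop()
}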

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Long-running job cleanup

2014-12-28 Thread Patrick Wendell
What do you mean when you say "the overhead of spark shuffles start to
accumulate"? Could you elaborate more?

In newer versions of Spark shuffle data is cleaned up automatically
when an RDD goes out of scope. It is safe to remove shuffle data at
this point because the RDD can no longer be referenced. If you are
seeing a large build up of shuffle data, it's possible you are
retaining references to older RDDs inadvertently. Could you explain
what your job is actually doing?
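
A minimal sketch of the scoping this relies on, as it might be written in the spark-shell (the input path and key logic are hypothetical): keep each iteration's RDDs local to one method and unpersist them before returning, so nothing retains a reference and the associated shuffle data can be cleaned.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair-RDD functions on Spark 1.x

def runIteration(sc: SparkContext, i: Int): Long = {
  val data = sc.textFile(s"hdfs:///input/part-$i")   // hypothetical input path
  val counts = data.map(line => (line.take(4), 1L)).reduceByKey(_ + _).cache()
  val total = counts.count()
  counts.unpersist(blocking = false)   // release the cached blocks explicitly
  total   // no reference to data/counts escapes this method
}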

- Patrick

On Mon, Dec 22, 2014 at 2:36 PM, Ganelin, Ilya
 wrote:
> Hi all, I have a long-running job iterating over a huge dataset. Parts of
> this operation are cached. Since the job runs for so long, eventually the
> overhead of Spark shuffles starts to accumulate, culminating in the driver
> starting to swap.
>
> I am aware of the spark.cleaner.ttl parameter that allows me to configure
> when cleanup happens, but the issue with doing this is that it isn't done
> safely, e.g. I can be in the middle of processing a stage when this cleanup
> happens and my cached RDDs get cleared. This ultimately causes a
> KeyNotFoundException when I try to reference the now-cleared cached RDD.
> This behavior doesn't make much sense to me; I would expect the cached RDD
> either to get regenerated or, at the very least, for there to be an option to
> execute this cleanup without deleting those RDDs.
>
> Is there a programmatically safe way of doing this cleanup that doesn't
> break everything?
>
> If I instead tear down the spark context and bring up a new context for
> every iteration (assuming that each iteration is sufficiently long-lived),
> would memory get released appropriately?
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



ANNOUNCE: New build script ./build/mvn

2014-12-27 Thread Patrick Wendell
Hi All,

A consistent piece of feedback from Spark developers has been that the
Maven build is very slow. Typesafe provides a tool called Zinc which
improves Scala compilation speed substantially with Maven, but is
difficult to install and configure, especially for platforms other
than Mac OS.

I've just merged a patch (authored by Brennon York) that provides an
automatically configured Maven instance with Zinc embedded in Spark.
E.g.:

./build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.3 package

It is hard to test changes like this across all environments, so
please give this a spin and report any issues on the Spark JIRA. It is
working correctly if you see the following message during compilation:

[INFO] Using zinc server for incremental compilation

Note that developers preferring their own Maven installation are
unaffected by this and can just ignore this new feature.

Cheers,
- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Resolved] (SPARK-4501) Create build/mvn to automatically download maven/zinc/scalac

2014-12-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4501.

   Resolution: Fixed
Fix Version/s: 1.3.0

> Create build/mvn to automatically download maven/zinc/scalac
> 
>
> Key: SPARK-4501
> URL: https://issues.apache.org/jira/browse/SPARK-4501
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>        Reporter: Patrick Wendell
>Assignee: Brennon York
> Fix For: 1.3.0
>
>
> For a long time we've had the sbt/sbt script, and this works well for users who want 
> to build Spark with minimal dependencies (only Java). It would be nice to 
> generalize this to Maven as well and have build/sbt and build/mvn, where 
> build/mvn is a script that downloads Maven, Zinc, and Scala locally and sets 
> them up correctly. This would be totally "opt in", and people using system 
> Maven would be able to continue doing so.
> My sense is that very few Maven users are currently using Zinc, even though 
> in some basic tests I saw a huge improvement from using it. Also, having 
> a simple way to use Zinc would make it easier to use Maven on our Jenkins 
> test machines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


