[jira] [Resolved] (SPARK-5025) Write a guide for creating well-formed packages for Spark

2015-03-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5025.

Resolution: Won't Fix

I'm closing this as won't fix. There are now a bunch of community packages that
can serve as examples, so I think people can just follow those.

> Write a guide for creating well-formed packages for Spark
> -
>
> Key: SPARK-5025
> URL: https://issues.apache.org/jira/browse/SPARK-5025
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>
> There are an increasing number of OSS projects providing utilities and 
> extensions to Spark. We should write a guide in the Spark docs that explains 
> how to create, package, and publish a third party Spark library. There are a 
> few issues here such as how to list your dependency on Spark, how to deal 
> with your own third party dependencies, etc. We should also cover how to do 
> this for Python libraries.
> In general, we should make it easy to build extension points against any of 
> Spark's APIs (e.g. for new data sources, streaming receivers, ML algos, etc.) 
> and self-publish libraries.
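
A minimal sketch of the kind of build definition such a guide might show (the
sbt coordinates and versions below are illustrative assumptions, not from this
issue):

// build.sbt for a hypothetical third-party Spark library. Spark itself is
// marked "provided" so the cluster's Spark jars are used at runtime; the
// library's own third-party dependencies are declared normally.
name := "my-spark-extension"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.3.0" % "provided",
  "org.scalatest"    %% "scalatest"  % "2.2.4" % "test"
)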



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: RDD resiliency -- does it keep state?

2015-03-27 Thread Patrick Wendell
If you invoke this, you will get at-least-once semantics on failure.
For instance, if a machine dies in the middle of executing the foreach
for a single partition, that will be re-executed on another machine.
The foreach could even fully complete on one machine, with the machine dying
immediately before it reports the result back to the driver.

This means you need to make sure the side-effects are idempotent, or
use some transactional locking. Spark's own output operations, such as
saving to Hadoop, use such mechanisms. For instance, in the case of
Hadoop it uses the OutputCommitter classes.
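
As a concrete sketch (illustrative only, not Spark's own code): an idempotent
foreach can key each external write deterministically, so a re-executed
partition overwrites the same entries instead of duplicating them.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("idempotent-foreach").setMaster("local[2]"))
val rdd = sc.parallelize(1 to 100, 4)

rdd.foreachPartition { iter =>
  iter.foreach { x =>
    // A deterministic key derived from the record itself; repeating an upsert
    // under this key against an external store is harmless. println stands in
    // for the real external write in this sketch.
    val key = s"record-$x"
    println(s"UPSERT $key -> ${x * 2}")
  }
}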

- Patrick

On Fri, Mar 27, 2015 at 12:36 PM, Michal Klos  wrote:
> Hi Spark group,
>
> We haven't been able to find clear descriptions of how Spark handles the
> resiliency of RDDs in relation to executing actions with side-effects.
> If you do an `rdd.foreach(someSideEffect)`, then you are doing a side-effect
> for each element in the RDD. If a partition goes down, the resiliency
> rebuilds the data, but does it keep track of how far it got in the
> partition's set of data, or will it start from the beginning again? So will
> it do at-least-once execution of foreach closures or at-most-once?
>
> thanks,
> Michal

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[jira] [Commented] (SPARK-6544) Problem with Avro and Kryo Serialization

2015-03-27 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384459#comment-14384459
 ] 

Patrick Wendell commented on SPARK-6544:


Back-ported to 1.3.1 per discussion on issue.

> Problem with Avro and Kryo Serialization
> 
>
> Key: SPARK-6544
> URL: https://issues.apache.org/jira/browse/SPARK-6544
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Dean Chen
> Fix For: 1.3.1, 1.4.0
>
>
> We're running into the following bug with Avro 1.7.6 and the Kryo serializer,
> which causes jobs to fail:
> https://issues.apache.org/jira/browse/AVRO-1476?focusedCommentId=13999249&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13999249
> PR here
> https://github.com/apache/spark/pull/5193



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6544) Problem with Avro and Kryo Serialization

2015-03-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6544:
---
Fix Version/s: 1.3.1

> Problem with Avro and Kryo Serialization
> 
>
> Key: SPARK-6544
> URL: https://issues.apache.org/jira/browse/SPARK-6544
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Dean Chen
> Fix For: 1.3.1, 1.4.0
>
>
> We're running into the following bug with Avro 1.7.6 and the Kryo serializer,
> which causes jobs to fail:
> https://issues.apache.org/jira/browse/AVRO-1476?focusedCommentId=13999249&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13999249
> PR here
> https://github.com/apache/spark/pull/5193



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Spark 1.3 Source - Github and source tar does not seem to match

2015-03-27 Thread Patrick Wendell
The source code should match the Spark commit
4aaf48d46d13129f0f9bdafd771dd80fe568a7dc. Do you see any differences?

On Fri, Mar 27, 2015 at 11:28 AM, Manoj Samel  wrote:
> While looking into an issue, I noticed that the source displayed on the GitHub
> site does not match the downloaded tar for 1.3.
>
> Thoughts ?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[jira] [Commented] (SPARK-6561) Add partition support in saveAsParquet

2015-03-26 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383413#comment-14383413
 ] 

Patrick Wendell commented on SPARK-6561:


FYI - I just removed "Affects Version/s" since that field is only for bugs (to
indicate which version has the bug).

> Add partition support in saveAsParquet
> --
>
> Key: SPARK-6561
> URL: https://issues.apache.org/jira/browse/SPARK-6561
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jianshi Huang
>
> Now ParquetRelation2 supports automatic partition discovery which is very 
> nice. 
> When we save a DataFrame into Parquet files, we also want to have it 
> partitioned.
> The proposed API looks like this:
> {code}
> def saveAsParquetFile(path: String, partitionColumns: Seq[String])
> {code}
> Jianshi
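
For illustration, usage of the proposed API might look like the following (the
two-argument form is only the proposal above, not an existing method; the path
and column names are made up):

// Hypothetical usage of the proposed saveAsParquetFile(path, partitionColumns).
val df = sqlContext.parquetFile("/data/events_raw")
df.saveAsParquetFile("/data/events", Seq("date", "country"))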



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6561) Add partition support in saveAsParquet

2015-03-26 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6561:
---
Affects Version/s: (was: 1.3.1)
   (was: 1.3.0)

> Add partition support in saveAsParquet
> --
>
> Key: SPARK-6561
> URL: https://issues.apache.org/jira/browse/SPARK-6561
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jianshi Huang
>
> Now ParquetRelation2 supports automatic partition discovery which is very 
> nice. 
> When we save a DataFrame into Parquet files, we also want to have it 
> partitioned.
> The proposed API looks like this:
> {code}
> def saveAsParquetFile(path: String, partitionColumns: Seq[String])
> {code}
> Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6405) Spark Kryo buffer should be forced to be max. 2GB

2015-03-26 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-6405.

Resolution: Fixed
  Assignee: Matthew Cheah

> Spark Kryo buffer should be forced to be max. 2GB
> -
>
> Key: SPARK-6405
> URL: https://issues.apache.org/jira/browse/SPARK-6405
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Matt Cheah
>Assignee: Matthew Cheah
> Fix For: 1.4.0
>
>
> Kryo buffers used in serialization are backed by Java byte arrays, which have 
> a maximum size of 2GB. However, we blindly set the size without worrying 
> about numeric overflow or regard for the maximum array size. We should 
> enforce the maximum buffer size to be 2GB and warn the user when they have 
> exceeded that amount.
> I'm open to the idea of flat-out failing the initialization of the Spark 
> Context if the buffer size is over 2GB, but I'm afraid that could break 
> backwards-compatibility... although one can argue that the user had incorrect 
> buffer sizes in the first place.
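
A rough sketch of the kind of check being proposed (illustrative only, not the
actual Spark code; the config key used here is an assumption):

import org.apache.spark.SparkConf

// Validate the requested Kryo buffer size so it stays below the 2GB limit of a
// Java byte array instead of silently overflowing.
val conf = new SparkConf()
val maxBufferSizeMb = conf.getInt("spark.kryoserializer.buffer.max.mb", 64)
if (maxBufferSizeMb >= 2048) {
  throw new IllegalArgumentException(
    s"spark.kryoserializer.buffer.max.mb must be less than 2048, got: $maxBufferSizeMb")
}
val maxBufferSizeBytes = maxBufferSizeMb.toLong * 1024 * 1024  // kept as a Long to avoid overflow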



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6549) Spark console logger logs to stderr by default

2015-03-26 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-6549.

Resolution: Won't Fix

I think this is a won't fix due to compatibility issues. If I'm wrong, please
feel free to re-open.

> Spark console logger logs to stderr by default
> --
>
> Key: SPARK-6549
> URL: https://issues.apache.org/jira/browse/SPARK-6549
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.2.0
>Reporter: Pavel Sakun
>Priority: Trivial
>  Labels: log4j
>
> Spark's console logger is configured to log messages at INFO level to stderr
> by default, while it should be stdout:
> https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/log4j-defaults.properties
> https://github.com/apache/spark/blob/master/conf/log4j.properties.template



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Patrick Wendell
I think we have a version of mapPartitions that allows you to tell
Spark the partitioning is preserved:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L639

We could also add a map function that does the same. Or you can just write
your map using an iterator over each partition.
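
For example, something like this (a rough, untested sketch based on Zhan's
example quoted below):

val r1 = sc.parallelize(List(1, 2, 3, 4, 5, 5, 6, 6, 7, 8, 9, 10, 2, 4))
val r3 = r1.map((_, 1)).reduceByKey(_ + _)

// Rewrite the value-only map as mapPartitions and declare that the hash
// partitioning from reduceByKey is preserved, so the next reduceByKey can
// skip the shuffle.
val r4 = r3.mapPartitions(
  iter => iter.map { case (k, v) => (k, v + 1) },
  preservesPartitioning = true)

val r5 = r4.reduceByKey(_ + _)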

- Patrick

On Thu, Mar 26, 2015 at 3:07 PM, Jonathan Coveney  wrote:
> This is just a deficiency of the api, imo. I agree: mapValues could
> definitely be a function (K, V)=>V1. The option isn't set by the function,
> it's on the RDD. So you could look at the code and do this.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
>
>  def mapValues[U](f: V => U): RDD[(K, U)] = {
> val cleanF = self.context.clean(f)
> new MapPartitionsRDD[(K, U), (K, V)](self,
>   (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
>   preservesPartitioning = true)
>   }
>
> What you want:
>
>  def mapValues[U](f: (K, V) => U): RDD[(K, U)] = {
> val cleanF = self.context.clean(f)
> new MapPartitionsRDD[(K, U), (K, V)](self,
>   (context, pid, iter) => iter.map { case t@(k, _) => (k, cleanF(t)) },
>   preservesPartitioning = true)
>   }
>
> One of the nice things about spark is that making such new operators is very
> easy :)
>
> 2015-03-26 17:54 GMT-04:00 Zhan Zhang :
>
>> Thanks Jonathan. You are right regarding rewriting the example.
>>
>> I mean providing such an option to the developer so that it is controllable. The
>> example may seem silly, and I don't know the use cases.
>>
>> But for example, if I also want to operate on both the key and the value part to
>> generate some new value while keeping the key part untouched, then mapValues may
>> not be able to do this.
>>
>> Changing the code to allow this is trivial, but I don't know whether there
>> is some special reason behind this.
>>
>> Thanks.
>>
>> Zhan Zhang
>>
>>
>>
>>
>> On Mar 26, 2015, at 2:49 PM, Jonathan Coveney  wrote:
>>
>> I believe if you do the following:
>>
>>
>> sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString
>>
>> (8) MapPartitionsRDD[34] at reduceByKey at <console>:23 []
>>  |  MapPartitionsRDD[33] at mapValues at <console>:23 []
>>  |  ShuffledRDD[32] at reduceByKey at <console>:23 []
>>  +-(8) MapPartitionsRDD[31] at map at <console>:23 []
>> |  ParallelCollectionRDD[30] at parallelize at <console>:23 []
>>
>> The difference is that Spark has no way to know that your map closure
>> doesn't change the key. If you only use mapValues, it does. Pretty cool that
>> they optimized that :)
>>
>> 2015-03-26 17:44 GMT-04:00 Zhan Zhang :
>>>
>>> Hi Folks,
>>>
>>> Does anybody know the reason for not allowing preservesPartitioning
>>> in RDD.map? Am I missing something here?
>>>
>>> The following example involves two shuffles. I think if preservesPartitioning
>>> were allowed, we could avoid the second one, right?
>>>
>>>  val r1 = sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4))
>>>  val r2 = r1.map((_, 1))
>>>  val r3 = r2.reduceByKey(_+_)
>>>  val r4 = r3.map(x=>(x._1, x._2 + 1))
>>>  val r5 = r4.reduceByKey(_+_)
>>>  r5.collect.foreach(println)
>>>
>>> scala> r5.toDebugString
>>> res2: String =
>>> (8) ShuffledRDD[4] at reduceByKey at <console>:29 []
>>>  +-(8) MapPartitionsRDD[3] at map at <console>:27 []
>>> |  ShuffledRDD[2] at reduceByKey at <console>:25 []
>>> +-(8) MapPartitionsRDD[1] at map at <console>:23 []
>>>|  ParallelCollectionRDD[0] at parallelize at <console>:21 []
>>>
>>> Thanks.
>>>
>>> Zhan Zhang
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: hadoop input/output format advanced control

2015-03-25 Thread Patrick Wendell
Great - that's even easier. Maybe we could have a simple example in the doc.
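
For example, something along these lines (a sketch; the property and value are
just illustrations):

import org.apache.hadoop.conf.Configuration

// Copy the SparkContext-wide Hadoop conf, then override properties for a
// single input/output without touching the global settings.
val perRddConf = new Configuration(sc.hadoopConfiguration)
perRddConf.set("mapred.min.split.size", "134217728")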

On Wed, Mar 25, 2015 at 7:06 PM, Sandy Ryza  wrote:
> Regarding Patrick's question, you can just do "new Configuration(oldConf)"
> to get a cloned Configuration object and add any new properties to it.
>
> -Sandy
>
> On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid  wrote:
>
>> Hi Nick,
>>
>> I don't remember the exact details of these scenarios, but I think the user
>> wanted a lot more control over how the files got grouped into partitions,
>> to group the files together by some arbitrary function.  I didn't think
>> that was possible w/ CombineFileInputFormat, but maybe there is a way?
>>
>> thanks
>>
>> On Tue, Mar 24, 2015 at 1:50 PM, Nick Pentreath 
>> wrote:
>>
>> > Imran, on your point to read multiple files together in a partition, is
>> it
>> > not simpler to use the approach of copy Hadoop conf and set per-RDD
>> > settings for min split to control the input size per partition, together
>> > with something like CombineFileInputFormat?
>> >
>> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid 
>> > wrote:
>> >
>> > > I think this would be a great addition, I totally agree that you need
>> to
>> > be
>> > > able to set these at a finer context than just the SparkContext.
>> > >
>> > > Just to play devil's advocate, though -- the alternative is for you
>> just
>> > > subclass HadoopRDD yourself, or make a totally new RDD, and then you
>> > could
>> > > expose whatever you need.  Why is this solution better?  IMO the
>> criteria
>> > > are:
>> > > (a) common operations
>> > > (b) error-prone / difficult to implement
>> > > (c) non-obvious, but important for performance
>> > >
>> > > I think this case fits (a) & (c), so I think its still worthwhile.  But
>> > its
>> > > also worth asking whether or not its too difficult for a user to extend
>> > > HadoopRDD right now.  There have been several cases in the past week
>> > where
>> > > we've suggested that a user should read from hdfs themselves (eg., to
>> > read
>> > > multiple files together in one partition) -- with*out* reusing the code
>> > in
>> > > HadoopRDD, though they would lose things like the metric tracking &
>> > > preferred locations you get from HadoopRDD.  Does HadoopRDD need to
>> some
>> > > refactoring to make that easier to do?  Or do we just need a good
>> > example?
>> > >
>> > > Imran
>> > >
>> > > (sorry for hijacking your thread, Koert)
>> > >
>> > >
>> > >
>> > > On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers 
>> > wrote:
>> > >
>> > > > see email below. reynold suggested i send it to dev instead of user
>> > > >
>> > > > -- Forwarded message --
>> > > > From: Koert Kuipers 
>> > > > Date: Mon, Mar 23, 2015 at 4:36 PM
>> > > > Subject: hadoop input/output format advanced control
>> > > > To: "u...@spark.apache.org" 
>> > > >
>> > > >
>> > > > currently its pretty hard to control the Hadoop Input/Output formats
>> > used
>> > > > in Spark. The conventions seems to be to add extra parameters to all
>> > > > methods and then somewhere deep inside the code (for example in
>> > > > PairRDDFunctions.saveAsHadoopFile) all these parameters get
>> translated
>> > > into
>> > > > settings on the Hadoop Configuration object.
>> > > >
>> > > > for example for compression i see "codec: Option[Class[_ <:
>> > > > CompressionCodec]] = None" added to a bunch of methods.
>> > > >
>> > > > how scalable is this solution really?
>> > > >
>> > > > for example i need to read from a hadoop dataset and i dont want the
>> > > input
>> > > > (part) files to get split up. the way to do this is to set
>> > > > "mapred.min.split.size". now i dont want to set this at the level of
>> > the
>> > > > SparkContext (which can be done), since i dont want it to apply to
>> > input
>> > > > formats in general. i want it to apply to just this one specific
>> input
>> > > > dataset i need to read. which leaves me with no options currently. i
>> > > could
>> > > > go add yet another input parameter to all the methods
>> > > > (SparkContext.textFile, SparkContext.hadoopFile,
>> > SparkContext.objectFile,
>> > > > etc.). but that seems ineffective.
>> > > >
>> > > > why can we not expose a Map[String, String] or some other generic way
>> > to
>> > > > manipulate settings for hadoop input/output formats? it would require
>> > > > adding one more parameter to all methods to deal with hadoop
>> > input/output
>> > > > formats, but after that its done. one parameter to rule them all
>> > > >
>> > > > then i could do:
>> > > > val x = sc.textFile("/some/path", formatSettings =
>> > > > Map("mapred.min.split.size" -> "12345"))
>> > > >
>> > > > or
>> > > > rdd.saveAsTextFile("/some/path, formatSettings =
>> > > > Map(mapred.output.compress" -> "true",
>> > "mapred.output.compression.codec"
>> > > ->
>> > > > "somecodec"))
>> > > >
>> > >
>> >
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org

Re: hadoop input/output format advanced control

2015-03-25 Thread Patrick Wendell
Yeah, I agree that might have been nicer, but I think for consistency
with the input APIs maybe we should do the same thing. We can also
give an example of how to clone sc.hadoopConfiguration and then set
some new values:

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("k1", "v1")
conf.set("k2", "v2")

val rdd = sc.objectFile(..., conf)

I have no idea if that's the correct syntax, but something like that
seems almost as easy as passing a hashmap with deltas.

- Patrick

On Wed, Mar 25, 2015 at 6:34 AM, Koert Kuipers  wrote:
> my personal preference would be something like a Map[String, String] that
> only reflects the changes you want to make the Configuration for the given
> input/output format (so system wide defaults continue to come from
> sc.hadoopConfiguration), similarly to what cascading/scalding did, but an
> arbitrary Configuration will work too.
>
> i will make a jira and pullreq when i have some time.
>
>
>
> On Wed, Mar 25, 2015 at 1:23 AM, Patrick Wendell  wrote:
>>
>> I see - if you look, in the saving functions we have the option for
>> the user to pass an arbitrary Configuration.
>>
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L894
>>
>> It seems fine to have the same option for the loading functions, if
>> it's easy to just pass this config into the input format.
>>
>>
>>
>> On Tue, Mar 24, 2015 at 3:46 PM, Koert Kuipers  wrote:
>> > the (compression) codec parameter that is now part of many saveAs...
>> > methods
>> > came from a similar need. see SPARK-763
>> > hadoop has many options like this. you either going to have to allow
>> > many
>> > more of these optional arguments to all the methods that read from
>> > hadoop
>> > inputformats and write to hadoop outputformats, or you force people to
>> > re-create these methods using HadoopRDD, i think (if thats even
>> > possible).
>> >
>> > On Tue, Mar 24, 2015 at 6:40 PM, Koert Kuipers 
>> > wrote:
>> >>
>> >> i would like to use objectFile with some tweaks to the hadoop conf.
>> >> currently there is no way to do that, except recreating objectFile
>> >> myself.
>> >> and some of the code objectFile uses i have no access to, since its
>> >> private
>> >> to spark.
>> >>
>> >>
>> >> On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell 
>> >> wrote:
>> >>>
>> >>> Yeah - to Nick's point, I think the way to do this is to pass in a
>> >>> custom conf when you create a Hadoop RDD (that's AFAIK why the conf
>> >>> field is there). Is there anything you can't do with that feature?
>> >>>
>> >>> On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
>> >>>  wrote:
>> >>> > Imran, on your point to read multiple files together in a partition,
>> >>> > is
>> >>> > it
>> >>> > not simpler to use the approach of copy Hadoop conf and set per-RDD
>> >>> > settings for min split to control the input size per partition,
>> >>> > together
>> >>> > with something like CombineFileInputFormat?
>> >>> >
>> >>> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid 
>> >>> > wrote:
>> >>> >
>> >>> >> I think this would be a great addition, I totally agree that you
>> >>> >> need
>> >>> >> to be
>> >>> >> able to set these at a finer context than just the SparkContext.
>> >>> >>
>> >>> >> Just to play devil's advocate, though -- the alternative is for you
>> >>> >> just
>> >>> >> subclass HadoopRDD yourself, or make a totally new RDD, and then
>> >>> >> you
>> >>> >> could
>> >>> >> expose whatever you need.  Why is this solution better?  IMO the
>> >>> >> criteria
>> >>> >> are:
>> >>> >> (a) common operations
>> >>> >> (b) error-prone / difficult to implement
>> >>> >> (c) non-obvious, but important for performance
>> >>> >>
>> >>> >> I think this case fits (a) & (c), so I think its still worthwhile.
>> >>> >> But its
>> >>> >> also worth asking whether or not

[jira] [Commented] (SPARK-6481) Set "In Progress" when a PR is opened for an issue

2015-03-25 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380432#comment-14380432
 ] 

Patrick Wendell commented on SPARK-6481:


Hey All,

One issue here (I think?): right now, unfortunately, no users have sufficient 
permission to make the state change into "In Progress" because of the way the 
JIRA is currently set up. We don't expose the "Start Progress" button on any 
screen, so I think that makes it unavailable from the API call. 
At least, I just used my own credentials and I was not able to see the "Start 
Progress" transition on a JIRA, even though AFAIK I have the highest 
permissions possible.

The reason we do this, I think, is that we wanted to restrict assignment of 
JIRAs to the committership for now, and the "Start Progress" button 
automatically assigns the issue to the person who clicks it.

In my ideal world, typical users cannot modify this state transition, and an 
issue can only be put in progress via a GitHub pull request. If there is a 
permission scheme that allows that, then we should see about asking ASF to 
enable it for our JIRA.

In terms of assignment, I'd say for now just leave the assignment as it was 
before.

> Set "In Progress" when a PR is opened for an issue
> --
>
> Key: SPARK-6481
> URL: https://issues.apache.org/jira/browse/SPARK-6481
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Michael Armbrust
>Assignee: Nicholas Chammas
>
> [~pwendell] and I are not sure if this is possible, but it would be really 
> helpful if the JIRA status was updated to "In Progress" when we do the 
> linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6520) Kyro serialization broken in the shell

2015-03-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6520:
---
Component/s: Spark Shell

> Kyro serialization broken in the shell
> --
>
> Key: SPARK-6520
> URL: https://issues.apache.org/jira/browse/SPARK-6520
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.3.0
>Reporter: Aaron Defazio
>
> If I start spark as follows:
> {quote}
> ~/spark-1.3.0-bin-hadoop2.4/bin/spark-shell --master local[1] --conf 
> "spark.serializer=org.apache.spark.serializer.KryoSerializer"
> {quote}
> Then using :paste, run 
> {quote}
> case class Example(foo : String, bar : String)
> val ex = sc.parallelize(List(Example("foo1", "bar1"), Example("foo2", 
> "bar2"))).collect()
> {quote}
> I get the error:
> {quote}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.io.IOException: 
> com.esotericsoftware.kryo.KryoException: Error constructing instance of 
> class: $line3.$read
> Serialization trace:
> $VAL10 ($iwC)
> $outer ($iwC$$iwC)
> $outer ($iwC$$iwC$Example)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1140)
>   at 
> org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:979)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1873)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> {quote}
> As far as I can tell, when using :paste, Kryo serialization doesn't work for 
> classes defined within the same paste. It does work when the statements 
> are entered without :paste.
> This issue seems serious to me, since Kryo serialization is virtually 
> mandatory for performance (20x slower with default serialization on my 
> problem), and I'm assuming feature parity between spark-shell and 
> spark-submit is a goal.
> Note that this is different from SPARK-6497, which covers the case when Kryo 
> is set to require registration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6499) pyspark: printSchema command on a dataframe hangs

2015-03-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6499:
---
Component/s: PySpark

> pyspark: printSchema command on a dataframe hangs
> -
>
> Key: SPARK-6499
> URL: https://issues.apache.org/jira/browse/SPARK-6499
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: cynepia
> Attachments: airports.json, pyspark.txt
>
>
> 1. A printSchema() on a dataframe fails to respond even after a lot of time.
> Will attach the console logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6504) Cannot read Parquet files generated from different versions at once

2015-03-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6504:
---
Component/s: SQL

> Cannot read Parquet files generated from different versions at once
> ---
>
> Key: SPARK-6504
> URL: https://issues.apache.org/jira/browse/SPARK-6504
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Marius Soutier
>
> When trying to read Parquet files generated by Spark 1.1.1 and 1.2.1 at the 
> same time via 
> `sqlContext.parquetFile("fileFrom1.1.parquet,fileFrom1.2.parquet")` an 
> exception occurs:
> could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has 
> conflicting values: 
> [{"type":"struct","fields":[{"name":"date","type":"string","nullable":true,"metadata":{}},{"name":"account","type":"string","nullable":true,"metadata":{}},{"name":"impressions","type":"long","nullable":false,"metadata":{}},{"name":"cost","type":"double","nullable":false,"metadata":{}},{"name":"clicks","type":"long","nullable":false,"metadata":{}},{"name":"conversions","type":"long","nullable":false,"metadata":{}},{"name":"orderValue","type":"double","nullable":false,"metadata":{}}]},
>  StructType(List(StructField(date,StringType,true), 
> StructField(account,StringType,true), 
> StructField(impressions,LongType,false), StructField(cost,DoubleType,false), 
> StructField(clicks,LongType,false), StructField(conversions,LongType,false), 
> StructField(orderValue,DoubleType,false)))]
> The Schema is exactly equal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: hadoop input/output format advanced control

2015-03-24 Thread Patrick Wendell
I see - if you look, in the saving functions we have the option for
the user to pass an arbitrary Configuration.

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L894

It seems fine to have the same option for the loading functions, if
it's easy to just pass this config into the input format.



On Tue, Mar 24, 2015 at 3:46 PM, Koert Kuipers  wrote:
> the (compression) codec parameter that is now part of many saveAs... methods
> came from a similar need. see SPARK-763
> hadoop has many options like this. you either going to have to allow many
> more of these optional arguments to all the methods that read from hadoop
> inputformats and write to hadoop outputformats, or you force people to
> re-create these methods using HadoopRDD, i think (if thats even possible).
>
> On Tue, Mar 24, 2015 at 6:40 PM, Koert Kuipers  wrote:
>>
>> i would like to use objectFile with some tweaks to the hadoop conf.
>> currently there is no way to do that, except recreating objectFile myself.
>> and some of the code objectFile uses i have no access to, since its private
>> to spark.
>>
>>
>> On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell 
>> wrote:
>>>
>>> Yeah - to Nick's point, I think the way to do this is to pass in a
>>> custom conf when you create a Hadoop RDD (that's AFAIK why the conf
>>> field is there). Is there anything you can't do with that feature?
>>>
>>> On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
>>>  wrote:
>>> > Imran, on your point to read multiple files together in a partition, is
>>> > it
>>> > not simpler to use the approach of copy Hadoop conf and set per-RDD
>>> > settings for min split to control the input size per partition,
>>> > together
>>> > with something like CombineFileInputFormat?
>>> >
>>> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid 
>>> > wrote:
>>> >
>>> >> I think this would be a great addition, I totally agree that you need
>>> >> to be
>>> >> able to set these at a finer context than just the SparkContext.
>>> >>
>>> >> Just to play devil's advocate, though -- the alternative is for you
>>> >> just
>>> >> subclass HadoopRDD yourself, or make a totally new RDD, and then you
>>> >> could
>>> >> expose whatever you need.  Why is this solution better?  IMO the
>>> >> criteria
>>> >> are:
>>> >> (a) common operations
>>> >> (b) error-prone / difficult to implement
>>> >> (c) non-obvious, but important for performance
>>> >>
>>> >> I think this case fits (a) & (c), so I think its still worthwhile.
>>> >> But its
>>> >> also worth asking whether or not its too difficult for a user to
>>> >> extend
>>> >> HadoopRDD right now.  There have been several cases in the past week
>>> >> where
>>> >> we've suggested that a user should read from hdfs themselves (eg., to
>>> >> read
>>> >> multiple files together in one partition) -- with*out* reusing the
>>> >> code in
>>> >> HadoopRDD, though they would lose things like the metric tracking &
>>> >> preferred locations you get from HadoopRDD.  Does HadoopRDD need to
>>> >> some
>>> >> refactoring to make that easier to do?  Or do we just need a good
>>> >> example?
>>> >>
>>> >> Imran
>>> >>
>>> >> (sorry for hijacking your thread, Koert)
>>> >>
>>> >>
>>> >>
>>> >> On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers 
>>> >> wrote:
>>> >>
>>> >> > see email below. reynold suggested i send it to dev instead of user
>>> >> >
>>> >> > -- Forwarded message --
>>> >> > From: Koert Kuipers 
>>> >> > Date: Mon, Mar 23, 2015 at 4:36 PM
>>> >> > Subject: hadoop input/output format advanced control
>>> >> > To: "u...@spark.apache.org" 
>>> >> >
>>> >> >
>>> >> > currently its pretty hard to control the Hadoop Input/Output formats
>>> >> > used
>>> >> > in Spark. The conventions seems to be to add extra parameters to all
>>> >> > methods and then somewhere deep 

Re: 1.3 Hadoop File System problem

2015-03-24 Thread Patrick Wendell
Hey Jim,

Thanks for reporting this. Can you give a small end-to-end code
example that reproduces it? If so, we can definitely fix it.

- Patrick

On Tue, Mar 24, 2015 at 4:55 PM, Jim Carroll  wrote:
>
> I have code that works under 1.2.1 but when I upgraded to 1.3.0 it fails to
> find the s3 hadoop file system.
>
> I get the "java.lang.IllegalArgumentException: Wrong FS: s3://path to my
> file], expected: file:///" when I try to save a parquet file. This worked in
> 1.2.1.
>
> Has anyone else seen this?
>
> I'm running spark using "local[8]" so it's all internal. These are actually
> unit tests in our app that are failing now.
>
> Thanks.
> Jim
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/1-3-Hadoop-File-System-problem-tp22207.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Experience using binary packages on various Hadoop distros

2015-03-24 Thread Patrick Wendell
We can probably better explain that if you are not using HDFS or YARN,
you can download any binary.

However, my question was about whether the existing binaries work well with
newer Hadoop versions. I have heard some people suggest they do not, but I'm
looking for more specific issues.

On Tue, Mar 24, 2015 at 4:16 PM, Jey Kottalam  wrote:
> Could we gracefully fallback to an in-tree Hadoop binary (e.g. 1.0.4)
> in that case? I think many new Spark users are confused about why
> Spark has anything to do with Hadoop, e.g. I could see myself being
> confused when the download page asks me to select a "package type". I
> know that what I want is not "source code", but I'd have no idea how
> to choose amongst the apparently multiple types of binaries.
>
> On Tue, Mar 24, 2015 at 2:28 PM, Matei Zaharia  
> wrote:
>> Just a note, one challenge with the BYOH version might be that users who 
>> download that can't run in local mode without also having Hadoop. But if we 
>> describe it correctly then hopefully it's okay.
>>
>> Matei
>>
>>> On Mar 24, 2015, at 3:05 PM, Patrick Wendell  wrote:
>>>
>>> Hey All,
>>>
>>> For a while we've published binary packages with different Hadoop
>>> client's pre-bundled. We currently have three interfaces to a Hadoop
>>> cluster (a) the HDFS client (b) the YARN client (c) the Hive client.
>>>
>>> Because (a) and (b) are supposed to be backwards-compatible
>>> interfaces, my working assumption was that for the most part (modulo
>>> Hive) our packages work with *newer* Hadoop versions. For instance,
>>> our Hadoop 2.4 package should work with HDFS 2.6 and YARN 2.6.
>>> However, I have heard murmurings that these are not compatible in
>>> practice.
>>>
>>> So I have three questions I'd like to put out to the community:
>>>
>>> 1. Have people had difficulty using 2.4 packages with newer Hadoop
>>> versions? If so, what specific incompatibilities have you hit?
>>> 2. Have people had issues using our binary Hadoop packages in general
>>> with commercial or Apache Hadoop distro's, such that you have to build
>>> from source?
>>> 3. How would people feel about publishing a "bring your own Hadoop"
>>> binary, where you are required to point us to a local Hadoop
>>> distribution by setting HADOOP_HOME? This might be better for ensuring
>>> full compatibility:
>>> https://issues.apache.org/jira/browse/SPARK-6511
>>>
>>> - Patrick
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Experience using binary packages on various Hadoop distros

2015-03-24 Thread Patrick Wendell
Hey All,

For a while we've published binary packages with different Hadoop
client's pre-bundled. We currently have three interfaces to a Hadoop
cluster (a) the HDFS client (b) the YARN client (c) the Hive client.

Because (a) and (b) are supposed to be backwards-compatible
interfaces, my working assumption was that for the most part (modulo
Hive) our packages work with *newer* Hadoop versions. For instance,
our Hadoop 2.4 package should work with HDFS 2.6 and YARN 2.6.
However, I have heard murmurings that these are not compatible in
practice.

So I have three questions I'd like to put out to the community:

1. Have people had difficulty using 2.4 packages with newer Hadoop
versions? If so, what specific incompatibilities have you hit?
2. Have people had issues using our binary Hadoop packages in general
with commercial or Apache Hadoop distro's, such that you have to build
from source?
3. How would people feel about publishing a "bring your own Hadoop"
binary, where you are required to point us to a local Hadoop
distribution by setting HADOOP_HOME? This might be better for ensuring
full compatibility:
https://issues.apache.org/jira/browse/SPARK-6511

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: hadoop input/output format advanced control

2015-03-24 Thread Patrick Wendell
Yeah - to Nick's point, I think the way to do this is to pass in a
custom conf when you create a Hadoop RDD (that's AFAIK why the conf
field is there). Is there anything you can't do with that feature?
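
A sketch of what that can look like (illustrative only; this uses the new-API
analogue of the min-split-size setting Koert mentions in the quoted thread):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Clone the global Hadoop conf, tweak it for this one input, and pass it in
// when constructing the Hadoop RDD.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("mapreduce.input.fileinputformat.split.minsize", (256L * 1024 * 1024).toString)

val lines = sc.newAPIHadoopFile(
  "/some/path",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf).map(_._2.toString)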

On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
 wrote:
> Imran, on your point to read multiple files together in a partition, is it
> not simpler to use the approach of copy Hadoop conf and set per-RDD
> settings for min split to control the input size per partition, together
> with something like CombineFileInputFormat?
>
> On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid  wrote:
>
>> I think this would be a great addition, I totally agree that you need to be
>> able to set these at a finer context than just the SparkContext.
>>
>> Just to play devil's advocate, though -- the alternative is for you just
>> subclass HadoopRDD yourself, or make a totally new RDD, and then you could
>> expose whatever you need.  Why is this solution better?  IMO the criteria
>> are:
>> (a) common operations
>> (b) error-prone / difficult to implement
>> (c) non-obvious, but important for performance
>>
>> I think this case fits (a) & (c), so I think its still worthwhile.  But its
>> also worth asking whether or not its too difficult for a user to extend
>> HadoopRDD right now.  There have been several cases in the past week where
>> we've suggested that a user should read from hdfs themselves (eg., to read
>> multiple files together in one partition) -- with*out* reusing the code in
>> HadoopRDD, though they would lose things like the metric tracking &
>> preferred locations you get from HadoopRDD.  Does HadoopRDD need to some
>> refactoring to make that easier to do?  Or do we just need a good example?
>>
>> Imran
>>
>> (sorry for hijacking your thread, Koert)
>>
>>
>>
>> On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers  wrote:
>>
>> > see email below. reynold suggested i send it to dev instead of user
>> >
>> > -- Forwarded message --
>> > From: Koert Kuipers 
>> > Date: Mon, Mar 23, 2015 at 4:36 PM
>> > Subject: hadoop input/output format advanced control
>> > To: "u...@spark.apache.org" 
>> >
>> >
>> > currently its pretty hard to control the Hadoop Input/Output formats used
>> > in Spark. The conventions seems to be to add extra parameters to all
>> > methods and then somewhere deep inside the code (for example in
>> > PairRDDFunctions.saveAsHadoopFile) all these parameters get translated
>> into
>> > settings on the Hadoop Configuration object.
>> >
>> > for example for compression i see "codec: Option[Class[_ <:
>> > CompressionCodec]] = None" added to a bunch of methods.
>> >
>> > how scalable is this solution really?
>> >
>> > for example i need to read from a hadoop dataset and i dont want the
>> input
>> > (part) files to get split up. the way to do this is to set
>> > "mapred.min.split.size". now i dont want to set this at the level of the
>> > SparkContext (which can be done), since i dont want it to apply to input
>> > formats in general. i want it to apply to just this one specific input
>> > dataset i need to read. which leaves me with no options currently. i
>> could
>> > go add yet another input parameter to all the methods
>> > (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile,
>> > etc.). but that seems ineffective.
>> >
>> > why can we not expose a Map[String, String] or some other generic way to
>> > manipulate settings for hadoop input/output formats? it would require
>> > adding one more parameter to all methods to deal with hadoop input/output
>> > formats, but after that its done. one parameter to rule them all
>> >
>> > then i could do:
>> > val x = sc.textFile("/some/path", formatSettings =
>> > Map("mapred.min.split.size" -> "12345"))
>> >
>> > or
>> > rdd.saveAsTextFile("/some/path, formatSettings =
>> > Map(mapred.output.compress" -> "true", "mapred.output.compression.codec"
>> ->
>> > "somecodec"))
>> >
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Any guidance on when to back port and how far?

2015-03-24 Thread Patrick Wendell
My philosophy has been basically what you suggested, Sean. One thing
you didn't mention, though, is that if a bug fix seems complicated, I will
think very hard before back-porting it. This is because "fixes" can
introduce their own new bugs, in some cases worse than the original
issue. It's really bad to have someone upgrade to a patch release and see
a regression - with our current approach this almost never happens.

I will usually try to backport up to N-2, if it can be back-ported
reasonably easily (for instance, with minor or no code changes). The
reason I do this is that vendors do end up supporting older versions,
and it's nice for them if some committer has backported a fix that
they can then pull in, even if we never ship it.

In terms of doing older maintenance releases, this one I think we
should do according to severity of issues (for instance, if there is a
security issue) or based on general command from the community. I
haven't initiated many 1.X.2 releases recently because I didn't see
huge demand. However, personally I don't mind doing these if there is
a lot of demand, at least for releases where ".0" has gone out in the
last six months.

On Tue, Mar 24, 2015 at 11:23 AM, Michael Armbrust
 wrote:
> Two other criteria that I use when deciding what to backport:
>  - Is it a regression from a previous minor release?  I'm much more likely
> to backport fixes in this case, as I'd love for most people to stay up to
> date.
>  - How scary is the change?  I think the primary goal is stability of the
> maintenance branches.  When I am confident that something is isolated and
> unlikely to break things (i.e. I'm fixing a confusing error message), then
> i'm much more likely to backport it.
>
> Regarding the length of time to continue backporting, I mostly don't
> backport to N-1, but this is partially because SQL is changing too fast for
> that to generally be useful.  These old branches usually only get attention
> from me when there is an explicit request.
>
> I'd love to hear more feedback from others.
>
> Michael
>
> On Tue, Mar 24, 2015 at 6:13 AM, Sean Owen  wrote:
>
>> So far, my rule of thumb has been:
>>
>> - Don't back-port new features or improvements in general, only bug fixes
>> - Don't back-port minor bug fixes
>> - Back-port bug fixes that seem important enough to not wait for the
>> next minor release
>> - Back-port site doc changes to the release most likely to go out
>> next, to make it a part of the next site publish
>>
>> But, how far should back-ports go, in general? If the last minor
>> release was 1.N, then to branch 1.N surely. Farther back is a question
>> of expectation for support of past minor releases. Given the pace of
>> change and time available, I assume there's not much support for
>> continuing to use release 1.(N-1) and very little for 1.(N-2).
>>
>> Concretely: does anyone expect a 1.1.2 release ever? a 1.2.2 release?
>> It'd be good to hear the received wisdom explicitly.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Created] (SPARK-6511) Publish "hadoop provided" build with instructions for different distros

2015-03-24 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-6511:
--

 Summary: Publish "hadoop provided" build with instructions for 
different distros
 Key: SPARK-6511
 URL: https://issues.apache.org/jira/browse/SPARK-6511
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell


Currently we publish a series of binaries with different Hadoop client jars. 
This mostly works, but some users have reported compatibility issues with 
different distributions.

One improvement moving forward might be to publish a binary build that simply 
asks you to set HADOOP_HOME to pick up the Hadoop client location. That way it 
would work across multiple distributions, even if they have subtle 
incompatibilities with upstream Hadoop.

I think a first step for this would be to produce such a build for the 
community and see how well it works. One potential issue is that our fancy 
excludes and dependency re-writing won't work with the simpler "append Hadoop's 
classpath to Spark". Also, how we deal with the Hive dependency is unclear, 
i.e. should we continue to bundle Spark's Hive (which has some fixes for 
dependency conflicts) or do we allow for linking against vanilla Hive at 
runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-23 Thread Patrick Wendell
Hey Yiannis,

If you just perform a count on each "name", "date" pair... can it succeed?
If so, can you do a count and then order by to find the largest one?
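
Something like this should show whether a single group dominates (a sketch;
adjust the path and column names to your data):

// Count rows per ("name", "date") group and inspect the largest groups.
val df = sqlContext.parquetFile("/path/to/data")
val counts = df.groupBy("name", "date").count()
counts.orderBy(counts("count").desc).show()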

I'm wondering if there is a single pathologically large group here that is
somehow causing OOM.

Also, to be clear, you are getting GC limit warnings on the executors, not
the driver. Correct?

- Patrick

On Mon, Mar 23, 2015 at 10:21 AM, Martin Goodson 
wrote:

> Have you tried to repartition() your original data to make more partitions
> before you aggregate?
>
>
> --
> Martin Goodson  |  VP Data Science
> (0)20 3397 1240
>
> On Mon, Mar 23, 2015 at 4:12 PM, Yiannis Gkoufas 
> wrote:
>
>> Hi Yin,
>>
>> Yes, I have set spark.executor.memory to 8g and the worker memory to 16g
>> without any success.
>> I cannot figure out how to increase the number of mapPartitions tasks.
>>
>> Thanks a lot
>>
>> On 20 March 2015 at 18:44, Yin Huai  wrote:
>>
>>> spark.sql.shuffle.partitions only control the number of tasks in the
>>> second stage (the number of reducers). For your case, I'd say that the
>>> number of tasks in the first state (number of mappers) will be the number
>>> of files you have.
>>>
>>> Actually, have you changed "spark.executor.memory" (it controls the
>>> memory for an executor of your application)? I did not see it in your
>>> original email. The difference between worker memory and executor memory
>>> can be found at (
>>> http://spark.apache.org/docs/1.3.0/spark-standalone.html),
>>>
>>> SPARK_WORKER_MEMORY
>>> Total amount of memory to allow Spark applications to use on the
>>> machine, e.g. 1000m, 2g (default: total memory minus 1 GB); note that
>>> each application's individual memory is configured using its
>>> spark.executor.memory property.
>>>
>>>
>>> On Fri, Mar 20, 2015 at 9:25 AM, Yiannis Gkoufas 
>>> wrote:
>>>
 Actually I realized that the correct way is:

 sqlContext.sql("set spark.sql.shuffle.partitions=1000")

 but I am still experiencing the same behavior/error.

 On 20 March 2015 at 16:04, Yiannis Gkoufas 
 wrote:

> Hi Yin,
>
> the way I set the configuration is:
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> sqlContext.setConf("spark.sql.shuffle.partitions","1000");
>
> it is the correct way right?
> In the mapPartitions task (the first task which is launched), I get
> again the same number of tasks and again the same error. :(
>
> Thanks a lot!
>
> On 19 March 2015 at 17:40, Yiannis Gkoufas 
> wrote:
>
>> Hi Yin,
>>
>> thanks a lot for that! Will give it a shot and let you know.
>>
>> On 19 March 2015 at 16:30, Yin Huai  wrote:
>>
>>> Was the OOM thrown during the execution of first stage (map) or the
>>> second stage (reduce)? If it was the second stage, can you increase the
>>> value of spark.sql.shuffle.partitions and see if the OOM disappears?
>>>
>>> This setting controls the number of reduces Spark SQL will use and
>>> the default is 200. Maybe there are too many distinct values and the 
>>> memory
>>> pressure on every task (of those 200 reducers) is pretty high. You can
>>> start with 400 and increase it until the OOM disappears. Hopefully this
>>> will help.
>>>
>>> Thanks,
>>>
>>> Yin
>>>
>>>
>>> On Wed, Mar 18, 2015 at 4:46 PM, Yiannis Gkoufas <
>>> johngou...@gmail.com> wrote:
>>>
 Hi Yin,

 Thanks for your feedback. I have 1700 parquet files, sized 100MB
 each. The number of tasks launched is equal to the number of parquet 
 files.
 Do you have any idea on how to deal with this situation?

 Thanks a lot
 On 18 Mar 2015 17:35, "Yin Huai"  wrote:

> Seems there are too many distinct groups processed in a task,
> which triggers the problem.
>
> How many files does your dataset have and how large is each file? Seems
> your query will be executed with two stages, table scan and map-side
> aggregation in the first stage and the final round of reduce-side
> aggregation in the second stage. Can you take a look at the numbers of
> tasks launched in these two stages?
>
> Thanks,
>
> Yin
>
> On Wed, Mar 18, 2015 at 11:42 AM, Yiannis Gkoufas <
> johngou...@gmail.com> wrote:
>
>> Hi there, I set the executor memory to 8g but it didn't help
>>
>> On 18 March 2015 at 13:59, Cheng Lian 
>> wrote:
>>
>>> You should probably increase executor memory by setting
>>> "spark.executor.memory".
>>>
>>> Full list of available configurations can be found here
>>> http://spark.apache.org/docs/latest/configuration.html
>>>
>>> Cheng
>>>
>>>
>>> On 3/18/15 9:15 PM, Yiannis Gkoufas
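
Pulling the thread's suggestions together, a minimal illustrative sketch (the
path, memory size, and partition counts are placeholders, not values taken from
the thread):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// spark.executor.memory is the per-executor heap; SPARK_WORKER_MEMORY only caps
// what a worker can hand out to all executors on that machine.
val conf = new SparkConf()
  .setAppName("parquet-aggregation")
  .set("spark.executor.memory", "8g")
val sc = new SparkContext(conf)

val sqlContext = new SQLContext(sc)
// More reduce-side partitions lower the memory pressure on each reducer task.
sqlContext.setConf("spark.sql.shuffle.partitions", "1000")

val df = sqlContext.parquetFile("/data/events")   // illustrative path

// Repartitioning before the aggregation (as suggested above) also reduces the
// amount of data each map-side task has to process.
df.repartition(2000)
  .groupBy("name", "date").count()
  .saveAsParquetFile("/data/event-counts")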

[jira] [Reopened] (SPARK-6122) Upgrade Tachyon dependency to 0.6.0

2015-03-23 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reopened SPARK-6122:


I reverted this because it looks like it was responsible for some testing 
failures due to the dependency changes.

> Upgrade Tachyon dependency to 0.6.0
> ---
>
> Key: SPARK-6122
> URL: https://issues.apache.org/jira/browse/SPARK-6122
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.1
>Reporter: Haoyuan Li
>Assignee: Calvin Jia
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: enum-like types in Spark

2015-03-23 Thread Patrick Wendell
If the official solution from the Scala community is to use Java
enums, then it seems strange that they aren't generated in scaladoc? Maybe
we can just fix that w/ Typesafe's help and then we can use them.
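
For concreteness, a small Scala sketch (illustrative names, not actual Spark
code) of the enum-like pattern weighed against Java enums in the quoted thread
below, including the hand-maintained values/valueOf helpers that Java enums
provide for free:

sealed trait JobStatus
object JobStatus {
  case object Running   extends JobStatus
  case object Succeeded extends JobStatus
  case object Failed    extends JobStatus

  // Must be updated by hand whenever a value is added -- part of the
  // verbosity complaint below.
  val values: Seq[JobStatus] = Seq(Running, Succeeded, Failed)

  // Rough equivalent of Java's Enum.valueOf, matching on the case object name.
  def valueOf(name: String): JobStatus =
    values.find(_.toString == name).getOrElse(
      throw new IllegalArgumentException(s"Unknown JobStatus: $name"))
}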

On Mon, Mar 23, 2015 at 1:46 PM, Sean Owen  wrote:
> Yeah the fully realized #4, which gets back the ability to use it in
> switch statements (? in Scala but not Java?) does end up being kind of
> huge.
>
> I confess I'm swayed a bit back to Java enums, seeing what it
> involves. The hashCode() issue can be 'solved' with the hash of the
> String representation.
>
> On Mon, Mar 23, 2015 at 8:33 PM, Imran Rashid  wrote:
>> I've just switched some of my code over to the new format, and I just want
>> to make sure everyone realizes what we are getting into.  I went from 10
>> lines as java enums
>>
>> https://github.com/squito/spark/blob/fef66058612ebf225e58dd5f5fea6bae1afd5b31/core/src/main/java/org/apache/spark/status/api/StageStatus.java#L20
>>
>> to 30 lines with the new format:
>>
>> https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/status/api/v1/api.scala#L250
>>
>> it's not just that it's verbose.  Each name has to be repeated 4 times, with
>> potential typos in some locations that won't be caught by the compiler.
>> Also, you have to manually maintain the "values" as you update the set of
>> enums; the compiler won't do it for you.
>>
>> The only downside I've heard for java enums is enum.hashcode().  OTOH, the
>> downsides for this version are: maintainability / verbosity, no values(),
>> more cumbersome to use from java, no enum map / enumset.
>>
>> I did put together a little util to at least get back the equivalent of
>> enum.valueOf() with this format
>>
>> https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/util/SparkEnum.scala
>>
>> I'm not trying to prevent us from moving forward on this; it's fine if this
>> is still what everyone wants, but I feel pretty strongly java enums make
>> more sense.
>>
>> thanks,
>> Imran
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Commented] (SPARK-2331) SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]

2015-03-23 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376495#comment-14376495
 ] 

Patrick Wendell commented on SPARK-2331:


By the way - [~rxin] recently pointed out to me that EmptyRDD is private[spark].

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/EmptyRDD.scala#L27

Given that, I'm sort of confused how people were using it before. I'm not 
totally sure how making a class private[spark] affects its use in a return type.

> SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]
> --
>
> Key: SPARK-2331
> URL: https://issues.apache.org/jira/browse/SPARK-2331
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Ian Hummel
>Priority: Minor
>
> The return type for SparkContext.emptyRDD is EmptyRDD[T].
> It should be RDD[T].  That means you have to add extra type annotations on 
> code like the below (which creates a union of RDDs over some subset of paths 
> in a folder)
> {code}
> val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { 
> (rdd, path) ⇒
>   rdd.union(sc.textFile(path))
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4227) Document external shuffle service

2015-03-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4227:
---
Priority: Critical  (was: Major)

> Document external shuffle service
> -
>
> Key: SPARK-4227
> URL: https://issues.apache.org/jira/browse/SPARK-4227
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>Priority: Critical
>
> We should add spark.shuffle.service.enabled to the Configuration page and 
> give instructions for launching the shuffle service as an auxiliary service 
> on YARN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4227) Document external shuffle service

2015-03-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4227:
---
Target Version/s: 1.3.1, 1.4.0  (was: 1.3.0, 1.4.0)

> Document external shuffle service
> -
>
> Key: SPARK-4227
> URL: https://issues.apache.org/jira/browse/SPARK-4227
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>
> We should add spark.shuffle.service.enabled to the Configuration page and 
> give instructions for launching the shuffle service as an auxiliary service 
> on YARN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2858) Default log4j configuration no longer seems to work

2015-03-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2858.

Resolution: Invalid

This is really old and I don't think it's still an issue. I'm just closing this 
as invalid.

> Default log4j configuration no longer seems to work
> ---
>
> Key: SPARK-2858
> URL: https://issues.apache.org/jira/browse/SPARK-2858
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>
> For reasons unknown this doesn't seem to be working anymore. I deleted my 
> log4j.properties file and did a fresh build, and I noticed it still gave me a 
> verbose stack trace when port 4040 was contended (which is a log we silence 
> in the conf). I actually think this was an issue even before [~sowen]'s 
> changes, so not sure what's up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5863) Improve performance of convertToScala codepath.

2015-03-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5863:
---
Target Version/s: 1.4.0  (was: 1.3.1, 1.4.0)

> Improve performance of convertToScala codepath.
> ---
>
> Key: SPARK-5863
> URL: https://issues.apache.org/jira/browse/SPARK-5863
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1
>Reporter: Cristian
>Priority: Critical
>
> Was doing some perf testing on reading parquet files and noticed that moving 
> from Spark 1.1 to 1.2 the performance is 3x worse. In the profiler the 
> culprit showed up as being in ScalaReflection.convertRowToScala.
> Particularly this zip is the issue:
> {code}
> r.toSeq.zip(schema.fields.map(_.dataType))
> {code}
> I see there's a comment on that currently that this is slow but it wasn't 
> fixed. This actually produces a 3x degradation in parquet read performance, 
> at least in my test case.
> Edit: the map is part of the issue as well. This whole code block is in a 
> tight loop and allocates a new ListBuffer that needs to grow for each 
> transformation. A possible solution is to change to using seq.view, which 
> would allocate iterators instead.
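
For illustration (this is not the actual ScalaReflection code, the names are
made up, and the import paths shown are the 1.3 ones), a sketch of the kind of
change being suggested: hoist the per-row work out of the loop and avoid the
eager zip allocation:

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DataType, StructType}

// Precompute the dataTypes once per partition instead of once per row, and
// fill an array positionally instead of zipping on every row.
def convertPartition(rows: Iterator[Row], schema: StructType)
                    (convert: (Any, DataType) => Any): Iterator[Seq[Any]] = {
  val dataTypes: Array[DataType] = schema.fields.map(_.dataType)
  rows.map { r =>
    val out = new Array[Any](dataTypes.length)
    var i = 0
    while (i < dataTypes.length) {        // no per-row zip or ListBuffer growth
      out(i) = convert(r(i), dataTypes(i))
      i += 1
    }
    out.toSeq
  }
}
{code}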



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6012) Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator

2015-03-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6012:
---
Target Version/s: 1.4.0  (was: 1.3.1, 1.4.0)

> Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered 
> operator
> --
>
> Key: SPARK-6012
> URL: https://issues.apache.org/jira/browse/SPARK-6012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Max Seiden
>Priority: Critical
>
> h3. Summary
> I've found that a deadlock occurs when asking for the partitions from a 
> SchemaRDD that has a TakeOrdered as its terminal operator. The problem occurs 
> when a child RDD asks the DAGScheduler for preferred partition locations 
> (which locks the scheduler) and eventually hits the #execute() of the 
> TakeOrdered operator, which submits tasks but is blocked when it also tries 
> to get preferred locations (in a separate thread). It seems like the 
> TakeOrdered op's #execute() method should not actually submit a task (it is 
> calling #executeCollect() and creating a new RDD) and should instead stay 
> more true to the comment and logically apply a Limit on top of a Sort. 
> In my particular case, I am forcing a repartition of a SchemaRDD with a 
> terminal Limit(..., Sort(...)), which is where the CoalescedRDD comes into 
> play.
> h3. Stack Traces
> h4. Task Submission
> {noformat}
> "main" prio=5 tid=0x7f8e7280 nid=0x1303 in Object.wait() 
> [0x00010ed5e000]
>java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> - waiting on <0x0007c4c239b8> (a 
> org.apache.spark.scheduler.JobWaiter)
> at java.lang.Object.wait(Object.java:503)
> at 
> org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
> - locked <0x0007c4c239b8> (a org.apache.spark.scheduler.JobWaiter)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1390)
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:884)
> at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1161)
> at 
> org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:183)
> at 
> org.apache.spark.sql.execution.TakeOrdered.execute(basicOperators.scala:188)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
> - locked <0x0007c36ce038> (a 
> org.apache.spark.sql.hive.HiveContext$$anon$7)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
> at org.apache.spark.sql.SchemaRDD.getDependencies(SchemaRDD.scala:127)
> at 
> org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209)
> at 
> org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207)
> at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1278)
> at org.apache.spark.sql.SchemaRDD.getPartitions(SchemaRDD.scala:122)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
> at org.apache.spark.ShuffleDependency.(Dependency.scala:79)
> at 
> org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
> at 
> org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209)
> at 
> org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getP

[jira] [Updated] (SPARK-6012) Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator

2015-03-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6012:
---
Target Version/s: 1.3.1, 1.4.0  (was: 1.4.0)

> Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered 
> operator
> --
>
> Key: SPARK-6012
> URL: https://issues.apache.org/jira/browse/SPARK-6012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Max Seiden
>Priority: Critical
>
> h3. Summary
> I've found that a deadlock occurs when asking for the partitions from a 
> SchemaRDD that has a TakeOrdered as its terminal operator. The problem occurs 
> when a child RDD asks the DAGScheduler for preferred partition locations 
> (which locks the scheduler) and eventually hits the #execute() of the 
> TakeOrdered operator, which submits tasks but is blocked when it also tries 
> to get preferred locations (in a separate thread). It seems like the 
> TakeOrdered op's #execute() method should not actually submit a task (it is 
> calling #executeCollect() and creating a new RDD) and should instead stay 
> more true to the comment and logically apply a Limit on top of a Sort. 
> In my particular case, I am forcing a repartition of a SchemaRDD with a 
> terminal Limit(..., Sort(...)), which is where the CoalescedRDD comes into 
> play.
> h3. Stack Traces
> h4. Task Submission
> {noformat}
> "main" prio=5 tid=0x7f8e7280 nid=0x1303 in Object.wait() 
> [0x00010ed5e000]
>java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> - waiting on <0x0007c4c239b8> (a 
> org.apache.spark.scheduler.JobWaiter)
> at java.lang.Object.wait(Object.java:503)
> at 
> org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
> - locked <0x0007c4c239b8> (a org.apache.spark.scheduler.JobWaiter)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1390)
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:884)
> at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1161)
> at 
> org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:183)
> at 
> org.apache.spark.sql.execution.TakeOrdered.execute(basicOperators.scala:188)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
> - locked <0x0007c36ce038> (a 
> org.apache.spark.sql.hive.HiveContext$$anon$7)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
> at org.apache.spark.sql.SchemaRDD.getDependencies(SchemaRDD.scala:127)
> at 
> org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209)
> at 
> org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207)
> at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1278)
> at org.apache.spark.sql.SchemaRDD.getPartitions(SchemaRDD.scala:122)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
> at org.apache.spark.ShuffleDependency.(Dependency.scala:79)
> at 
> org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
> at 
> org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209)
> at 
> org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getP

[jira] [Commented] (SPARK-5863) Improve performance of convertToScala codepath.

2015-03-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375230#comment-14375230
 ] 

Patrick Wendell commented on SPARK-5863:


Ah actually - I see [~marmbrus] was the one who set target to 1.4.0, so I'm 
gonna remove 1.3.1

> Improve performance of convertToScala codepath.
> ---
>
> Key: SPARK-5863
> URL: https://issues.apache.org/jira/browse/SPARK-5863
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1
>Reporter: Cristian
>Priority: Critical
>
> Was doing some perf testing on reading parquet files and noticed that moving 
> from Spark 1.1 to 1.2 the performance is 3x worse. In the profiler the 
> culprit showed up as being in ScalaReflection.convertRowToScala.
> Particularly this zip is the issue:
> {code}
> r.toSeq.zip(schema.fields.map(_.dataType))
> {code}
> I see there's a comment on that currently that this is slow but it wasn't 
> fixed. This actually produces a 3x degradation in parquet read performance, 
> at least in my test case.
> Edit: the map is part of the issue as well. This whole code block is in a 
> tight loop and allocates a new ListBuffer that needs to grow for each 
> transformation. A possible solution is to change to using seq.view which 
> would allocate iterators instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5863) Improve performance of convertToScala codepath.

2015-03-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5863:
---
Target Version/s: 1.3.1, 1.4.0  (was: 1.4.0)

> Improve performance of convertToScala codepath.
> ---
>
> Key: SPARK-5863
> URL: https://issues.apache.org/jira/browse/SPARK-5863
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1
>Reporter: Cristian
>Priority: Critical
>
> Was doing some perf testing on reading parquet files and noticed that moving 
> from Spark 1.1 to 1.2 the performance is 3x worse. In the profiler the 
> culprit showed up as being in ScalaReflection.convertRowToScala.
> Particularly this zip is the issue:
> {code}
> r.toSeq.zip(schema.fields.map(_.dataType))
> {code}
> I see there's a comment on that currently that this is slow but it wasn't 
> fixed. This actually produces a 3x degradation in parquet read performance, 
> at least in my test case.
> Edit: the map is part of the issue as well. This whole code block is in a 
> tight loop and allocates a new ListBuffer that needs to grow for each 
> transformation. A possible solution is to change to using seq.view which 
> would allocate iterators instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5863) Improve performance of convertToScala codepath.

2015-03-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375229#comment-14375229
 ] 

Patrick Wendell commented on SPARK-5863:


This seems worth potentially fixing in 1.3.1, so I added that. I think it will 
depend on how surgical the fix is.

> Improve performance of convertToScala codepath.
> ---
>
> Key: SPARK-5863
> URL: https://issues.apache.org/jira/browse/SPARK-5863
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1
>Reporter: Cristian
>Priority: Critical
>
> Was doing some perf testing on reading parquet files and noticed that moving 
> from Spark 1.1 to 1.2 the performance is 3x worse. In the profiler the 
> culprit showed up as being in ScalaReflection.convertRowToScala.
> Particularly this zip is the issue:
> {code}
> r.toSeq.zip(schema.fields.map(_.dataType))
> {code}
> I see there's a comment on that currently that this is slow but it wasn't 
> fixed. This actually produces a 3x degradation in parquet read performance, 
> at least in my test case.
> Edit: the map is part of the issue as well. This whole code block is in a 
> tight loop and allocates a new ListBuffer that needs to grow for each 
> transformation. A possible solution is to change to using seq.view which 
> would allocate iterators instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6456) Spark Sql throwing exception on large partitioned data

2015-03-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6456:
---
Component/s: (was: Spark Core)

> Spark Sql throwing exception on large partitioned data
> --
>
> Key: SPARK-6456
> URL: https://issues.apache.org/jira/browse/SPARK-6456
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: pankaj
> Fix For: 1.2.1
>
>
> Observation:
> Spark connects with the Hive metastore. I am able to run simple queries like 
> show table and select,
> but it throws the exception below while running a query on a Hive table with a large 
> number of partitions.
> {code}
> Exception in thread "main" java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)
> at 
> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.thrift.transport.TTransportException: 
> java.net.SocketTimeoutException: Read timed out
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785)
> at 
> org.apache.spark.sql.hive.HiveShim$.getAllPartitionsOf(Shim13.scala:316)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:86)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253)
> at 
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
> at 
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:253)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:138)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6449) Driver OOM results in reported application result SUCCESS

2015-03-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6449:
---
Component/s: (was: Spark Core)
 YARN

> Driver OOM results in reported application result SUCCESS
> -
>
> Key: SPARK-6449
> URL: https://issues.apache.org/jira/browse/SPARK-6449
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Ryan Williams
>
> I ran a job yesterday that according to the History Server and YARN RM 
> finished with status {{SUCCESS}}.
> Clicking around on the history server UI, there were too few stages run, and 
> I couldn't figure out why that would have been.
> Finally, inspecting the end of the driver's logs, I saw:
> {code}
> 15/03/20 15:08:13 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
> 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
> Shutting down remote daemon.
> 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
> Remote daemon shut down; proceeding with flushing remote transports.
> 15/03/20 15:08:13 INFO spark.SparkContext: Successfully stopped SparkContext
> Exception in thread "Driver" scala.MatchError: java.lang.OutOfMemoryError: GC 
> overhead limit exceeded (of class java.lang.OutOfMemoryError)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485)
> 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0, (reason: Shutdown hook called before final status was reported.)
> 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Unregistering 
> ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before 
> final status was reported.)
> 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
> Remoting shut down.
> 15/03/20 15:08:13 INFO impl.AMRMClientImpl: Waiting for application to be 
> successfully unregistered.
> 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Deleting staging directory 
> .sparkStaging/application_1426705269584_0055
> {code}
> The driver OOM'd, [the {{catch}} block that presumably should have caught 
> it|https://github.com/apache/spark/blob/b6090f902e6ec24923b4dde4aabc9076956521c1/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L484]
>  threw a {{MatchError}}, and then {{SUCCESS}} was returned to YARN and 
> written to the event log.
> This should be logged as a failed job and reported as such to YARN.
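
To illustrate the failure mode (this is not the ApplicationMaster code; names
are made up): a Scala match over a caught Throwable that lists only the cases
it expects will itself throw scala.MatchError for anything else, such as an
OutOfMemoryError, so the real failure never reaches the status-reporting path.
A catch-all case avoids that:

{code}
// Mimics a handler that pattern-matches on the caught Throwable with no catch-all.
def report(t: Throwable): String = t match {
  case _: java.io.IOException  => "FAILED: IO error"
  case _: InterruptedException => "FAILED: interrupted"
  // An OutOfMemoryError matches neither case, so scala.MatchError is thrown here.
}

// Safer shape: the catch-all maps every unexpected Throwable to a failed status.
def reportSafely(t: Throwable): String = t match {
  case _: java.io.IOException => "FAILED: IO error"
  case other                  => s"FAILED: ${other.getClass.getName}"
}
{code}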



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact

2015-03-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reopened SPARK-4925:


Thanks for bringing this up. Actually - I realized this wasn't fixed by some of 
the other work we did. The issue is that we never published hive-thriftserver 
before (so simply undoing the changes I made didn't make this work for 
hive-thriftserver). We just need to add the -Phive-thriftserver profile here:

https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L122

If someone wants to send a patch I can merge it, and we can fix it for 1.3.1.

> Publish Spark SQL hive-thriftserver maven artifact 
> ---
>
> Key: SPARK-4925
> URL: https://issues.apache.org/jira/browse/SPARK-4925
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Alex Liu
>
> The hive-thriftserver maven artifact is needed for integrating Spark SQL with 
> Cassandra.
> Can we publish it to maven?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact

2015-03-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4925:
---
Target Version/s: 1.3.1

> Publish Spark SQL hive-thriftserver maven artifact 
> ---
>
> Key: SPARK-4925
> URL: https://issues.apache.org/jira/browse/SPARK-4925
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Alex Liu
>Priority: Critical
>
> The hive-thriftserver maven artifact is needed for integrating Spark SQL with 
> Cassandra.
> Can we publish it to maven?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact

2015-03-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4925:
---
Priority: Critical  (was: Major)

> Publish Spark SQL hive-thriftserver maven artifact 
> ---
>
> Key: SPARK-4925
> URL: https://issues.apache.org/jira/browse/SPARK-4925
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Alex Liu
>Priority: Critical
>
> The hive-thriftserver maven artifact is needed for integrating Spark SQL with 
> Cassandra.
> Can we publish it to maven?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact

2015-03-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4925:
---
Fix Version/s: (was: 1.2.1)
   (was: 1.3.0)

> Publish Spark SQL hive-thriftserver maven artifact 
> ---
>
> Key: SPARK-4925
> URL: https://issues.apache.org/jira/browse/SPARK-4925
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Alex Liu
>
> The hive-thriftserver maven artifact is needed for integrating Spark SQL with 
> Cassandra.
> Can we publish it to maven?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact

2015-03-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4925:
---
Affects Version/s: (was: 1.2.0)
   1.3.0
   1.2.1

> Publish Spark SQL hive-thriftserver maven artifact 
> ---
>
> Key: SPARK-4925
> URL: https://issues.apache.org/jira/browse/SPARK-4925
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Alex Liu
>Priority: Critical
>
> The hive-thriftserver maven artifact is needed for integrating Spark SQL with 
> Cassandra.
> Can we publish it to maven?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4123) Show dependency changes in pull requests

2015-03-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4123:
---
Summary: Show dependency changes in pull requests  (was: Show new 
dependencies added in pull requests)

> Show dependency changes in pull requests
> 
>
> Key: SPARK-4123
> URL: https://issues.apache.org/jira/browse/SPARK-4123
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>        Reporter: Patrick Wendell
>Assignee: Brennon York
>Priority: Critical
>
> We should inspect the classpath of Spark's assembly jar for every pull 
> request. This only takes a few seconds in Maven and it will help weed out 
> dependency changes from the master branch. Ideally we'd post any dependency 
> changes in the pull request message.
> {code}
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
> INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath
> $ git checkout apache/master
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
> INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath
> $ diff my-classpath master-classpath
> < chill-java-0.3.6.jar
> < chill_2.10-0.3.6.jar
> ---
> > chill-java-0.5.0.jar
> > chill_2.10-0.5.0.jar
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-03-20 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372266#comment-14372266
 ] 

Patrick Wendell commented on SPARK-5081:


Hey @cbbetz - the last movement on this is that I reached out to the snappy 
author and asked whether our upgrading of snappy could have resulted in 
different sizes of compressed intermediate data. However, he was fairly adamant 
that this is not the case.

Unfortunately, the reports here are somewhat inconsistent and there exists no 
simple reproduction of this issue. In fact I think it's likely there are 
multiple different things being discussed in this thread.

The way this can move forward is if someone is able to create a small 
reproduction that can be run by a Spark developer, then we can dig in and see 
what's going on. A reproduction would ideally demonstrate a verifiable 
regression between two versions of the upstream release, for instance showing 
much larger shuffle files, given the same input.

> Shuffle write increases
> ---
>
> Key: SPARK-5081
> URL: https://issues.apache.org/jira/browse/SPARK-5081
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.2.0
>Reporter: Kevin Jung
>Priority: Critical
> Attachments: Spark_Debug.pdf, diff.txt
>
>
> The size of shuffle write shown in the Spark web UI is much different when I 
> execute the same Spark job with the same input data in both Spark 1.1 and Spark 1.2. 
> At the sortBy stage, the size of shuffle write is 98.1MB in Spark 1.1 but 146.9MB 
> in Spark 1.2. 
> I set the spark.shuffle.manager option to hash because its default value changed, 
> but Spark 1.2 still writes more shuffle output than Spark 1.1.
> This can increase disk I/O overhead exponentially as the input file gets bigger, 
> and it causes the jobs to take more time to complete. 
> In the case of about 100GB input, for example, the size of shuffle write is 
> 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
> spark 1.1
> ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
> |9|saveAsTextFile| |1169.4KB| |
> |12|combineByKey| |1265.4KB|1275.0KB|
> |6|sortByKey| |1276.5KB| |
> |8|mapPartitions| |91.0MB|1383.1KB|
> |4|apply| |89.4MB| |
> |5|sortBy|155.6MB| |98.1MB|
> |3|sortBy|155.6MB| | |
> |1|collect| |2.1MB| |
> |2|mapValues|155.6MB| |2.2MB|
> |0|first|184.4KB| | |
> spark 1.2
> ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
> |12|saveAsTextFile| |1170.2KB| |
> |11|combineByKey| |1264.5KB|1275.0KB|
> |8|sortByKey| |1273.6KB| |
> |7|mapPartitions| |134.5MB|1383.1KB|
> |5|zipWithIndex| |132.5MB| |
> |4|sortBy|155.6MB| |146.9MB|
> |3|sortBy|155.6MB| | |
> |2|collect| |2.0MB| |
> |1|mapValues|155.6MB| |2.2MB|
> |0|first|184.4KB| | |



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6335) REPL :reset command also removes refs to SparkContext and SQLContext

2015-03-20 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371743#comment-14371743
 ] 

Patrick Wendell commented on SPARK-6335:


Yes - I agree with [~srowen] here. I don't think we need to go out of our way 
to support :reset with respect to our bootstrapping of the spark-shell state. 
I'd rather just ask users to terminate and restart the shell (only slightly 
more difficult, and much easier to support). Perhaps if we see a reset, we 
should just exit the shell to fail-fast.

> REPL :reset command also removes refs to SparkContext and SQLContext
> 
>
> Key: SPARK-6335
> URL: https://issues.apache.org/jira/browse/SPARK-6335
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 1.3.0
> Environment: Ubuntu 14.04 64-bit; spark-1.3.0-bin-hadoop2.4
>Reporter: Marko Bonaci
>Priority: Trivial
>
> I wasn't sure whether to mark it as a bug or an improvement, so I went for the 
> more moderate option, since this is a rather trivial, rarely used thing.
> Here's the repl printout:
> {code:java}
> 15/03/14 14:39:38 INFO SparkILoop: Created spark context..
> Spark context available as sc.
> 15/03/14 14:39:38 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> scala> val x = 8
> x: Int = 8
> scala> :reset
> Resetting repl state.
> Forgetting this session history:
> val x = 8
> Forgetting all expression results and named terms: $intp, sc, sqlContext, x
> scala> sc.parallelize(1 to 8)
> :8: error: not found: value sc
>   sc.parallelize(1 to 8)
>   ^
> scala> :quit
> Stopping spark context.
> :8: error: not found: value sc
>   sc.stop()
>   ^
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6401) Unable to load a old API input format in Spark streaming

2015-03-20 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371699#comment-14371699
 ] 

Patrick Wendell commented on SPARK-6401:


If this is a matter of just adding a simple wrapper, then why not just do it? 
Hadoop 2.X still supports the 1.X API's and there are legacy integrations 
there. It doesn't seem like any more work for us since we already support this 
in the core Spark API's.

> Unable to load a old API input format in Spark streaming
> 
>
> Key: SPARK-6401
> URL: https://issues.apache.org/jira/browse/SPARK-6401
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Rémy DUBOIS
>Priority: Minor
>
> The fileStream method of the JavaStreamingContext class does not allow using 
> an old API InputFormat.
> This feature exists in Spark batch but not in streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6403) Launch master as spot instance on EC2

2015-03-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6403:
---
Fix Version/s: (was: 1.2.1)

> Launch master as spot instance on EC2
> -
>
> Key: SPARK-6403
> URL: https://issues.apache.org/jira/browse/SPARK-6403
> Project: Spark
>  Issue Type: New Feature
>  Components: EC2
>Affects Versions: 1.2.1
>Reporter: Adam Vogel
>Priority: Minor
>
> Currently the spark_ec2.py script only supports requesting slaves as spot 
> instances. Launching the master as a spot instance has potential cost 
> savings, at the risk of losing the Spark cluster without warning. Unless 
> users include logic for relaunching slaves when lost, it is usually the case 
> that all slaves are lost simultaneously. Thus, for jobs which do not require 
> resilience to losing spot instances, being able to launch the master as a 
> spot instance saves money.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6403) Launch master as spot instance on EC2

2015-03-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6403:
---
Target Version/s:   (was: 1.2.1)

> Launch master as spot instance on EC2
> -
>
> Key: SPARK-6403
> URL: https://issues.apache.org/jira/browse/SPARK-6403
> Project: Spark
>  Issue Type: New Feature
>  Components: EC2
>Affects Versions: 1.2.1
>Reporter: Adam Vogel
>Priority: Minor
>
> Currently the spark_ec2.py script only supports requesting slaves as spot 
> instances. Launching the master as a spot instance has potential cost 
> savings, at the risk of losing the Spark cluster without warning. Unless 
> users include logic for relaunching slaves when lost, it is usually the case 
> that all slaves are lost simultaneously. Thus, for jobs which do not require 
> resilience to losing spot instances, being able to launch the master as a 
> spot instance saves money.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6404) Call broadcast() in each interval for spark streaming programs.

2015-03-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-6404.

Resolution: Invalid

I'm closing the issue because broadcast variables are immutable, so you can't 
change their value. An approach like the one Saisai suggested is better... just 
create a new broadcast variable for each batch. If there is some other issue 
with that, we can create a new JIRA to fix it.
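
A minimal sketch of that per-batch pattern (the stream, the map contents, and
loadReferenceData() are illustrative, not part of this issue):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper that re-reads whatever shared state the job needs.
def loadReferenceData(): Map[String, String] = Map("a" -> "A")

def tagWithReference(sc: SparkContext, stream: DStream[String]): DStream[String] =
  stream.transform { rdd: RDD[String] =>
    // Broadcast values are immutable, so create a fresh broadcast for each
    // batch instead of trying to mutate an existing one.
    val lookup = sc.broadcast(loadReferenceData())
    rdd.map(line => lookup.value.getOrElse(line, line))
  }
{code}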

> Call broadcast() in each interval for spark streaming programs.
> ---
>
> Key: SPARK-6404
> URL: https://issues.apache.org/jira/browse/SPARK-6404
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Yifan Wang
>
> If I understand it correctly, Spark’s broadcast() function will be called 
> only once at the beginning of the batch. For streaming applications that need 
> to run 24/7, it is often necessary to update variables that are shared by 
> broadcast() dynamically. It would be ideal if broadcast() could be called at 
> the beginning of each interval.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6414) Spark driver failed with NPE on job cancelation

2015-03-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6414:
---
Priority: Critical  (was: Major)

> Spark driver failed with NPE on job cancelation
> ---
>
> Key: SPARK-6414
> URL: https://issues.apache.org/jira/browse/SPARK-6414
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Yuri Makhno
>Priority: Critical
>
> When a job group is cancelled, we scan through all jobs to determine which 
> are members of the group. This scan assumes that the job group property is 
> always set. If 'properties' is null in an active job, you get an NPE.
> We just need to make sure we ignore ones where activeJob.properties is null. 
> We should also make sure it works if the particular property is missing.
> https://github.com/apache/spark/blob/branch-1.3/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L678
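
A hedged sketch of the null-safe check being described (the surrounding names
are illustrative; the property key is assumed to be the one SparkContext sets
for job groups, "spark.jobGroup.id"):

{code}
import java.util.Properties

// Stand-in for the scheduler's active-job type; illustrative only.
case class ActiveJob(jobId: Int, properties: Properties)

def jobsInGroup(activeJobs: Seq[ActiveJob], groupId: String): Seq[ActiveJob] =
  activeJobs.filter { job =>
    Option(job.properties)                                       // properties may be null
      .flatMap(p => Option(p.getProperty("spark.jobGroup.id")))  // key may be missing
      .exists(_ == groupId)
  }
{code}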



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6414) Spark driver failed with NPE on job cancelation

2015-03-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6414:
---
Description: 
When a job group is cancelled, we scan through all jobs to determine which are 
members of the group. This scan assumes that the job group property is always 
set. If 'properties' is null in an active job, you get an NPE.

We just need to make sure we ignore ones where activeJob.properties is null. We 
should also make sure it works if the particular property is missing.

https://github.com/apache/spark/blob/branch-1.3/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L678

  was:Spark failed with NPE in DAGScheduler.handleJobGroupCancelled:681 when 
there are some active jobs which don't have any properties (i.e. 
activeJob.properties is null)


> Spark driver failed with NPE on job cancelation
> ---
>
> Key: SPARK-6414
> URL: https://issues.apache.org/jira/browse/SPARK-6414
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Yuri Makhno
>
> When a job group is cancelled, we scan through all jobs to determine which 
> are members of the group. This scan assumes that the job group property is 
> always set. If 'properties' is null in an active job, you get an NPE.
> We just need to make sure we ignore ones where activeJob.properties is null. 
> We should also make sure it works if the particular property is missing.
> https://github.com/apache/spark/blob/branch-1.3/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L678



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6414) Spark driver failed with NPE on job cancelation

2015-03-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6414:
---
Affects Version/s: 1.3.0

> Spark driver failed with NPE on job cancelation
> ---
>
> Key: SPARK-6414
> URL: https://issues.apache.org/jira/browse/SPARK-6414
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Yuri Makhno
>
> Spark failed with NPE in DAGScheduler.handleJobGroupCancelled:681 when there 
> are some active jobs which don't have any properties (i.e. 
> activeJob.properties is null)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6436) io/netty missing from external shuffle service jars for yarn

2015-03-20 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371642#comment-14371642
 ] 

Patrick Wendell commented on SPARK-6436:


This might be related to SPARK-6070 by [~vanzin]. That removed some classes 
from the network jar. However, the reason was they were supposedly included in 
YARN already. Maybe it depends on the version of YARN?

> io/netty missing from external shuffle service jars for yarn
> 
>
> Key: SPARK-6436
> URL: https://issues.apache.org/jira/browse/SPARK-6436
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.3.0
>Reporter: Thomas Graves
>
> I was trying to use the external shuffle service on yarn but it appears that 
> io/netty isn't included in the network jars.  I loaded up network-common, 
> network-yarn, and network-shuffle.  If there is some other jar that is supposed to be 
> included, please let me know.
> 2015-03-20 14:25:07,142 [main] FATAL 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting 
> NodeManager
> java.lang.NoClassDefFoundError: io/netty/channel/EventLoopGroup
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockManager.(ExternalShuffleBlockManager.java:64)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.(ExternalShuffleBlockHandler.java:53)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:105)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6415) Spark Streaming fail-fast: Stop scheduling jobs when a batch fails, and kills the app

2015-03-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6415:
---
Component/s: Streaming

> Spark Streaming fail-fast: Stop scheduling jobs when a batch fails, and kills 
> the app
> -
>
> Key: SPARK-6415
> URL: https://issues.apache.org/jira/browse/SPARK-6415
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Hari Shreedharan
>
> Of course, this would have to be done as a configurable param, but such a 
> fail-fast is useful; otherwise it is painful to figure out what is happening when 
> there are cascading failures. In some cases, the SparkContext shuts down and 
> streaming keeps scheduling jobs 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: enum-like types in Spark

2015-03-16 Thread Patrick Wendell
Hey Xiangrui,

Do you want to write up a straw man proposal based on this line of discussion?

- Patrick

On Mon, Mar 16, 2015 at 12:12 PM, Kevin Markey  wrote:
> In some applications, I have rather heavy use of Java enums which are needed
> for related Java APIs that the application uses.  And unfortunately, they
> are also used as keys.  As such, using the native hashcodes makes any
> function over keys unstable and unpredictable, so we now use Enum.name() as
> the key instead.  Oh well.  But it works and seems to work well.
>
> Kevin
>
>
> On 03/05/2015 09:49 PM, Mridul Muralidharan wrote:
>>
>>I have a strong dislike for java enum's due to the fact that they
>> are not stable across JVM's - if it undergoes serde, you end up with
>> unpredictable results at times [1].
>> One of the reasons why we prevent enum's from being key : though it is
>> highly possible users might depend on it internally and shoot
>> themselves in the foot.
>>
>> Would be better to keep away from them in general and use something more
>> stable.
>>
>> Regards,
>> Mridul
>>
>> [1] Having had to debug this issue for 2 weeks - I really really hate it.
>>
>>
>> On Thu, Mar 5, 2015 at 1:08 PM, Imran Rashid  wrote:
>>>
>>> I have a very strong dislike for #1 (scala enumerations).   I'm ok with
>>> #4
>>> (with Xiangrui's final suggestion, especially making it sealed &
>>> available
>>> in Java), but I really think #2, java enums, are the best option.
>>>
>>> Java enums actually have some very real advantages over the other
>>> approaches -- you get values(), valueOf(), EnumSet, and EnumMap.  There
>>> has
>>> been endless debate in the Scala community about the problems with the
>>> approaches in Scala.  Very smart, level-headed Scala gurus have
>>> complained
>>> about their short-comings (Rex Kerr's name is coming to mind, though I'm
>>> not positive about that); there have been numerous well-thought out
>>> proposals to give Scala a better enum.  But the powers-that-be in Scala
>>> always reject them.  IIRC the explanation for rejecting is basically that
>>> (a) enums aren't important enough for introducing some new special
>>> feature,
>>> scala's got bigger things to work on and (b) if you really need a good
>>> enum, just use java's enum.
>>>
>>> I doubt it really matters that much for Spark internals, which is why I
>>> think #4 is fine.  But I figured I'd give my spiel, because every
>>> developer
>>> loves language wars :)
>>>
>>> Imran
>>>
>>>
>>>
>>> On Thu, Mar 5, 2015 at 1:35 AM, Xiangrui Meng  wrote:
>>>
>>>> `case object` inside an `object` doesn't show up in Java. This is the
>>>> minimal code I found to make everything show up correctly in both
>>>> Scala and Java:
>>>>
>>>> sealed abstract class StorageLevel // cannot be a trait
>>>>
>>>> object StorageLevel {
>>>>private[this] case object _MemoryOnly extends StorageLevel
>>>>final val MemoryOnly: StorageLevel = _MemoryOnly
>>>>
>>>>private[this] case object _DiskOnly extends StorageLevel
>>>>final val DiskOnly: StorageLevel = _DiskOnly
>>>> }
>>>>
>>>> On Wed, Mar 4, 2015 at 8:10 PM, Patrick Wendell 
>>>> wrote:
>>>>>
>>>>> I like #4 as well and agree with Aaron's suggestion.
>>>>>
>>>>> - Patrick
>>>>>
>>>>> On Wed, Mar 4, 2015 at 6:07 PM, Aaron Davidson 
>>>>
>>>> wrote:
>>>>>>
>>>>>> I'm cool with #4 as well, but make sure we dictate that the values
>>>>
>>>> should
>>>>>>
>>>>>> be defined within an object with the same name as the enumeration
>>>>>> (like
>>>>
>>>> we
>>>>>>
>>>>>> do for StorageLevel). Otherwise we may pollute a higher namespace.
>>>>>>
>>>>>> e.g. we SHOULD do:
>>>>>>
>>>>>> trait StorageLevel
>>>>>> object StorageLevel {
>>>>>>case object MemoryOnly extends StorageLevel
>>>>>>case object DiskOnly extends StorageLevel
>>>>>> }
>>>>>>
>>>>>> On Wed, Mar

[jira] [Updated] (SPARK-6362) Broken pipe error when training a RandomForest on a union of two RDDs

2015-03-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6362:
---
Component/s: PySpark

> Broken pipe error when training a RandomForest on a union of two RDDs
> -
>
> Key: SPARK-6362
> URL: https://issues.apache.org/jira/browse/SPARK-6362
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.2.0
> Environment: Kubuntu 14.04, local driver
>Reporter: Pavel Laskov
>Priority: Minor
>
> Training a RandomForest classifier on a dataset obtained as a union of two 
> RDDs throws a broken pipe error:
> Traceback (most recent call last):
>   File "/home/laskov/code/spark-1.2.1/python/pyspark/daemon.py", line 162, in 
> manager
> code = worker(sock)
>   File "/home/laskov/code/spark-1.2.1/python/pyspark/daemon.py", line 64, in 
> worker
> outfile.flush()
> IOError: [Errno 32] Broken pipe
> Despite the error, the job runs to completion. 
> The following code reproduces the error:
> from pyspark.context import SparkContext
> from pyspark.mllib.rand import RandomRDDs
> from pyspark.mllib.tree import RandomForest
> from pyspark.mllib.linalg import DenseVector
> from pyspark.mllib.regression import LabeledPoint
> import random
> if __name__ == "__main__":
>     sc = SparkContext(appName="Union bug test")
>     data1 = RandomRDDs.normalVectorRDD(sc, numRows=1, numCols=200)
>     data1 = data1.map(lambda x: LabeledPoint(random.randint(0, 1),
>                                              DenseVector(x)))
>     data2 = RandomRDDs.normalVectorRDD(sc, numRows=1, numCols=200)
>     data2 = data2.map(lambda x: LabeledPoint(random.randint(0, 1),
>                                              DenseVector(x)))
>     training_data = data1.union(data2)
>     # training_data = training_data.repartition(2)
>     model = RandomForest.trainClassifier(training_data, numClasses=2,
>                                          categoricalFeaturesInfo={},
>                                          numTrees=50, maxDepth=30)
> Interestingly, re-partitioning the data after the union operation rectifies 
> the problem (uncomment the line before training in the code above). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5310) Update SQL programming guide for 1.3

2015-03-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362757#comment-14362757
 ] 

Patrick Wendell commented on SPARK-5310:


[~lian cheng] and [~marmbrus] this resulted in a PR which I don't actually 
think was merged before I pushed the release docs. I'm going to try and update 
the published docs now. In any case - can this issue be closed?

> Update SQL programming guide for 1.3
> 
>
> Key: SPARK-5310
> URL: https://issues.apache.org/jira/browse/SPARK-5310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> We make quite a few changes. We should update the SQL programming guide to 
> reflect these changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6307) Executers fetches the same rdd-block 100's or 1000's of times

2015-03-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362689#comment-14362689
 ] 

Patrick Wendell commented on SPARK-6307:


Thanks for reporting this and attempting to isolate it. Is this issue 
deterministic or does it only happen sometimes? Also, is there any chance you 
can attach an input file (or write a program that generates its own input)? 
Getting a more isolated reproduction will help us figure out what's going on 
and fix it.
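
For example, something along these lines would work as a self-contained
reproduction - it generates its own input and mirrors the repartition +
cache + cartesian + filter + count pattern from the report (the sizes and
the filter predicate below are placeholders, not the actual workload):

import org.apache.spark.{SparkConf, SparkContext}

object Spark6307Repro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SPARK-6307 repro"))
    // Generate the input instead of reading a file from HDFS.
    val data = sc.parallelize(1 to 5000).map(_.toDouble).repartition(16).cache()
    // Pairwise comparison, as in the report.
    val pairs = data.cartesian(data).filter { case (x, y) => math.abs(x - y) < 0.5 }
    println(pairs.count())
    sc.stop()
  }
}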

> Executers fetches the same rdd-block 100's or 1000's of times
> -
>
> Key: SPARK-6307
> URL: https://issues.apache.org/jira/browse/SPARK-6307
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.2.0
> Environment: Linux, Spark Standalone 1.2, running in a PBS grid engine
>Reporter: Tobias Bertelsen
>
> The block manager kept fetching the same blocks over and over, making tasks 
> with network activity extremely slow. Two identical tasks can take anywhere 
> from 12 seconds to more than an hour (which is where I stopped it).
> Spark should cache the blocks, so it does not fetch the same blocks over, and 
> over, and over.
> Here is a simplified version of the code that provokes it:
> {code}
> // Read a few thousand lines (~ 15 MB)
> val fileContents = sc.newAPIHadoopFile(path, ..).repartition(16)
> val data = fileContents.map{x => parseContent(x)}.cache()
> // Do a pairwise comparison and count the best pairs
> val pairs = data.cartesian(data).filter { case (x, y) =>
>   similarity(x, y) > 0.9
> }
> pairs.count()
> {code}
> This is a tiny fraction of one of the worker's stderr:
> {code}
> 15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_2 remotely
> 15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_2 remotely
> 15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_1 remotely
> 15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_0 remotely
> Thousands more lines, fetching the same 16 remote blocks
> 15/03/12 22:25:44 INFO BlockManager: Found block rdd_8_0 remotely
> 15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
> 15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
> 15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
> 15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
> {code}
> h2. Details for that stage from the UI.
>  - *Total task time across all tasks:* 11.9 h
>  - *Input:* 2.2 GB
>  - *Shuffle read:* 4.5 MB
> h3. Summary Metrics for 176 Completed Tasks
> || Metric || Min || 25th percentile || Median || 75th percentile || Max ||
> | Duration | 7 s | 8 s | 8 s | 12 s | 59 min |
> | GC Time | 0 ms | 99 ms | 0.1 s | 0.2 s | 0.5 s |
> | Input | 6.9 MB | 8.2 MB | 8.4 MB | 9.0 MB | 11.0 MB |
> | Shuffle Read (Remote) | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 676.6 KB |
> h3. Aggregated Metrics by Executor
> || Executor ID || Address || Task Time || Total Tasks || Failed Tasks || 
> Succeeded Tasks || Input || Output || Shuffle Read || Shuffle Write || 
> Shuffle Spill (Memory) || Shuffle Spill (Disk) ||
> | 0 | n-62-23-3:49566 | 5.7 h | 9 | 0 | 9 | 171.0 MB | 0.0 B | 0.0 B | 0.0 B 
> | 0.0 B | 0.0 B |
> | 1 | n-62-23-6:57518 | 16.4 h | 20 | 0 | 20 | 169.9 MB | 0.0 B | 0.0 B | 0.0 
> B | 0.0 B | 0.0 B |
> | 2 | n-62-18-48:33551 | 0 ms | 0 | 0 | 0 | 169.6 MB | 0.0 B | 0.0 B | 0.0 B 
> | 0.0 B | 0.0 B |
> | 3 | n-62-23-5:58421 | 2.9 min | 12 | 0 | 12 | 266.2 MB | 0.0 B | 4.5 MB | 
> 0.0 B | 0.0 B | 0.0 B |
> | 4 | n-62-23-1:40096 | 23 min | 164 | 0 | 164 | 1430.4 MB | 0.0 B | 0.0 B | 
> 0.0 B | 0.0 B | 0.0 B |
> h3. Tasks
> || Index || ID || Attempt || Status || Locality Level || Executor ID / Host 
> || Launch Time || Duration || GC Time || Input || Shuffle Read || Errors ||
> | 1 | 2 | 0 | SUCCESS | ANY | 3 / n-62-23-5 | 2015/03/12 21:55:00 | 12 s | 
> 0.1 s | 6.9 MB (memory) | 676.6 KB || 
> | 0 | 1 | 0 | SUCCESS | ANY | 0 / n-62-23-3 | 2015/03/12 21:55:00 | 39 min | 
> 0.3 s | 8.7 MB (network) | 0.0 B || 
> | 4 | 5 | 0 | SUCCESS | ANY | 1 / n-62-23-6 | 2015/03/12 21:55:00 | 38 min | 
> 0.4 s | 8.6 MB (network) | 0.0 B || 
> | 3 | 4 | 0 | RUNNING | ANY | 2 / n-62-18-48 | 2015/03/12 21:55:00 | 55 min | 
>  | 8.3 MB (network) | 0.0 B || 
> | 2 | 3 | 0 | SUCCESS | ANY | 4 / n-62-23-1 | 2015/03/12 21:55:00 | 11 s | 
> 0.3 s | 8.4 MB (memory) | 0.0 B || 
> | 7 | 8 | 0 | SUCCESS | ANY | 4 / n-62-23-1 | 2015/03/12 21:55:00 | 12 s | 
> 0.3 s | 9.2 MB (memory) | 0.0 B || 
> | 6 | 7 | 0 | SUCCESS | ANY | 3 / n-62-23-

[jira] [Commented] (SPARK-6313) Fetch File Lock file creation doesnt work when Spark working dir is on a NFS mount

2015-03-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362687#comment-14362687
 ] 

Patrick Wendell commented on SPARK-6313:


[~joshrosen] changing default caching behavior seems like it could silently 
regress performance for the vast majority of users who aren't on NFS. What about 
a hotfix for 1.3.1 that just exposes the config for NFS users (a very small 
population), but doesn't change the default? That may be sufficient in 
itself... or if we want a real fix that makes it work out-of-the-box on NFS, 
we can put it in 1.4.

> Fetch File Lock file creation doesnt work when Spark working dir is on a NFS 
> mount
> --
>
> Key: SPARK-6313
> URL: https://issues.apache.org/jira/browse/SPARK-6313
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0, 1.2.1
>Reporter: Nathan McCarthy
>Priority: Critical
>
> When running in cluster mode and mounting the Spark work dir on an NFS volume 
> (or some volume which doesn't support file locking), the fetchFile method in 
> the Spark Utils class (used for downloading JARs etc. on the executors) will 
> fail. This file locking was introduced as an improvement with SPARK-2713. 
> See 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L415
> Introduced in 1.2 in commit: 
> https://github.com/apache/spark/commit/7aacb7bfad4ec73fd8f18555c72ef696 
> As this locking is an optimisation for fetching files, could we take a 
> different approach here and create a temp/advisory lock file? 
> Typically you would just mount local disks (in, say, ext4 format) and provide 
> this as a comma-separated list; however, we are trying to run Spark on MapR. 
> With MapR we can do a loopback mount to a volume on the local node and take 
> advantage of MapR's disk pools. This also means we don't need specific mounts 
> for Spark and improves the generic nature of the cluster. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6313) Fetch File Lock file creation doesnt work when Spark working dir is on a NFS mount

2015-03-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6313:
---
Target Version/s: 1.3.1

> Fetch File Lock file creation doesnt work when Spark working dir is on a NFS 
> mount
> --
>
> Key: SPARK-6313
> URL: https://issues.apache.org/jira/browse/SPARK-6313
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0, 1.2.1
>Reporter: Nathan McCarthy
>Priority: Critical
>
> When running in cluster mode and mounting the Spark work dir on an NFS volume 
> (or some volume which doesn't support file locking), the fetchFile method in 
> the Spark Utils class (used for downloading JARs etc. on the executors) will 
> fail. This file locking was introduced as an improvement with SPARK-2713. 
> See 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L415
> Introduced in 1.2 in commit: 
> https://github.com/apache/spark/commit/7aacb7bfad4ec73fd8f18555c72ef696 
> As this locking is an optimisation for fetching files, could we take a 
> different approach here and create a temp/advisory lock file? 
> Typically you would just mount local disks (in, say, ext4 format) and provide 
> this as a comma-separated list; however, we are trying to run Spark on MapR. 
> With MapR we can do a loopback mount to a volume on the local node and take 
> advantage of MapR's disk pools. This also means we don't need specific mounts 
> for Spark and improves the generic nature of the cluster. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Wrong version on the Spark documentation page

2015-03-15 Thread Patrick Wendell
Cheng - what if you hold shift+refresh? For me the /latest link
correctly points to 1.3.0

On Sun, Mar 15, 2015 at 10:40 AM, Cheng Lian  wrote:
> It's still marked as 1.2.1 here http://spark.apache.org/docs/latest/
>
> But this page is updated (1.3.0)
> http://spark.apache.org/docs/latest/index.html
>
> Cheng
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: May we merge into branch-1.3 at this point?

2015-03-13 Thread Patrick Wendell
Hey Sean,

Yes, go crazy. Once we close the release vote, it's open season to
merge backports into that release.

- Patrick

On Fri, Mar 13, 2015 at 9:31 AM, Mridul Muralidharan  wrote:
> Who is managing 1.3 release ? You might want to coordinate with them before
> porting changes to branch.
>
> Regards
> Mridul
>
> On Friday, March 13, 2015, Sean Owen  wrote:
>
>> Yeah, I'm guessing that is all happening quite literally as we speak.
>> The Apache git tag is the one of reference:
>>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4aaf48d46d13129f0f9bdafd771dd80fe568a7dc
>>
>> Open season on 1.3 branch then...
>>
>> On Fri, Mar 13, 2015 at 4:20 PM, Nicholas Chammas
>> > wrote:
>> > Looks like the release is out:
>> > http://spark.apache.org/releases/spark-release-1-3-0.html
>> >
>> > Though, interestingly, I think we are missing the appropriate v1.3.0 tag:
>> > https://github.com/apache/spark/releases
>> >
>> > Nick
>> >
>> > On Fri, Mar 13, 2015 at 6:07 AM Sean Owen > > wrote:
>> >>
>> >> Is the release certain enough that we can resume merging into
>> >> branch-1.3 at this point? I have a number of back-ports queued up and
>> >> didn't want to merge in case another last RC was needed. I see a few
>> >> commits to the branch though.
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
>> >> For additional commands, e-mail: dev-h...@spark.apache.org
>> 
>> >>
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
>> For additional commands, e-mail: dev-h...@spark.apache.org 
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Updated] (SPARK-4964) Exactly-once + WAL-free Kafka Support in Spark Streaming

2015-03-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4964:
---
Assignee: Cody Koeninger

> Exactly-once + WAL-free Kafka Support in Spark Streaming
> 
>
> Key: SPARK-4964
> URL: https://issues.apache.org/jira/browse/SPARK-4964
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Cody Koeninger
>Assignee: Cody Koeninger
> Fix For: 1.3.0
>
>
> There are two issues with the current Kafka support 
>  - Use of Write Ahead Logs in Spark Streaming to ensure no data is lost - 
> Causes data replication in both Kafka AND Spark Streaming. 
>  - Lack of exactly-once semantics - For background, see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html
> We want to solve both these problem in JIRA. Please see the following design 
> doc for the solution. 
> https://docs.google.com/a/databricks.com/document/d/1IuvZhg9cOueTf1mq4qwc1fhPb5FVcaRLcyjrtG4XU1k/edit#heading=h.itproy77j3p
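
For reference, a minimal sketch of how the resulting receiver-less,
WAL-free "direct" Kafka stream is consumed (assuming the
spark-streaming-kafka artifact is on the classpath; the broker list and
topic name below are placeholders, not from this issue):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectKafkaSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("direct-kafka"), Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    // Each batch's RDD corresponds directly to a set of Kafka offset ranges,
    // so no receiver or write ahead log is involved.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))
    stream.map(_._2).count().print()  // records per batch
    ssc.start()
    ssc.awaitTermination()
  }
}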



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[ANNOUNCE] Announcing Spark 1.3!

2015-03-13 Thread Patrick Wendell
Hi All,

I'm happy to announce the availability of Spark 1.3.0! Spark 1.3.0 is
the fourth release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 172 developers and more
than 1,000 commits!

Visit the release notes [1] to read about the new features, or
download [2] the release today.

For errata in the contributions or release notes, please e-mail me
*directly* (not on-list).

Thanks to everyone who helped work on this release!

[1] http://spark.apache.org/releases/spark-release-1-3-0.html
[2] http://spark.apache.org/downloads.html

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[ANNOUNCE] Announcing Spark 1.3!

2015-03-13 Thread Patrick Wendell
Hi All,

I'm happy to announce the availability of Spark 1.3.0! Spark 1.3.0 is
the fourth release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 172 developers and more
than 1,000 commits!

Visit the release notes [1] to read about the new features, or
download [2] the release today.

For errata in the contributions or release notes, please e-mail me
*directly* (not on-list).

Thanks to everyone who helped work on this release!

[1] http://spark.apache.org/releases/spark-release-1-3-0.html
[2] http://spark.apache.org/downloads.html

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[jira] [Updated] (SPARK-6313) Fetch File Lock file creation doesnt work when Spark working dir is on a NFS mount

2015-03-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6313:
---
Priority: Critical  (was: Major)

> Fetch File Lock file creation doesnt work when Spark working dir is on a NFS 
> mount
> --
>
> Key: SPARK-6313
> URL: https://issues.apache.org/jira/browse/SPARK-6313
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0, 1.2.1
>Reporter: Nathan McCarthy
>Priority: Critical
>
> When running in cluster mode and mounting the Spark work dir on an NFS volume 
> (or some volume which doesn't support file locking), the fetchFile method in 
> the Spark Utils class (used for downloading JARs etc. on the executors) will 
> fail. This file locking was introduced as an improvement with SPARK-2713. 
> See 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L415
> Introduced in 1.2 in commit: 
> https://github.com/apache/spark/commit/7aacb7bfad4ec73fd8f18555c72ef696 
> As this locking is an optimisation for fetching files, could we take a 
> different approach here and create a temp/advisory lock file? 
> Typically you would just mount local disks (in, say, ext4 format) and provide 
> this as a comma-separated list; however, we are trying to run Spark on MapR. 
> With MapR we can do a loopback mount to a volume on the local node and take 
> advantage of MapR's disk pools. This also means we don't need specific mounts 
> for Spark and improves the generic nature of the cluster. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6311) ChiSqTest should check for too few counts

2015-03-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-6311.

Resolution: Duplicate

> ChiSqTest should check for too few counts
> -
>
> Key: SPARK-6311
> URL: https://issues.apache.org/jira/browse/SPARK-6311
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> ChiSqTest assumes that elements of the contingency matrix are large enough 
> (have enough counts) s.t. the central limit theorem kicks in.  It would be 
> reasonable to do one or more of the following:
> * Add a note in the docs about making sure there are a reasonable number of 
> instances being used (or counts in the contingency table entries, to be more 
> precise and account for skewed category distributions).
> * Add a check in the code which could:
> ** Log a warning message
> ** Alter the p-value to make sure it indicates the test result is 
> insignificant



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6310) ChiSqTest should check for too few counts

2015-03-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-6310.

Resolution: Duplicate

> ChiSqTest should check for too few counts
> -
>
> Key: SPARK-6310
> URL: https://issues.apache.org/jira/browse/SPARK-6310
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> ChiSqTest assumes that elements of the contingency matrix are large enough 
> (have enough counts) s.t. the central limit theorem kicks in.  It would be 
> reasonable to do one or more of the following:
> * Add a note in the docs about making sure there are a reasonable number of 
> instances being used (or counts in the contingency table entries, to be more 
> precise and account for skewed category distributions).
> * Add a check in the code which could:
> ** Log a warning message
> ** Alter the p-value to make sure it indicates the test result is 
> insignificant



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5654) Integrate SparkR into Apache Spark

2015-03-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359166#comment-14359166
 ] 

Patrick Wendell commented on SPARK-5654:


I see the decision here as somewhat orthogonal to vendors and vendor packaging. 
Vendors can choose whether to package this component or not, and some may leave 
it out until it gets more mature. Of course, they are more encouraged/pressured 
to package things that end up inside the project itself, but that could be used 
to justify merging all kinds of random stuff into Spark, so I don't think it's 
a sufficient justification.

The main argument, as I said before, is just that non-JVM language APIs are 
really not possible to maintain outside of the project, because they aren't 
building on any even remotely "public" API. Imagine if we tried to have PySpark 
as its own project; it is so tightly coupled that it wouldn't work.

I have argued in the past for things to exist outside the project when they 
can, and I still promote that strongly.

> Integrate SparkR into Apache Spark
> --
>
> Key: SPARK-5654
> URL: https://issues.apache.org/jira/browse/SPARK-5654
> Project: Spark
>  Issue Type: New Feature
>  Components: Project Infra
>Reporter: Shivaram Venkataraman
>
> The SparkR project [1] provides a light-weight frontend to launch Spark jobs 
> from R. The project was started at the AMPLab around a year ago and has been 
> incubated as its own project to make sure it can be easily merged into 
> upstream Spark, i.e. not introduce any external dependencies etc. SparkR’s 
> goals are similar to PySpark and shares a similar design pattern as described 
> in our meetup talk[2], Spark Summit presentation[3].
> Integrating SparkR into the Apache project will enable R users to use Spark 
> out of the box and given R’s large user base, it will help the Spark project 
> reach more users.  Additionally, work in progress features like providing R 
> integration with ML Pipelines and Dataframes can be better achieved by 
> development in a unified code base.
> SparkR is available under the Apache 2.0 License and does not have any 
> external dependencies other than requiring users to have R and Java installed 
> on their machines.  SparkR’s developers come from many organizations 
> including UC Berkeley, Alteryx, Intel and we will support future development, 
> maintenance after the integration.
> [1] https://github.com/amplab-extras/SparkR-pkg
> [2] http://files.meetup.com/3138542/SparkR-meetup.pdf
> [3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4924) Factor out code to launch Spark applications into a separate library

2015-03-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4924.

   Resolution: Fixed
Fix Version/s: 1.4.0

Glad to finally have this in. Thanks for all the hard work [~vanzin]!
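
For anyone wondering what this enables: roughly the following, where an
ordinary JVM program builds and launches a Spark application as a child
process. This is only a sketch - the paths and class names are
placeholders, and the exact builder methods may differ slightly from
what shipped.

import org.apache.spark.launcher.SparkLauncher

object LaunchFromCode {
  def main(args: Array[String]): Unit = {
    val process = new SparkLauncher()
      .setSparkHome("/opt/spark")                  // placeholder install dir
      .setAppResource("/path/to/my-app.jar")       // placeholder application jar
      .setMainClass("com.example.MyApp")           // placeholder main class
      .setMaster("yarn-cluster")
      .launch()                                    // returns a java.lang.Process
    val exitCode = process.waitFor()
    println(s"Spark application exited with code $exitCode")
  }
}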

> Factor out code to launch Spark applications into a separate library
> 
>
> Key: SPARK-4924
> URL: https://issues.apache.org/jira/browse/SPARK-4924
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.4.0
>
> Attachments: spark-launcher.txt
>
>
> One of the questions we run into rather commonly is "how to start a Spark 
> application from my Java/Scala program?". There currently isn't a good answer 
> to that:
> - Instantiating SparkContext has limitations (e.g., you can only have one 
> active context at the moment, plus you lose the ability to submit apps in 
> cluster mode)
> - Calling SparkSubmit directly is doable but you lose a lot of the logic 
> handled by the shell scripts
> - Calling the shell script directly is doable,  but sort of ugly from an API 
> point of view.
> I think it would be nice to have a small library that handles that for users. 
> On top of that, this library could be used by Spark itself to replace a lot 
> of the code in the current shell scripts, which have a lot of duplication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: How to set per-user spark.local.dir?

2015-03-11 Thread Patrick Wendell
We don't support expressions or wildcards in that configuration. For
each application, the local directories need to be constant. If you
have users submitting different Spark applications, each of those
applications can set its own spark.local.dir.
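
For example, a minimal sketch (the per-user path below is a
placeholder): each application sets the property on its own SparkConf
before the SparkContext is created, or passes it with
--conf spark.local.dir=... to spark-submit.

import org.apache.spark.{SparkConf, SparkContext}

// spark.local.dir must be set before the SparkContext is created;
// it cannot be changed for a running application.
val conf = new SparkConf()
  .setAppName("per-user-local-dir")
  .set("spark.local.dir", "/x/home/alice/spark/tmp")
val sc = new SparkContext(conf)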

- Patrick

On Wed, Mar 11, 2015 at 12:14 AM, Jianshi Huang  wrote:
> Hi,
>
> I need to set per-user spark.local.dir, how can I do that?
>
> I tried both
>
>   /x/home/${user.name}/spark/tmp
> and
>   /x/home/${USER}/spark/tmp
>
> And neither worked. Looks like it has to be a constant setting in
> spark-defaults.conf. Right?
>
> Any ideas how to do that?
>
> Thanks,
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[RESULT] [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-10 Thread Patrick Wendell
This vote passes with 13 +1 votes (6 binding) and no 0 or -1 votes:

+1 (13):
Patrick Wendell*
Marcelo Vanzin
Krishna Sankar
Sean Owen*
Matei Zaharia*
Sandy Ryza
Tom Graves*
Sean McNamara*
Denny Lee
Kostas Sakellis
Joseph Bradley*
Corey Nolet
GuoQiang Li

0:
-1:

I will finalize the release notes and packaging and will post the
release in the next two days.

- Patrick

On Mon, Mar 9, 2015 at 11:51 PM, GuoQiang Li  wrote:
> I'm sorry, this is my mistake. :)
>
>
> -- Original Message ------
> From: "Patrick Wendell";
> Sent: Tuesday, March 10, 2015, 2:20 PM
> To: "GuoQiang Li";
> Subject: Re: [VOTE] Release Apache Spark 1.3.0 (RC3)
>
> Thanks! But please e-mail the dev list and not just me personally :)
>
> On Mon, Mar 9, 2015 at 11:08 PM, GuoQiang Li  wrote:
>> +1 (non-binding)
>>
>> Test on Mac OS X 10.10.2 and CentOS 6.5
>>
>>
>> -- Original --
>> From:  "Patrick Wendell";;
>> Date:  Fri, Mar 6, 2015 10:52 AM
>> To:  "dev@spark.apache.org";
>> Subject:  [VOTE] Release Apache Spark 1.3.0 (RC3)
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.3.0!
>>
>> The tag to be voted on is v1.3.0-rc2 (commit 4aaf48d4):
>>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4aaf48d46d13129f0f9bdafd771dd80fe568a7dc
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.3.0-rc3/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> Staging repositories for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1078
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.3.0-rc3-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.3.0!
>>
>> The vote is open until Monday, March 09, at 02:52 UTC and passes if
>> a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.3.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == How does this compare to RC2 ==
>> This release includes the following bug fixes:
>>
>> https://issues.apache.org/jira/browse/SPARK-6144
>> https://issues.apache.org/jira/browse/SPARK-6171
>> https://issues.apache.org/jira/browse/SPARK-5143
>> https://issues.apache.org/jira/browse/SPARK-6182
>> https://issues.apache.org/jira/browse/SPARK-6175
>>
>> == How can I help test this release? ==
>> If you are a Spark user, you can help us test this release by
>> taking a Spark 1.2 workload and running on this release candidate,
>> then reporting any regressions.
>>
>> If you are happy with this release based on your own testing, give a +1
>> vote.
>>
>> == What justifies a -1 vote for this release? ==
>> This vote is happening towards the end of the 1.3 QA period,
>> so -1 votes should only occur for significant regressions from 1.2.1.
>> Bugs already present in 1.2.X, minor regressions, or bugs related
>> to new features will not block this release.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: enum-like types in Spark

2015-03-09 Thread Patrick Wendell
Does this matter for our own internal types in Spark? I don't think
any of these types are designed to be used in RDD records, for
instance.
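
For background, the instability discussed below is easy to demonstrate
with any JDK enum: an enum's hashCode is identity-based and so can vary
across JVMs, while its name() is stable. A minimal sketch, not Spark
code:

import java.util.concurrent.TimeUnit

object EnumKeySketch {
  def main(args: Array[String]): Unit = {
    val unit = TimeUnit.SECONDS
    println(unit.hashCode)  // identity hash; can differ from JVM to JVM
    println(unit.name())    // "SECONDS"; stable everywhere
  }
}

If an enum really must be used as a key, keying by name() sidesteps
the problem.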

On Mon, Mar 9, 2015 at 6:25 PM, Aaron Davidson  wrote:
> Perhaps the problem with Java enums that was brought up was actually that
> their hashCode is not stable across JVMs, as it depends on the memory
> location of the enum itself.
>
> On Mon, Mar 9, 2015 at 6:15 PM, Imran Rashid  wrote:
>
>> Can you expand on the serde issues w/ java enum's at all?  I haven't heard
>> of any problems specific to enums.  The java object serialization rules
>> seem very clear and it doesn't seem like different jvms should have a
>> choice on what they do:
>>
>>
>> http://docs.oracle.com/javase/6/docs/platform/serialization/spec/serial-arch.html#6469
>>
>> (in a nutshell, serialization must use enum.name())
>>
>> of course there are plenty of ways the user could screw this up(eg. rename
>> the enums, or change their meaning, or remove them).  But then again, all
>> of java serialization has issues w/ serialization the user has to be aware
>> of.  Eg., if we go with case objects, than java serialization blows up if
>> you add another helper method, even if that helper method is completely
>> compatible.
>>
>> Some prior debate in the scala community:
>>
>> https://groups.google.com/d/msg/scala-internals/8RWkccSRBxQ/AN5F_ZbdKIsJ
>>
>> SO post on which version to use in scala:
>>
>>
>> http://stackoverflow.com/questions/1321745/how-to-model-type-safe-enum-types
>>
>> SO post about the macro-craziness people try to add to scala to make them
>> almost as good as a simple java enum:
>> (NB: the accepted answer doesn't actually work in all cases ...)
>>
>>
>> http://stackoverflow.com/questions/20089920/custom-scala-enum-most-elegant-version-searched
>>
>> Another proposal to add better enums built into scala ... but seems to be
>> dormant:
>>
>> https://groups.google.com/forum/#!topic/scala-sips/Bf82LxK02Kk
>>
>>
>>
>> On Thu, Mar 5, 2015 at 10:49 PM, Mridul Muralidharan 
>> wrote:
>>
>> >   I have a strong dislike for java enum's due to the fact that they
>> > are not stable across JVM's - if it undergoes serde, you end up with
>> > unpredictable results at times [1].
>> > One of the reasons why we prevent enum's from being key : though it is
>> > highly possible users might depend on it internally and shoot
>> > themselves in the foot.
>> >
>> > Would be better to keep away from them in general and use something more
>> > stable.
>> >
>> > Regards,
>> > Mridul
>> >
>> > [1] Having had to debug this issue for 2 weeks - I really really hate it.
>> >
>> >
>> > On Thu, Mar 5, 2015 at 1:08 PM, Imran Rashid 
>> wrote:
>> > > I have a very strong dislike for #1 (scala enumerations).   I'm ok with
>> > #4
>> > > (with Xiangrui's final suggestion, especially making it sealed &
>> > available
>> > > in Java), but I really think #2, java enums, are the best option.
>> > >
>> > > Java enums actually have some very real advantages over the other
>> > > approaches -- you get values(), valueOf(), EnumSet, and EnumMap.  There
>> > has
>> > > been endless debate in the Scala community about the problems with the
>> > > approaches in Scala.  Very smart, level-headed Scala gurus have
>> > complained
>> > > about their short-comings (Rex Kerr's name is coming to mind, though
>> I'm
>> > > not positive about that); there have been numerous well-thought out
>> > > proposals to give Scala a better enum.  But the powers-that-be in Scala
>> > > always reject them.  IIRC the explanation for rejecting is basically
>> that
>> > > (a) enums aren't important enough for introducing some new special
>> > feature,
>> > > scala's got bigger things to work on and (b) if you really need a good
>> > > enum, just use java's enum.
>> > >
>> > > I doubt it really matters that much for Spark internals, which is why I
>> > > think #4 is fine.  But I figured I'd give my spiel, because every
>> > developer
>> > > loves language wars :)
>> > >
>> > > Imran
>> > >
>> > >
>> > >
>> > > On Thu, Mar 5, 2015 at 1:35 AM, Xiangrui Meng 
>> wrote:
>> > >

Cross cutting internal changes to launch scripts

2015-03-09 Thread Patrick Wendell
Hey All,

Marcelo Vanzin has been working on a patch for a few months that
performs cross cutting clean-up and fixes to the way that Spark's
launch scripts work (including PySpark, spark submit, the daemon
scripts, etc.). The changes won't modify any public API's in terms of
how those scripts are invoked.

Historically, such patches have been difficult to test due to the
number of interactions between components and interactions with
external environments. I'd like to welcome people to test and/or code
review this patch in their own environment. This patch is the in the
very late stages of review and will likely be merged soon into master
(eventually 1.4).

https://github.com/apache/spark/pull/3916/files

I'll ping this thread again once it is merged and we can establish a
JIRA to encapsulate any issues. Just wanted to give a heads up as this
is one of the larger internal changes we've made to this
infrastructure since Spark 1.0

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Updated] (SPARK-6050) Spark on YARN does not work --executor-cores is specified

2015-03-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6050:
---
Fix Version/s: (was: 1.4.0)

> Spark on YARN does not work --executor-cores is specified
> -
>
> Key: SPARK-6050
> URL: https://issues.apache.org/jira/browse/SPARK-6050
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
> Environment: 2.5 based YARN cluster.
>Reporter: Mridul Muralidharan
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 1.3.0
>
>
> There are multiple issues here (which I will detail as comments), but to 
> reproduce: running the following ALWAYS hangs in our cluster with the 1.3 RC.
> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
> yarn-cluster --executor-cores 8 --num-executors 15 --driver-memory 4g 
> --executor-memory 2g --queue webmap lib/spark-examples*.jar 10



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Patrick Wendell
Hey All,

Today there was a JIRA posted with an observed regression around Spark
Streaming during certain recovery scenarios:

https://issues.apache.org/jira/browse/SPARK-6222

My preference is to go ahead and ship this release (RC3) as-is, and if
this issue is isolated and resolved soon, we can make a patch release in
the next week or two.

At some point, the cost of continuing to hold the release re/vote is
so high that it's better to just ship the release. We can document
known issues and point users to a fix once it's available. We did this
in 1.2.0 as well (there were two small known issues) and I think as a
point of process, this approach is necessary given the size of the
project.

I wanted to notify this thread though, in case this changes anyone's
opinion on their release vote. I will leave the thread open at least
until the end of today.

Still +1 on RC3, for me.

- Patrick

On Mon, Mar 9, 2015 at 9:36 AM, Denny Lee  wrote:
> +1 (non-binding)
>
> Spark Standalone and YARN on Hadoop 2.6 on OSX plus various tests (MLLib,
> SparkSQL, etc.)
>
> On Mon, Mar 9, 2015 at 9:18 AM Tom Graves 
> wrote:
>>
>> +1. Built from source and ran Spark on yarn on hadoop 2.6 in cluster and
>> client mode.
>> Tom
>>
>>  On Thursday, March 5, 2015 8:53 PM, Patrick Wendell
>>  wrote:
>>
>>
>>  Please vote on releasing the following candidate as Apache Spark version
>> 1.3.0!
>>
>> The tag to be voted on is v1.3.0-rc2 (commit 4aaf48d4):
>>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4aaf48d46d13129f0f9bdafd771dd80fe568a7dc
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.3.0-rc3/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> Staging repositories for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1078
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.3.0-rc3-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.3.0!
>>
>> The vote is open until Monday, March 09, at 02:52 UTC and passes if
>> a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.3.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == How does this compare to RC2 ==
>> This release includes the following bug fixes:
>>
>> https://issues.apache.org/jira/browse/SPARK-6144
>> https://issues.apache.org/jira/browse/SPARK-6171
>> https://issues.apache.org/jira/browse/SPARK-5143
>> https://issues.apache.org/jira/browse/SPARK-6182
>> https://issues.apache.org/jira/browse/SPARK-6175
>>
>> == How can I help test this release? ==
>> If you are a Spark user, you can help us test this release by
>> taking a Spark 1.2 workload and running on this release candidate,
>> then reporting any regressions.
>>
>> If you are happy with this release based on your own testing, give a +1
>> vote.
>>
>> == What justifies a -1 vote for this release? ==
>> This vote is happening towards the end of the 1.3 QA period,
>> so -1 votes should only occur for significant regressions from 1.2.1.
>> Bugs already present in 1.2.X, minor regressions, or bugs related
>> to new features will not block this release.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Block Transfer Service encryption support

2015-03-08 Thread Patrick Wendell
I think that yes, longer term we want to have encryption of all
communicated data. However Jeff, can you open a JIRA to discuss the
design before opening a pull request (it's fine to link to a WIP
branch if you'd like)? I'd like to better understand the performance
and operational complexity of using SSL for this in comparison with
alternatives. It would also be good to look at how the Hadoop
encryption works for their shuffle service, in terms of the design
decisions made there.

- Patrick

On Sun, Mar 8, 2015 at 5:42 PM, Jeff Turpin  wrote:
> I have already written most of the code, just finishing up the unit tests
> right now...
>
> Jeff
>
>
> On Sun, Mar 8, 2015 at 5:39 PM, Andrew Ash  wrote:
>
>> I'm interested in seeing this data transfer occurring over encrypted
>> communication channels as well.  Many customers require that all network
>> transfer occur encrypted to prevent the "soft underbelly" that's often
>> found inside a corporate network.
>>
>> On Fri, Mar 6, 2015 at 4:20 PM, turp1twin  wrote:
>>
>>> Is there a plan to implement SSL support for the Block Transfer Service
>>> (specifically, the NettyBlockTransferService implementation)? I can
>>> volunteer if needed...
>>>
>>> Jeff
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/Block-Transfer-Service-encryption-support-tp10934.html
>>> Sent from the Apache Spark Developers List mailing list archive at
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles

2015-03-08 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352356#comment-14352356
 ] 

Patrick Wendell commented on SPARK-1239:


It would be helpful if any users who have observed this could comment on the 
JIRA and give workload information. This has been more on the back burner since 
we've heard few reports of it on the mailing list, etc...

> Don't fetch all map output statuses at each reducer during shuffles
> ---
>
> Key: SPARK-1239
> URL: https://issues.apache.org/jira/browse/SPARK-1239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Patrick Wendell
>
> Instead we should modify the way we fetch map output statuses to take both a 
> mapper and a reducer - or we should just piggyback the statuses on each task. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

2015-03-08 Thread Patrick Wendell
I think it's important to separate the goals from the implementation.
I agree with Matei on the goal - I think the goal needs to be to allow
people to download Apache Spark and use it with CDH, HDP, MapR,
whatever... This is the whole reason why HDFS and YARN have stable
API's, so that other projects can build on them in a way that works
across multiple versions. I wouldn't want to force users to upgrade
according only to some vendor timetable, that doesn't seem from the
ASF perspective like a good thing for the project. If users want to
get packages from Bigtop, or the vendors, that's totally fine too.

My point earlier was - I am not sure we are actually accomplishing
that goal now, because I've heard in some cases our "Hadoop 2.X"
packages actually don't work on certain distributions, even those that
are based on that Hadoop version. So one solution is to move towards
"bring your own Hadoop" binaries and have users just set HADOOP_HOME
and maybe document any vendor-specific configs that need to be set.
That also happens to solve the "too many binaries" problem, but only
incidentally.

- Patrick

On Sun, Mar 8, 2015 at 4:07 PM, Matei Zaharia  wrote:
> Our goal is to let people use the latest Apache release even if vendors fall 
> behind or don't want to package everything, so that's why we put out releases 
> for vendors' versions. It's fairly low overhead.
>
> Matei
>
>> On Mar 8, 2015, at 5:56 PM, Sean Owen  wrote:
>>
>> Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball
>> at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the
>> Maven artifacts.
>>
>> Patrick I see you just commented on SPARK-5134 and will follow up
>> there. Sounds like this may accidentally not be a problem.
>>
>> On binary tarball releases, I wonder if anyone has an opinion on my
>> opinion that these shouldn't be distributed for specific Hadoop
>> *distributions* to begin with. (Won't repeat the argument here yet.)
>> That resolves this n x m explosion too.
>>
>> Vendors already provide their own distribution, yes, that's their job.
>>
>>
>> On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar  wrote:
>>> Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop
>>> Distributions X ...
>>>
>>> May be one option is to have a minimum basic set (which I know is what we
>>> are discussing) and move the rest to spark-packages.org. There the vendors
>>> can add the latest downloads - for example when 1.4 is released, HDP can
>>> build a release of HDP Spark 1.4 bundle.
>>>
>>> Cheers
>>> 
>>>
>>> On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell  wrote:
>>>>
>>>> We probably want to revisit the way we do binaries in general for
>>>> 1.4+. IMO, something worth forking a separate thread for.
>>>>
>>>> I've been hesitating to add new binaries because people
>>>> (understandably) complain if you ever stop packaging older ones, but
>>>> on the other hand the ASF has complained that we have too many
>>>> binaries already and that we need to pare it down because of the large
>>>> volume of files. Doubling the number of binaries we produce for Scala
>>>> 2.11 seemed like it would be too much.
>>>>
>>>> One solution potentially is to actually package "Hadoop provided"
>>>> binaries and encourage users to use these by simply setting
>>>> HADOOP_HOME, or have instructions for specific distros. I've heard
>>>> that our existing packages don't work well on HDP for instance, since
>>>> there are some configuration quirks that differ from the upstream
>>>> Hadoop.
>>>>
>>>> If we cut down on the cross building for Hadoop versions, then it is
>>>> more tenable to cross build for Scala versions without exploding the
>>>> number of binaries.
>>>>
>>>> - Patrick
>>>>
>>>> On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen  wrote:
>>>>> Yeah, interesting question of what is the better default for the
>>>>> single set of artifacts published to Maven. I think there's an
>>>>> argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
>>>>> and cons discussed more at
>>>>>
>>>>> https://issues.apache.org/jira/browse/SPARK-5134
>>>>> https://github.com/apache/spark/pull/3917
>>>>>
>>>>> On Sun, Mar 8, 2015 at 7:42 PM, Matei 

[jira] [Commented] (SPARK-5134) Bump default Hadoop version to 2+

2015-03-08 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352341#comment-14352341
 ] 

Patrick Wendell commented on SPARK-5134:


[~shivaram] did it end up working alright if you just excluded Spark's Hadoop 
dependency? If so we can just document this.

> Bump default Hadoop version to 2+
> -
>
> Key: SPARK-5134
> URL: https://issues.apache.org/jira/browse/SPARK-5134
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> [~srowen] and I discussed bumping [the default hadoop version in the parent 
> POM|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122]
>  from {{1.0.4}} to something more recent.
> There doesn't seem to be a good reason that it was set/kept at {{1.0.4}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5134) Bump default Hadoop version to 2+

2015-03-08 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352268#comment-14352268
 ] 

Patrick Wendell edited comment on SPARK-5134 at 3/8/15 11:27 PM:
-

Hey [~rdub] [~srowen],

As part of the 1.3 release cycle I did some more forensics on the actual 
artifacts we publish. It turns out that because of the changes made for Scala 
2.11 with the way our publishing works, we've actually been publishing poms 
that link against Hadoop 2.2 as of Spark 1.2. And in general, the published pom 
Hadoop version is decoupled now from the default one in the build itself, 
because of our use of the effective pom plugin.

https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L119

I'm actually a bit bummed that we (unintentionally) made this change in 1.2 
because I do fear it likely screwed things up for some users.

But on the plus side, since we now decouple the publishing from the default 
version in the pom, I don't see a big issue with updating the POM. So I 
withdraw my objection on the PR.


was (Author: pwendell):
Hey [~rdub] [~srowen],

As part of the 1.3 release cycle I did some more forensics on the actual 
artifacts we publish. It turns out that because of the changes made for Scala 
2.11 with the way our publishing works, we've actually been publishing poms 
that link against Hadoop 2.2 as of Spark 1.2. And in general, the published pom 
Hadoop version is decoupled now from the default one in the build itself, 
because of our use of the effective pom plugin.

https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L119

I'm actually a bit bummed that we (unintentionally) made this change in 1.2 
because I do fear it likely screwed things up for some users.

But on the plus side, since we no decouple the publishing from the default 
version in the pom, I don't see a big issue with updating the POM. So I 
withdraw my objection on the PR.

> Bump default Hadoop version to 2+
> -
>
> Key: SPARK-5134
> URL: https://issues.apache.org/jira/browse/SPARK-5134
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> [~srowen] and I discussed bumping [the default hadoop version in the parent 
> POM|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122]
>  from {{1.0.4}} to something more recent.
> There doesn't seem to be a good reason that it was set/kept at {{1.0.4}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5134) Bump default Hadoop version to 2+

2015-03-08 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352268#comment-14352268
 ] 

Patrick Wendell commented on SPARK-5134:


Hey [~rdub] [~srowen],

As part of the 1.3 release cycle I did some more forensics on the actual 
artifacts we publish. It turns out that because of the changes made for Scala 
2.11 with the way our publishing works, we've actually been publishing poms 
that link against Hadoop 2.2 as of Spark 1.2. And in general, the published pom 
Hadoop version is decoupled now from the default one in the build itself, 
because of our use of the effective pom plugin.

https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L119

I'm actually a bit bummed that we (unintentionally) made this change in 1.2 
because I do fear it likely screwed things up for some users.

But on the plus side, since we no decouple the publishing from the default 
version in the pom, I don't see a big issue with updating the POM. So I 
withdraw my objection on the PR.

> Bump default Hadoop version to 2+
> -
>
> Key: SPARK-5134
> URL: https://issues.apache.org/jira/browse/SPARK-5134
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> [~srowen] and I discussed bumping [the default hadoop version in the parent 
> POM|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122]
>  from {{1.0.4}} to something more recent.
> There doesn't seem to be a good reason that it was set/kept at {{1.0.4}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-08 Thread Patrick Wendell
We probably want to revisit the way we do binaries in general for
1.4+. IMO, something worth forking a separate thread for.

I've been hesitating to add new binaries because people
(understandably) complain if you ever stop packaging older ones, but
on the other hand the ASF has complained that we have too many
binaries already and that we need to pare it down because of the large
volume of files. Doubling the number of binaries we produce for Scala
2.11 seemed like it would be too much.

One solution potentially is to actually package "Hadoop provided"
binaries and encourage users to use these by simply setting
HADOOP_HOME, or have instructions for specific distros. I've heard
that our existing packages don't work well on HDP for instance, since
there are some configuration quirks that differ from the upstream
Hadoop.

If we cut down on the cross building for Hadoop versions, then it is
more tenable to cross build for Scala versions without exploding the
number of binaries.

- Patrick

On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen  wrote:
> Yeah, interesting question of what is the better default for the
> single set of artifacts published to Maven. I think there's an
> argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
> and cons discussed more at
>
> https://issues.apache.org/jira/browse/SPARK-5134
> https://github.com/apache/spark/pull/3917
>
> On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia  wrote:
>> +1
>>
>> Tested it on Mac OS X.
>>
>> One small issue I noticed is that the Scala 2.11 build is using Hadoop 1 
>> without Hive, which is kind of weird because people will more likely want 
>> Hadoop 2 with Hive. So it would be good to publish a build for that 
>> configuration instead. We can do it if we do a new RC, or it might be that 
>> binary builds may not need to be voted on (I forgot the details there).
>>
>> Matei

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Updated] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods

2015-03-07 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6189:
---
Component/s: DataFrame

> Pandas to DataFrame conversion should check field names for periods
> ---
>
> Key: SPARK-6189
> URL: https://issues.apache.org/jira/browse/SPARK-6189
> Project: Spark
>  Issue Type: Improvement
>  Components: DataFrame, SQL
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Issue I ran into:  I imported an R dataset in CSV format into a Pandas 
> DataFrame and then use toDF() to convert that into a Spark DataFrame.  The R 
> dataset had a column with a period in it (column "GNP.deflator" in the 
> "longley" dataset).  When I tried to select it using the Spark DataFrame DSL, 
> I could not because the DSL thought the period was selecting a field within 
> GNP.
> Also, since "GNP" is another field's name, it gives an error which could be 
> obscure to users, complaining:
> {code}
> org.apache.spark.sql.AnalysisException: GetField is not valid on fields of 
> type DoubleType;
> {code}
> We should either handle periods in column names or check during loading and 
> warn/fail gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6208) executor-memory does not work when using local cluster

2015-03-07 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6208:
---
Issue Type: New Feature  (was: Bug)

> executor-memory does not work when using local cluster
> --
>
> Key: SPARK-6208
> URL: https://issues.apache.org/jira/browse/SPARK-6208
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Submit
>Reporter: Yin Huai
>
> Seems executor memory set with a local cluster is not correctly set (see 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L377).
>  Also, totalExecutorCores seems has the same issue 
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L379).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6208) executor-memory does not work when using local cluster

2015-03-07 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14351951#comment-14351951
 ] 

Patrick Wendell commented on SPARK-6208:


I'm not sure we've ever tested much using spark-submit with local-cluster. 
That's really an internal thing for testing, so I'm gonna re-categorize this as 
a feature request. [~yhuai] are you using that for some local experiments or 
something?
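
For reference, the kind of invocation this is about looks roughly like the
following (class and jar names are placeholders; local-cluster takes the form
local-cluster[numWorkers,coresPerWorker,memoryPerWorkerMB]):

{code}
# Illustration only: submit against the internal local-cluster master.
# Per this report, --executor-memory is not applied to the spawned executors.
./bin/spark-submit \
  --master "local-cluster[2,1,1024]" \
  --executor-memory 2g \
  --class org.example.MyApp \
  my-app.jar
{code}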

> executor-memory does not work when using local cluster
> --
>
> Key: SPARK-6208
> URL: https://issues.apache.org/jira/browse/SPARK-6208
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Reporter: Yin Huai
>
> Seems executor memory set with a local cluster is not correctly set (see 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L377).
>  Also, totalExecutorCores seems has the same issue 
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L379).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6207) YARN secure cluster mode doesn't obtain a hive-metastore token

2015-03-07 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6207:
---
Component/s: SQL

> YARN secure cluster mode doesn't obtain a hive-metastore token 
> ---
>
> Key: SPARK-6207
> URL: https://issues.apache.org/jira/browse/SPARK-6207
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SQL, YARN
>Affects Versions: 1.2.0, 1.3.0, 1.2.1
> Environment: YARN
>Reporter: Doug Balog
>
> When running a spark job, on YARN in secure mode, with "--deploy-mode 
> cluster",  org.apache.spark.deploy.yarn.Client() does not obtain a delegation 
> token to the hive-metastore. Therefore any attempts to talk to the 
> hive-metastore fail with a "GSSException: No valid credentials provided..."
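
For context, the failing pattern described above is roughly the following
(class and jar names are placeholders; the key ingredients are a Kerberized
cluster, cluster deploy mode, and a job that talks to the Hive metastore):

{code}
# Sketch of the reproduction, with placeholder names:
kinit some-user@EXAMPLE.COM
./bin/spark-submit \
  --master yarn-cluster \
  --class org.example.HiveMetastoreJob \
  hive-job.jar
# The job's attempt to reach the metastore then fails with
# "GSSException: No valid credentials provided..."
{code}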



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4123) Show new dependencies added in pull requests

2015-03-07 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4123:
---
Assignee: Brennon York

> Show new dependencies added in pull requests
> 
>
> Key: SPARK-4123
> URL: https://issues.apache.org/jira/browse/SPARK-4123
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>        Reporter: Patrick Wendell
>Assignee: Brennon York
>Priority: Critical
>
> We should inspect the classpath of Spark's assembly jar for every pull 
> request. This only takes a few seconds in Maven and it will help weed out 
> dependency changes from the master branch. Ideally we'd post any dependency 
> changes in the pull request message.
> {code}
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath
> $ git checkout apache/master
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath
> $ diff my-classpath master-classpath
> < chill-java-0.3.6.jar
> < chill_2.10-0.3.6.jar
> ---
> > chill-java-0.5.0.jar
> > chill_2.10-0.5.0.jar
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4123) Show new dependencies added in pull requests

2015-03-07 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14351932#comment-14351932
 ] 

Patrick Wendell commented on SPARK-4123:


Hey [~boyork], sorry for the delay. Are you still interested in doing this one? 
You are right: the current approach requires a Maven install, which won't work 
well on Jenkins because there are multiple pull request builds that share the 
same repository. Unfortunately the Maven "-pl" flag requires an install... it's 
pretty annoying that it can't reason locally about the fact that it's part of a 
multi-project build. One thought I had was that it might be possible to just do 
an mvn install into a local directory that is part of the specific build folder. 
Some local testing revealed that even though Maven supposedly supports setting 
the localRepositoryPath option during installs, it doesn't seem to work.

Anyway, I came up with another way to do it. It's pretty brittle, but it does 
seem to work:

{code}
mvn dependency:build-classpath | grep -A 5 "Building Spark Project Assembly" | 
tail -n 1 | tr ":" "\n" | rev | cut -d "/" -f 1 | rev | sort > pr_path
{code}

I think using this we can make it work. I just tested it with the SPARK-6122 
JIRA and it seemed to work well.

{code}
> diff pr_path master_path 
118,119c118,119
< tachyon-0.6.0.jar
< tachyon-client-0.6.0.jar
---
> tachyon-0.5.0.jar
> tachyon-client-0.5.0.jar
{code}
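
Wiring the two runs together, a sketch of what the Jenkins-side check could
look like (the profile list and temp paths are placeholders, and this assumes
the grep/tail trick above keeps working as modules are added):

{code}
#!/usr/bin/env bash
# Sketch only: compare the assembly classpath of a PR against master.
set -e

assembly_classpath() {
  mvn -Phive -Phadoop-2.4 dependency:build-classpath \
    | grep -A 5 "Building Spark Project Assembly" \
    | tail -n 1 | tr ":" "\n" | rev | cut -d "/" -f 1 | rev | sort
}

assembly_classpath > /tmp/pr_classpath
git checkout apache/master
assembly_classpath > /tmp/master_classpath

# A non-empty diff means the PR changed the set of bundled jars.
diff /tmp/pr_classpath /tmp/master_classpath || true
{code}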

> Show new dependencies added in pull requests
> 
>
> Key: SPARK-4123
> URL: https://issues.apache.org/jira/browse/SPARK-4123
>         Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Patrick Wendell
>Priority: Critical
>
> We should inspect the classpath of Spark's assembly jar for every pull 
> request. This only takes a few seconds in Maven and it will help weed out 
> dependency changes from the master branch. Ideally we'd post any dependency 
> changes in the pull request message.
> {code}
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath
> $ git checkout apache/master
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath
> $ diff my-classpath master-classpath
> < chill-java-0.3.6.jar
> < chill_2.10-0.3.6.jar
> ---
> > chill-java-0.5.0.jar
> > chill_2.10-0.5.0.jar
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5183) Document data source API

2015-03-06 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5183:
---
Priority: Critical  (was: Blocker)

> Document data source API
> 
>
> Key: SPARK-5183
> URL: https://issues.apache.org/jira/browse/SPARK-5183
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Reporter: Yin Huai
>Priority: Critical
>
> We need to document the data types the caller needs to support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5310) Update SQL programming guide for 1.3

2015-03-06 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5310:
---
Priority: Critical  (was: Blocker)

> Update SQL programming guide for 1.3
> 
>
> Key: SPARK-5310
> URL: https://issues.apache.org/jira/browse/SPARK-5310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> We make quite a few changes. We should update the SQL programming guide to 
> reflect these changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6128) Update Spark Streaming Guide for Spark 1.3

2015-03-06 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6128:
---
Priority: Critical  (was: Blocker)

> Update Spark Streaming Guide for Spark 1.3
> --
>
> Key: SPARK-6128
> URL: https://issues.apache.org/jira/browse/SPARK-6128
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> Things to update
> - New Kafka Direct API
> - Python Kafka API
> - Add joins to streaming guide



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-06 Thread Patrick Wendell
For now, I'll just put this as critical. We can discuss the
documentation stuff offline or in another thread.

On Fri, Mar 6, 2015 at 1:36 PM, Sean Owen  wrote:
> Although the problem is small, especially if indeed the essential docs
> changes are following just a couple days behind the final release, I
> mean, why the rush if they're essential? wait a couple days, finish
> them, make the release.
>
> Answer is, I think these changes aren't actually essential given the
> comment from tdas, so: just mark these Critical? (although ... they do
> say they're changes for the 1.3 release, so kind of funny to get to
> them for 1.3.x or 1.4, but that's not important now.)
>
> I thought that Blocker really meant Blocker in this project, as I've
> been encouraged to use it to mean "don't release without this." I
> think we should use it that way. Just thinking of it as "extra
> Critical" doesn't add anything. I don't think Documentation should be
> special-cased as less important, and I don't think there's confusion
> if Blocker means what it says, so I'd 'fix' that way.
>
> If nobody sees the Hive failure I observed, and if we can just zap
> those "Blockers" one way or the other, +1
>
>
> On Fri, Mar 6, 2015 at 9:17 PM, Patrick Wendell  wrote:
>> Sean,
>>
>> The docs are distributed and consumed in a fundamentally different way
>> than Spark code itself. So we've always considered the "deadline" for
>> doc changes to be when the release is finally posted.
>>
>> If there are small inconsistencies with the docs present in the source
>> code for that release tag, IMO that doesn't matter much since we don't
>> even distribute the docs with Spark's binary releases and virtually no
>> one builds and hosts the docs on their own (that I am aware of, at
>> least). Perhaps we can recommend if people want to build the doc
>> sources that they should always grab the head of the most recent
>> release branch, to set expectations accordingly.
>>
>> In the past we haven't considered it worth holding up the release
>> process for the purpose of the docs. It just doesn't make sense since
>> they are consumed "as a service". If we decide to change this
>> convention, it would mean shipping our releases later, since we
>> couldn't pipeline the doc finalization with voting.
>>
>> - Patrick
>>
>> On Fri, Mar 6, 2015 at 11:02 AM, Sean Owen  wrote:
>>> Given the title and tagging, it sounds like there could be some
>>> must-have doc changes to go with what is being released as 1.3. It can
>>> be finished later, and published later, but then the docs source
>>> shipped with the release doesn't match the site, and until then, 1.3
>>> is released without some "must-have" docs for 1.3 on the site.
>>>
>>> The real question to me is: are there any further, absolutely
>>> essential doc changes that need to accompany 1.3 or not?
>>>
>>> If not, just resolve these. If there are, then it seems like the
>>> release has to block on them. If there are some docs that should have
>>> gone in for 1.3, but didn't, but aren't essential, well I suppose it
>>> bears thinking about how to not slip as much work, but it doesn't
>>> block.
>>>
>>> I think Documentation issues certainly can be a blocker and shouldn't
>>> be specially ignored.
>>>
>>>
>>> BTW the UISeleniumSuite issue is a real failure, but I do not think it
>>> is serious: http://issues.apache.org/jira/browse/SPARK-6205  It isn't
>>> a regression from 1.2.x, but only affects tests, and only affects a
>>> subset of build profiles.
>>>
>>>
>>>
>>>
>>> On Fri, Mar 6, 2015 at 6:43 PM, Patrick Wendell  wrote:
>>>> Hey Sean,
>>>>
>>>>> SPARK-5310 Update SQL programming guide for 1.3
>>>>> SPARK-5183 Document data source API
>>>>> SPARK-6128 Update Spark Streaming Guide for Spark 1.3
>>>>
>>>> For these, the issue is that they are documentation JIRA's, which
>>>> don't need to be timed exactly with the release vote, since we can
>>>> update the documentation on the website whenever we want. In the past
>>>> I've just mentally filtered these out when considering RC's. I see a
>>>> few options here:
>>>>
>>>> 1. We downgrade such issues away from Blocker (more clear, but we risk
>>>> losing them in the fray if they really are things we want to have
>>>> before the release is posted).
>>>> 2. We provide a filter to the community that excludes 'Documentation'
>>>> issues and shows all other blockers for 1.3. We can put this on the
>>>> wiki, for instance.
>>>>
>>>> Which do you prefer?
>>>>
>>>> - Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Commented] (SPARK-6154) Build error with Scala 2.11 for v1.3.0-rc2

2015-03-06 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350975#comment-14350975
 ] 

Patrick Wendell commented on SPARK-6154:


Oh, I remember now: we don't support this because there is a dependency conflict 
between the Hive thriftserver's JLine and the JLine version used by Scala 2.11. 
The docs say the following:

"Scala 2.11 support in Spark is experimental and does not support a few 
features. Specifically, Spark’s external Kafka library and JDBC component are 
not yet supported in Scala 2.11 builds."
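
In practice the workaround for 2.11 users is just to leave that profile out.
A rough sketch, treating the exact profiles as illustrative:

{code}
# Switch the build to Scala 2.11, then build without -Phive-thriftserver,
# which is the module that pulls in the conflicting jline.
./dev/change-version-to-2.11.sh
mvn -Dscala-2.11 -Phadoop-2.4 -DskipTests clean package
{code}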

> Build error with Scala 2.11 for v1.3.0-rc2
> --
>
> Key: SPARK-6154
> URL: https://issues.apache.org/jira/browse/SPARK-6154
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.0
>Reporter: Jianshi Huang
>
> Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation 
> failed when -Phive-thriftserver is enabled.
> [info] Compiling 9 Scala sources to 
> /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes...
> [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:25: object ConsoleReader is not a member of package jline
> [error] import jline.{ConsoleReader, History}
> [error]^
> [warn] Class jline.Completor not found - continuing with a stub.
> [warn] Class jline.ConsoleReader not found - continuing with a stub.
> [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:165: not found: type ConsoleReader
> [error] val reader = new ConsoleReader()
> Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6154) Build error with Scala 2.11 for v1.3.0-rc2

2015-03-06 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350915#comment-14350915
 ] 

Patrick Wendell commented on SPARK-6154:


Can you give the exact set of flags you are passing to maven when building?

> Build error with Scala 2.11 for v1.3.0-rc2
> --
>
> Key: SPARK-6154
> URL: https://issues.apache.org/jira/browse/SPARK-6154
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.0
>Reporter: Jianshi Huang
>
> Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation 
> failed when -Phive-thriftserver is enabled.
> [info] Compiling 9 Scala sources to 
> /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes...
> [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:25: object ConsoleReader is not a member of package jline
> [error] import jline.{ConsoleReader, History}
> [error]^
> [warn] Class jline.Completor not found - continuing with a stub.
> [warn] Class jline.ConsoleReader not found - continuing with a stub.
> [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:165: not found: type ConsoleReader
> [error] val reader = new ConsoleReader()
> Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6154) Build error with Scala 2.11 for v1.3.0-rc2

2015-03-06 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6154:
---
Component/s: (was: SQL)
 Build

> Build error with Scala 2.11 for v1.3.0-rc2
> --
>
> Key: SPARK-6154
> URL: https://issues.apache.org/jira/browse/SPARK-6154
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.0
>Reporter: Jianshi Huang
>
> Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation 
> failed when -Phive-thriftserver is enabled.
> [info] Compiling 9 Scala sources to 
> /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes...
> [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:25: object ConsoleReader is not a member of package jline
> [error] import jline.{ConsoleReader, History}
> [error]^
> [warn] Class jline.Completor not found - continuing with a stub.
> [warn] Class jline.ConsoleReader not found - continuing with a stub.
> [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:165: not found: type ConsoleReader
> [error] val reader = new ConsoleReader()
> Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-06 Thread Patrick Wendell
Sean,

The docs are distributed and consumed in a fundamentally different way
than Spark code itself. So we've always considered the "deadline" for
doc changes to be when the release is finally posted.

If there are small inconsistencies with the docs present in the source
code for that release tag, IMO that doesn't matter much since we don't
even distribute the docs with Spark's binary releases and virtually no
one builds and hosts the docs on their own (that I am aware of, at
least). Perhaps we can recommend if people want to build the doc
sources that they should always grab the head of the most recent
release branch, to set expectations accordingly.

In the past we haven't considered it worth holding up the release
process for the purpose of the docs. It just doesn't make sense since
they are consumed "as a service". If we decide to change this
convention, it would mean shipping our releases later, since we
couldn't pipeline the doc finalization with voting.

- Patrick

On Fri, Mar 6, 2015 at 11:02 AM, Sean Owen  wrote:
> Given the title and tagging, it sounds like there could be some
> must-have doc changes to go with what is being released as 1.3. It can
> be finished later, and published later, but then the docs source
> shipped with the release doesn't match the site, and until then, 1.3
> is released without some "must-have" docs for 1.3 on the site.
>
> The real question to me is: are there any further, absolutely
> essential doc changes that need to accompany 1.3 or not?
>
> If not, just resolve these. If there are, then it seems like the
> release has to block on them. If there are some docs that should have
> gone in for 1.3, but didn't, but aren't essential, well I suppose it
> bears thinking about how to not slip as much work, but it doesn't
> block.
>
> I think Documentation issues certainly can be a blocker and shouldn't
> be specially ignored.
>
>
> BTW the UISeleniumSuite issue is a real failure, but I do not think it
> is serious: http://issues.apache.org/jira/browse/SPARK-6205  It isn't
> a regression from 1.2.x, but only affects tests, and only affects a
> subset of build profiles.
>
>
>
>
> On Fri, Mar 6, 2015 at 6:43 PM, Patrick Wendell  wrote:
>> Hey Sean,
>>
>>> SPARK-5310 Update SQL programming guide for 1.3
>>> SPARK-5183 Document data source API
>>> SPARK-6128 Update Spark Streaming Guide for Spark 1.3
>>
>> For these, the issue is that they are documentation JIRA's, which
>> don't need to be timed exactly with the release vote, since we can
>> update the documentation on the website whenever we want. In the past
>> I've just mentally filtered these out when considering RC's. I see a
>> few options here:
>>
>> 1. We downgrade such issues away from Blocker (more clear, but we risk
>> losing them in the fray if they really are things we want to have
>> before the release is posted).
>> 2. We provide a filter to the community that excludes 'Documentation'
>> issues and shows all other blockers for 1.3. We can put this on the
>> wiki, for instance.
>>
>> Which do you prefer?
>>
>> - Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-06 Thread Patrick Wendell
Hey Sean,

> SPARK-5310 Update SQL programming guide for 1.3
> SPARK-5183 Document data source API
> SPARK-6128 Update Spark Streaming Guide for Spark 1.3

For these, the issue is that they are documentation JIRA's, which
don't need to be timed exactly with the release vote, since we can
update the documentation on the website whenever we want. In the past
I've just mentally filtered these out when considering RC's. I see a
few options here:

1. We downgrade such issues away from Blocker (more clear, but we risk
losing them in the fray if they really are things we want to have
before the release is posted).
2. We provide a filter to the community that excludes 'Documentation'
issues and shows all other blockers for 1.3. We can put this on the
wiki, for instance.

Which do you prefer?

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org


