Re: Scala API: simplifying common patterns

2016-02-07 Thread sim
24 test failures for sql/test:
https://gist.github.com/ssimeonov/89862967f87c5c497322






RE: Fwd: Writing to jdbc database from SparkR (1.5.2)

2016-02-07 Thread Sun, Rui
This should be solved by your pending PR 
https://github.com/apache/spark/pull/10480, right?

From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Sunday, February 7, 2016 8:50 PM
To: Sun, Rui ; Andrew Holway 
; dev@spark.apache.org
Subject: RE: Fwd: Writing to jdbc database from SparkR (1.5.2)

I mean not exposed from the SparkR API.
Calling it from R without a SparkR API would require either a serializer change 
or a JVM wrapper function.

On Sun, Feb 7, 2016 at 4:47 AM -0800, "Felix Cheung" wrote:
That does but it's a bit hard to call from R since it is not exposed.



On Sat, Feb 6, 2016 at 11:57 PM -0800, "Sun, Rui" wrote:

DataFrameWriter.jdbc() does not work?



From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Sunday, February 7, 2016 9:54 AM
To: Andrew Holway; dev@spark.apache.org
Subject: Re: Fwd: Writing to jdbc database from SparkR (1.5.2)



Unfortunately I couldn't find a simple workaround. It seems to be an issue with
DataFrameWriter.save(), which does not work with the jdbc source/format.



For instance, this does not work in Scala either:

df1.write.format("jdbc").mode("overwrite")
  .option("url", "jdbc:mysql://something.rds.amazonaws.com:3306?user=user=password")
  .option("dbtable", "table")
  .save()



For Spark 1.5.x, it seems the best option would be to write a JVM wrapper and 
call it from R.



_
From: Andrew Holway
Sent: Saturday, February 6, 2016 11:22 AM
Subject: Fwd: Writing to jdbc database from SparkR (1.5.2)
To:

Hi,



I have a thread on u...@spark.apache.org but I 
think this might require developer attention.



I'm reading data from a database, and this is working well.

> df <- read.df(sqlContext, source="jdbc", 
> url="jdbc:mysql://database.foo.eu-west-1.rds.amazonaws.com:3306?user=user=pass")



When I try to write something back to the DB, I see the following error:



> write.df(fooframe, path="NULL", source="jdbc", 
> url="jdbc:mysql://database.foo.eu-west-1.rds.amazonaws.com:3306?user=user=pass",
>  dbtable="db.table", mode="append")



16/02/06 19:05:43 ERROR RBackendHandler: save on 2 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  java.lang.RuntimeException: org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:200)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1855)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:132)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:79)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
at io.netty.channel.SimpleChannelIn



Any ideas on a workaround?



Thanks,



Andrew




RE: Fwd: Writing to jdbc database from SparkR (1.5.2)

2016-02-07 Thread Felix Cheung
I mean not exposed from the SparkR API.
Calling it from R without a SparkR API would require either a serializer change 
or a JVM wrapper function.



On Sun, Feb 7, 2016 at 4:47 AM -0800, "Felix Cheung" wrote:





That does but it's a bit hard to call from R since it is not exposed.






On Sat, Feb 6, 2016 at 11:57 PM -0800, "Sun, Rui"  wrote:





DataFrameWriter.jdbc() does not work?

From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Sunday, February 7, 2016 9:54 AM
To: Andrew Holway ; dev@spark.apache.org
Subject: Re: Fwd: Writing to jdbc database from SparkR (1.5.2)

Unfortunately I couldn't find a simple workaround. It seems to be an issue with
DataFrameWriter.save(), which does not work with the jdbc source/format.

For instance, this does not work in Scala either:

df1.write.format("jdbc").mode("overwrite")
  .option("url", "jdbc:mysql://something.rds.amazonaws.com:3306?user=user=password")
  .option("dbtable", "table")
  .save()

For Spark 1.5.x, it seems the best option would be to write a JVM wrapper and 
call it from R.
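
A minimal sketch of what such a wrapper could look like on the JVM side; the
object and method names are hypothetical (not an existing Spark API), and R
would then reach it through SparkR's JVM backend:

// Hypothetical sketch only: illustrative names, not an existing Spark API.
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

object RJdbcWriteHelper {
  def saveToJdbc(df: DataFrame, url: String, table: String,
                 user: String, password: String): Unit = {
    val props = new Properties()
    props.setProperty("user", user)
    props.setProperty("password", password)
    // Use DataFrameWriter.jdbc(), which writes rows directly over JDBC
    // instead of going through the generic save() path that fails for the
    // jdbc source in 1.5.x.
    df.write.mode(SaveMode.Append).jdbc(url, table, props)
  }
}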

_
From: Andrew Holway
Sent: Saturday, February 6, 2016 11:22 AM
Subject: Fwd: Writing to jdbc database from SparkR (1.5.2)
To:

Hi,

I have a thread on u...@spark.apache.org but I 
think this might require developer attention.

I'm reading data from a database, and this is working well.

> df <- read.df(sqlContext, source="jdbc", 
> url="jdbc:mysql://database.foo.eu-west-1.rds.amazonaws.com:3306?user=user=pass")

When I try to write something back to the DB, I see the following error:


> write.df(fooframe, path="NULL", source="jdbc", 
> url="jdbc:mysql://database.foo.eu-west-1.rds.amazonaws.com:3306?user=user=pass",
>  dbtable="db.table", mode="append")



16/02/06 19:05:43 ERROR RBackendHandler: save on 2 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  java.lang.RuntimeException: org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:200)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1855)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:132)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:79)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
at io.netty.channel.SimpleChannelIn



Any ideas on a workaround?



Thanks,



Andrew



Scala API: simplifying common patterns

2016-02-07 Thread sim
The more Spark code I write, the more I hit the same use cases where the
Scala APIs feel a bit awkward. I'd love to understand if there are
historical reasons for these and whether there is opportunity + interest to
improve the APIs. Here are my top two:
1. registerTempTable() returns Unit
def cachedDF(path: String, tableName: String) = {
  val df = sqlContext.read.load(path).cache()
  df.registerTempTable(tableName)
  df
}

// vs.

def cachedDF(path: String, tableName: String) =
  sqlContext.read.load(path).cache().registerTempTable(tableName)

2. No toDF() implicit for creating a DataFrame from an RDD + schema

val schema: StructType = ...
val rdd = sc.textFile(...)
  .map(...)
  .aggregate(...)
val df = sqlContext.createDataFrame(rdd, schema)

// vs.

val schema: StructType = ...
val df = sc.textFile(...)
  .map(...)
  .aggregate(...)
  .toDF(schema)
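
For illustration, a rough sketch of implicit enrichments (hypothetical names,
not part of Spark's API) that would enable the shorter forms above:

// Hypothetical sketch: these enrichments do not exist in Spark today.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.StructType

object ApiSugar {
  implicit class RichDataFrame(val df: DataFrame) extends AnyVal {
    // Like registerTempTable(), but returns the DataFrame so calls can chain.
    def registerTempTableAs(tableName: String): DataFrame = {
      df.registerTempTable(tableName)
      df
    }
  }

  implicit class RichRowRDD(val rdd: RDD[Row]) extends AnyVal {
    // toDF(schema), mirroring the existing toDF() implicit for Product RDDs.
    def toDF(schema: StructType)(implicit sqlContext: SQLContext): DataFrame =
      sqlContext.createDataFrame(rdd, schema)
  }
}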
Have you encountered other examples where small, low-risk API tweaks could
make common use cases more consistent + simpler to code?
/Sim




Re: Scala API: simplifying common patterns

2016-02-07 Thread sim
Sure.






Re: Scala API: simplifying common patterns

2016-02-07 Thread Reynold Xin
Both of these make sense to add. Can you submit a pull request?


On Sun, Feb 7, 2016 at 3:29 PM, sim  wrote:

> The more Spark code I write, the more I hit the same use cases where the
> Scala APIs feel a bit awkward. I'd love to understand if there are
> historical reasons for these and whether there is opportunity + interest to
> improve the APIs. Here are my top two:
> 1. registerTempTable() returns Unit
>
> def cachedDF(path: String, tableName: String) = {
>   val df = sqlContext.read.load(path).cache()
>   df.registerTempTable(tableName)
>   df
> }
>
> // vs.
>
> def cachedDF(path: String, tableName: String) =
>   sqlContext.read.load(path).cache().registerTempTable(tableName)
>
> 2. No toDF() implicit for creating a DataFrame from an RDD + schema
>
> val schema: StructType = ...
> val rdd = sc.textFile(...)
>   .map(...)
>   .aggregate(...)
> val df = sqlContext.createDataFrame(rdd, schema)
>
> // vs.
>
> val schema: StructType = ...
> val df = sc.textFile(...)
>   .map(...)
>   .aggregate(...)
>   .toDF(schema)
>
> Have you encountered other examples where small, low-risk API tweaks could
> make common use cases more consistent + simpler to code?
>
> /Sim
>
>


Re: Scala API: simplifying common patterns

2016-02-07 Thread Reynold Xin
Not 100% sure what's going on, but you can try wiping your local ivy2 and
maven cache.




On Mon, Feb 8, 2016 at 12:05 PM, sim  wrote:

> Reynold, I just forked + built master and I'm getting lots of binary
> compatibility errors when running the tests.
>
> https://gist.github.com/ssimeonov/69cb0b41750be776
>
> Nothing in the dev tools section of the wiki on this. Any advice on how to
> get green before I work on the PRs?
>
> Thanks,
> Sim
>
>
>
>
>


RE: Fwd: Writing to jdbc database from SparkR (1.5.2)

2016-02-07 Thread Felix Cheung
Correct :)



_

From: Sun, Rui
Sent: Sunday, February 7, 2016 5:19 AM
Subject: RE: Fwd: Writing to jdbc database from SparkR (1.5.2)
To: Felix Cheung; Andrew Holway

This should be solved by your pending PR
https://github.com/apache/spark/pull/10480, right?

From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Sunday, February 7, 2016 8:50 PM
To: Sun, Rui; Andrew Holway; dev@spark.apache.org
Subject: RE: Fwd: Writing to jdbc database from SparkR (1.5.2)

I mean not exposed from the SparkR API.
Calling it from R without a SparkR API would require either a serializer
change or a JVM wrapper function.

On Sun, Feb 7, 2016 at 4:47 AM -0800, "Felix Cheung" wrote:

That does but it's a bit hard to call from R since it is not exposed.

On Sat, Feb 6, 2016 at 11:57 PM -0800, "Sun, Rui" wrote:

DataFrameWriter.jdbc() does not work?

From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Sunday, February 7, 2016 9:54 AM
To: Andrew Holway; dev@spark.apache.org
Subject: Re: Fwd: Writing to jdbc database from SparkR (1.5.2)

Unfortunately I couldn't find a simple workaround. It seems to be an issue with
DataFrameWriter.save(), which does not work with the jdbc source/format.

For instance, this does not work in Scala either:

df1.write.format("jdbc").mode("overwrite")
  .option("url", "jdbc:mysql://something.rds.amazonaws.com:3306?user=user=password")
  .option("dbtable", "table")
  .save()

For Spark 1.5.x, it seems the best option would be to write a JVM wrapper and
call it from R.

_

From: Andrew Holway
Sent: Saturday, February 6, 2016 11:22 AM
Subject: Fwd: Writing to jdbc database from SparkR (1.5.2)
To:

Hi,

I have a thread on u...@spark.apache.org but I think this might require
developer attention.

I'm reading data from a database, and this is working well.

> df <- read.df(sqlContext, source="jdbc",
>   url="jdbc:mysql://database.foo.eu-west-1.rds.amazonaws.com:3306?user=user=pass")

When I try to write something back to the DB, I see the following error:

> write.df(fooframe, path="NULL", source="jdbc",
>   url="jdbc:mysql://database.foo.eu-west-1.rds.amazonaws.com:3306?user=user=pass",
>   dbtable="db.table", mode="append")

16/02/06 19:05:43 ERROR RBackendHandler: save on 2 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  java.lang.RuntimeException: org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:200)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1855)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:132)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:79)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
at io.netty.channel.SimpleChannelIn

Any ideas on a workaround?

Thanks,

Andrew

Re: Scala API: simplifying common patterns

2016-02-07 Thread sim
Same result with both caches cleared.






Re: Scala API: simplifying common patterns

2016-02-07 Thread Reynold Xin
Yea I'm not sure what's going on either. You can just run the unit tests
through "build/sbt sql/test" without running mima.


On Mon, Feb 8, 2016 at 3:47 PM, sim  wrote:

> Same result with both caches cleared.
>
>
>
>
>


Re: Preserving partitioning with dataframe select

2016-02-07 Thread Reynold Xin
Matt,

Thanks for the email. Are you just asking whether it should work, or
reporting that it doesn't?

Internally, the way we track physical data distribution should make the
scenarios described work. If they don't, we should make them work.


On Sat, Feb 6, 2016 at 6:49 AM, Matt Cheah  wrote:

> Hi everyone,
>
> When using raw RDDs, it is possible to have a map() operation indicate
> that the partitioning for the RDD would be preserved by the map operation.
> This makes it easier to reduce the overhead of shuffles by ensuring that
> RDDs are co-partitioned when they are joined.
>
> When I'm using Data Frames, I'm pre-partitioning the data frame by using
> DataFrame.partitionBy($"X"), but I will invoke a select statement after the
> partitioning before joining that dataframe with another. Roughly speaking,
> I'm doing something like this pseudo-code:
>
> partitionedDataFrame = dataFrame.partitionBy("$X")
> groupedDataFrame = partitionedDataFrame.groupBy($"X").agg(aggregations)
> // Rename "X" to "Y" to make sure columns are unique
> groupedDataFrameRenamed = groupedDataFrame.withColumnRenamed("X", "Y")
> // Roughly speaking, join on "X == Y" to get the aggregation results onto
> every row
> joinedDataFrame = partitionedDataFrame.join(groupedDataFrameRenamed)
>
> However, the renaming of the columns maps to a select statement, and to my
> knowledge, selecting the columns is throwing off the partitioning, which
> results in shuffling both the partitionedDataFrame and the groupedDataFrame.
>
> I have the following questions given this example:
>
> 1) Is pre-partitioning the Data Frame effective? In other words, does the
> physical planner recognize when underlying RDDs are co-partitioned and
> compute more efficient joins by reducing the amount of data that is
> shuffled?
> 2) If the planner takes advantage of co-partitioning, is the renaming of
> the columns invalidating the partitioning of the grouped Data Frame? When I
> look at the planner's conversion from logical.Project to the physical plan,
> I only see it invoking child.mapPartitions without specifying the
> preservesPartitioning flag.
>
> Thanks,
>
> -Matt Cheah
>
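
For concreteness, a hedged sketch of the pattern above, assuming the Spark 1.6
DataFrame API (repartition() stands in for partitionBy(), which lives on
DataFrameWriter, and the aggregated column names are placeholders):

// Illustrative sketch of the repartition / groupBy / rename / join pattern.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def joinWithGroupAggregates(dataFrame: DataFrame): DataFrame = {
  // Pre-partition by the join key X so both sides share the same distribution.
  val partitioned = dataFrame.repartition(col("X"))
  val grouped = partitioned.groupBy(col("X")).agg(sum(col("value")).as("total"))
  // Rename the key on the aggregated side so the join columns stay unique;
  // the question is whether this select/rename drops the co-partitioning.
  val renamed = grouped.withColumnRenamed("X", "Y")
  partitioned.join(renamed, col("X") === col("Y"))
}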