RE: Join implementation in SparkSQL

2015-01-15 Thread Cheng, Hao
Not so sure about your question, but SparkStrategies.scala and Optimizer.scala
are a good starting point if you want the details of the join implementation
or its optimization.
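
If you just want to see which physical operators a query ends up with, a rough
sketch (against the Spark 1.2 SQLContext API; the tables and data below are
invented for illustration) is to print the query execution, which shows the
plan after the Optimizer rules and the SparkStrategies planning:

import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext `sc` (e.g. the spark-shell).
val sqlContext = new SQLContext(sc)
import sqlContext._  // implicit RDD -> SchemaRDD conversion plus sql()

case class Person(id: Int, name: String)
case class Order(personId: Int, amount: Double)
sc.parallelize(Seq(Person(1, "a"), Person(2, "b"))).registerTempTable("people")
sc.parallelize(Seq(Order(1, 10.0), Order(2, 5.0))).registerTempTable("orders")

val joined = sql(
  "SELECT p.name, o.amount FROM people p JOIN orders o ON p.id = o.personId")

// Prints the logical, optimized (Optimizer.scala) and physical
// (SparkStrategies.scala) plans, so you can see which join operator was picked.
println(joined.queryExecution)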

-Original Message-
From: Andrew Ash [mailto:and...@andrewash.com] 
Sent: Friday, January 16, 2015 4:52 AM
To: Reynold Xin
Cc: Alessandro Baretta; dev@spark.apache.org
Subject: Re: Join implementation in SparkSQL

What Reynold is describing is a performance optimization in the implementation,
but the semantics of the join (a cartesian product plus a relational-algebra
filter) should be the same and produce the same results.

On Thu, Jan 15, 2015 at 1:36 PM, Reynold Xin  wrote:

> It's a bunch of strategies defined here:
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
>
> In most common use cases (e.g. inner equi join), filters are pushed 
> below the join or into the join. Doing a cartesian product followed by 
> a filter is too expensive.
>
>
> On Thu, Jan 15, 2015 at 7:39 AM, Alessandro Baretta 
> wrote:
>
> > Hello,
> >
> > Where can I find docs about how joins are implemented in SparkSQL? 
> > In particular, I'd like to know whether they are implemented 
> > according to their relational algebra definition as filters on top 
> > of a cartesian product.
> >
> > Thanks,
> >
> > Alex
> >
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: LinearRegressionWithSGD accuracy

2015-01-15 Thread Devl Devel
It was a bug in the code; however, adding the step parameter got the results
to work: Mean Squared Error = 2.610379825794694E-5

I've also opened a JIRA to add the step parameter to the examples so that
people new to MLlib have a way to improve the MSE.

https://issues.apache.org/jira/browse/SPARK-5273
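
For anyone who lands on this thread later, a minimal sketch of what "adding the
step parameter" looks like (assuming the MLlib 1.2 train overload that takes a
step size and an existing SparkContext sc; the file path is a placeholder):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Placeholder data file: y,x pairs as in the original example.
val parsedData = sc.textFile("data.csv").map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(1).toDouble, Vectors.dense(parts(0).toDouble))
}.cache()

// train(input, numIterations, stepSize) passes the step size directly,
// instead of going through lr.optimizer.setStepSize(...).
val numIterations = 100
val stepSize = 0.0001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

val mse = parsedData.map { p =>
  val err = model.predict(p.features) - p.label
  err * err
}.mean()
println("training Mean Squared Error = " + mse)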

On Thu, Jan 15, 2015 at 8:23 PM, Joseph Bradley 
wrote:

> It looks like you're training on the non-scaled data but testing on the
> scaled data. Have you tried training and testing on only the scaled data?
>
> On Thu, Jan 15, 2015 at 10:42 AM, Devl Devel 
> wrote:
>
>> Thanks, that helps a bit at least with the NaN but the MSE is still very
>> high even with that step size and 10k iterations:
>>
>> training Mean Squared Error = 3.3322561285919316E7
>>
>> Does this method need say 100k iterations?
>>
>>
>>
>>
>>
>>
>> On Thu, Jan 15, 2015 at 5:42 PM, Robin East 
>> wrote:
>>
>> > -dev, +user
>> >
>> > You’ll need to set the gradient descent step size to something small - a
>> > bit of trial and error shows that 0.0001 works.
>> >
>> > You’ll need to create a LinearRegressionWithSGD instance and set the
>> step
>> > size explicitly:
>> >
>> > val lr = new LinearRegressionWithSGD()
>> > lr.optimizer.setStepSize(0.0001)
>> > lr.optimizer.setNumIterations(100)
>> > val model = lr.run(parsedData)
>> >
>> > On 15 Jan 2015, at 16:46, devl.development 
>> > wrote:
>> >
>> > From what I gather, you use LinearRegressionWithSGD to predict y or the
>> > response variable given a feature vector x.
>> >
>> > In a simple example I used a perfectly linear dataset such that x=y
>> > y,x
>> > 1,1
>> > 2,2
>> > ...
>> >
>> > 1,1
>> >
>> > Using the out-of-box example from the website (with and without
>> scaling):
>> >
>> > val data = sc.textFile(file)
>> >
>> >val parsedData = data.map { line =>
>> >  val parts = line.split(',')
>> > LabeledPoint(parts(1).toDouble, Vectors.dense(parts(0).toDouble))
>> //y
>> > and x
>> >
>> >}
>> >val scaler = new StandardScaler(withMean = true, withStd = true)
>> >  .fit(parsedData.map(x => x.features))
>> >val scaledData = parsedData
>> >  .map(x =>
>> >  LabeledPoint(x.label,
>> >scaler.transform(Vectors.dense(x.features.toArray))))
>> >
>> >// Building the model
>> >val numIterations = 100
>> >val model = LinearRegressionWithSGD.train(parsedData, numIterations)
>> >
>> >// Evaluate model on training examples and compute training error *
>> > tried using both scaledData and parsedData
>> >val valuesAndPreds = scaledData.map { point =>
>> >  val prediction = model.predict(point.features)
>> >  (point.label, prediction)
>> >}
>> >val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p),
>> 2)}.mean()
>> >println("training Mean Squared Error = " + MSE)
>> >
>> > Both scaled and unscaled attempts give:
>> >
>> > training Mean Squared Error = NaN
>> >
>> > I've even tried x, y+(sample noise from normal with mean 0 and stddev 1)
>> > still comes up with the same thing.
>> >
>> > Is this not supposed to work for x and y or 2 dimensional plots? Is
>> there
>> > something I'm missing or wrong in the code above? Or is there a
>> limitation
>> > in the method?
>> >
>> > Thanks for any advice.
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> >
>> http://apache-spark-developers-list.1001551.n3.nabble.com/LinearRegressionWithSGD-accuracy-tp10127.html
>> > Sent from the Apache Spark Developers List mailing list archive at
>> > Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>> >
>> >
>>
>
>


Re: Spark SQL API changes and stabilization

2015-01-15 Thread Reynold Xin
We can look into some sort of util class in sql.types for general type
inference. In general, many methods in JsonRDD might be useful enough to
extract. Those will probably be marked as DeveloperAPI with less stability
guarantees.

On Thu, Jan 15, 2015 at 12:16 PM, Corey Nolet  wrote:

> Reynold,
>
> One thing I'd like worked into the public portion of the API is the json
> inferencing logic that creates a Set[(String, StructType)] out of
> Map[String,Any]. SPARK-5260 addresses this so that I can use Accumulators
> to infer my schema instead of forcing a map/reduce phase to occur on an RDD
> in order to get the final schema. Do you (or anyone else) see a path
> forward in exposing this to users? A utility class perhaps?
>
> On Thu, Jan 15, 2015 at 1:33 PM, Reynold Xin  wrote:
>
>> Alex,
>>
>> I didn't communicate properly. By "private", I simply meant the expectation
>> that it is not a public API. The plan is still to omit it from the
>> scaladoc/javadoc generation, but no language visibility modifiers will be
>> applied.
>>
>> After 1.3, you will likely no longer need to use things in the sql.catalyst
>> package directly. Programmatically constructing SchemaRDDs is going to be a
>> first-class public API. Data types have already been moved out of the
>> sql.catalyst package and now live in sql.types. They are becoming stable
>> public APIs. When the "data frame" patch is submitted, you will see a
>> public expression library as well. There will be little reason for end users
>> or library developers to hook into things in sql.catalyst. The bravest and
>> most advanced can still use them, with the expectation that they are
>> subject to change.
>>
>>
>>
>>
>>
>> On Thu, Jan 15, 2015 at 7:53 AM, Alessandro Baretta <
>> alexbare...@gmail.com>
>> wrote:
>>
>> > Reynold,
>> >
>> > Thanks for the heads up. In general, I strongly oppose the use of
>> > "private" to restrict access to certain parts of the API, the reason
>> being
>> > that I might find the need to use some of the internals of a library
>> from
>> > my own project. I find that a @DeveloperAPI annotation serves the same
>> > purpose as "private" without imposing unnecessary restrictions: it
>> > discourages people from using the annotated API and reserves the right
>> for
>> > the core developers to change it suddenly in backwards incompatible
>> ways.
>> >
>> > In particular, I would like to express the desire that the APIs to
>> > programmatically construct SchemaRDDs from an RDD[Row] and a StructType
>> > remain public. All the SparkSQL data type objects should be exposed by
>> the
>> > API, and the jekyll build should not hide the docs as it does now.
>> >
>> > Thanks.
>> >
>> > Alex
>> >
>> > On Wed, Jan 14, 2015 at 9:45 PM, Reynold Xin 
>> wrote:
>> >
>> >> Hi Spark devs,
>> >>
>> >> Given the growing number of developers that are building on Spark SQL,
>> we
>> >> would like to stabilize the API in 1.3 so users and developers can be
>> >> confident to build on it. This also gives us a chance to improve the
>> API.
>> >>
>> >> In particular, we are proposing the following major changes. This
>> should
>> >> have no impact for most users (i.e. those running SQL through the JDBC
>> >> client or SQLContext.sql method).
>> >>
>> >> 1. Everything in sql.catalyst package is private to the project.
>> >>
>> >> 2. Redesign SchemaRDD DSL (SPARK-5097): We initially added the DSL for
>> >> SchemaRDD and logical plans in order to construct test cases. We have
>> >> received feedback from a lot of users that the DSL can be incredibly
>> >> powerful. In 1.3, we’d like to refactor the DSL to make it suitable not
>> >> only for constructing test cases but also for everyday data pipelines. The new
>> >> SchemaRDD API is inspired by the data frame concept in Pandas and R.
>> >>
>> >> 3. Reconcile Java and Scala APIs (SPARK-5193): We would like to expose
>> one
>> >> set of APIs that will work for both Java and Scala. The current Java
>> API
>> >> (sql.api.java) does not share any common ancestor with the Scala API.
>> This
>> >> led to high maintenance burden for us as Spark developers and for
>> library
>> >> developers. We propose to eliminate the Java specific API, and simply
>> work
>> >> on the existing Scala API to make it also usable for Java. This will
>> make
>> >> Java a first-class citizen alongside Scala. This effectively means that all
>> >> public
>> >> classes should be usable for both Scala and Java, including SQLContext,
>> >> HiveContext, SchemaRDD, data types, and the aforementioned DSL.
>> >>
>> >>
>> >> Again, this should have no impact on most users since the existing DSL
>> is
>> >> rarely used by end users. However, library developers might need to
>> change
>> >> the import statements because we are moving certain classes around. We
>> >> will
>> >> keep you posted as patches are merged.
>> >>
>> >
>> >
>>
>
>


Re: Join implementation in SparkSQL

2015-01-15 Thread Andrew Ash
What Reynold is describing is a performance optimization in the implementation,
but the semantics of the join (a cartesian product plus a relational-algebra
filter) should be the same and produce the same results.
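
To make that concrete, a small sketch (against the Spark 1.2 SQLContext API; the
case classes, tables and data are invented) showing that the explicit JOIN and
the cartesian-product-plus-WHERE formulation return the same rows:

import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext `sc` (e.g. the spark-shell).
val sqlContext = new SQLContext(sc)
import sqlContext._  // implicit RDD -> SchemaRDD conversion plus sql()

case class Person(id: Int, name: String)
case class Order(personId: Int, amount: Double)
sc.parallelize(Seq(Person(1, "a"), Person(2, "b"))).registerTempTable("people")
sc.parallelize(Seq(Order(1, 10.0), Order(1, 5.0), Order(2, 7.0)))
  .registerTempTable("orders")

// Relational-algebra reading: a cartesian product filtered by the predicate.
val viaFilter = sql(
  "SELECT p.name, o.amount FROM people p, orders o WHERE p.id = o.personId")
// Explicit equi join: same semantics, planned as a proper join operator.
val viaJoin = sql(
  "SELECT p.name, o.amount FROM people p JOIN orders o ON p.id = o.personId")

viaFilter.collect().foreach(println)
viaJoin.collect().foreach(println)  // same rows either way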

On Thu, Jan 15, 2015 at 1:36 PM, Reynold Xin  wrote:

> It's a bunch of strategies defined here:
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
>
> In most common use cases (e.g. inner equi join), filters are pushed below
> the join or into the join. Doing a cartesian product followed by a filter
> is too expensive.
>
>
> On Thu, Jan 15, 2015 at 7:39 AM, Alessandro Baretta  >
> wrote:
>
> > Hello,
> >
> > Where can I find docs about how joins are implemented in SparkSQL? In
> > particular, I'd like to know whether they are implemented according to
> > their relational algebra definition as filters on top of a cartesian
> > product.
> >
> > Thanks,
> >
> > Alex
> >
>


Re: LinearRegressionWithSGD accuracy

2015-01-15 Thread Joseph Bradley
It looks like you're training on the non-scaled data but testing on the
scaled data. Have you tried training and testing on only the scaled data?
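
For concreteness, a rough sketch of what I mean (untested; it reuses the
variable names from your snippet plus the small step size Robin suggested, and
assumes an existing SparkContext sc and a placeholder file path):

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val parsedData = sc.textFile("data.csv").map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(1).toDouble, Vectors.dense(parts(0).toDouble))
}

// Scale once, then use the *same* scaled RDD for both training and evaluation.
val scaler = new StandardScaler(withMean = true, withStd = true)
  .fit(parsedData.map(_.features))
val scaledData = parsedData
  .map(p => LabeledPoint(p.label, scaler.transform(p.features)))
  .cache()

val model = LinearRegressionWithSGD.train(scaledData, 100, 0.0001)

val mse = scaledData.map { p =>
  val err = model.predict(p.features) - p.label
  err * err
}.mean()
println("training Mean Squared Error = " + mse)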

On Thu, Jan 15, 2015 at 10:42 AM, Devl Devel 
wrote:

> Thanks, that helps a bit at least with the NaN but the MSE is still very
> high even with that step size and 10k iterations:
>
> training Mean Squared Error = 3.3322561285919316E7
>
> Does this method need say 100k iterations?
>
>
>
>
>
>
> On Thu, Jan 15, 2015 at 5:42 PM, Robin East 
> wrote:
>
> > -dev, +user
> >
> > You’ll need to set the gradient descent step size to something small - a
> > bit of trial and error shows that 0.0001 works.
> >
> > You’ll need to create a LinearRegressionWithSGD instance and set the step
> > size explicitly:
> >
> > val lr = new LinearRegressionWithSGD()
> > lr.optimizer.setStepSize(0.0001)
> > lr.optimizer.setNumIterations(100)
> > val model = lr.run(parsedData)
> >
> > On 15 Jan 2015, at 16:46, devl.development 
> > wrote:
> >
> > From what I gather, you use LinearRegressionWithSGD to predict y or the
> > response variable given a feature vector x.
> >
> > In a simple example I used a perfectly linear dataset such that x=y
> > y,x
> > 1,1
> > 2,2
> > ...
> >
> > 1,1
> >
> > Using the out-of-box example from the website (with and without scaling):
> >
> > val data = sc.textFile(file)
> >
> >val parsedData = data.map { line =>
> >  val parts = line.split(',')
> > LabeledPoint(parts(1).toDouble, Vectors.dense(parts(0).toDouble)) //y
> > and x
> >
> >}
> >val scaler = new StandardScaler(withMean = true, withStd = true)
> >  .fit(parsedData.map(x => x.features))
> >val scaledData = parsedData
> >  .map(x =>
> >  LabeledPoint(x.label,
> >scaler.transform(Vectors.dense(x.features.toArray))))
> >
> >// Building the model
> >val numIterations = 100
> >val model = LinearRegressionWithSGD.train(parsedData, numIterations)
> >
> >// Evaluate model on training examples and compute training error *
> > tried using both scaledData and parsedData
> >val valuesAndPreds = scaledData.map { point =>
> >  val prediction = model.predict(point.features)
> >  (point.label, prediction)
> >}
> >val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p),
> 2)}.mean()
> >println("training Mean Squared Error = " + MSE)
> >
> > Both scaled and unscaled attempts give:
> >
> > training Mean Squared Error = NaN
> >
> > I've even tried x, y+(sample noise from normal with mean 0 and stddev 1)
> > still comes up with the same thing.
> >
> > Is this not supposed to work for x and y or 2 dimensional plots? Is there
> > something I'm missing or wrong in the code above? Or is there a
> limitation
> > in the method?
> >
> > Thanks for any advice.
> >
> >
> >
> > --
> > View this message in context:
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/LinearRegressionWithSGD-accuracy-tp10127.html
> > Sent from the Apache Spark Developers List mailing list archive at
> > Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
> >
> >
>


Re: Spark SQL API changes and stabilization

2015-01-15 Thread Corey Nolet
Reynold,

One thing I'd like worked into the public portion of the API is the json
inferencing logic that creates a Set[(String, StructType)] out of
Map[String,Any]. SPARK-5260 addresses this so that I can use Accumulators
to infer my schema instead of forcing a map/reduce phase to occur on an RDD
in order to get the final schema. Do you (or anyone else) see a path
forward in exposing this to users? A utility class perhaps?
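
To make the ask concrete, here is a rough, hypothetical sketch of the kind of
utility I have in mind (inferRecordSchema and mergeStructTypes are names I made
up, not existing Spark APIs, and the real JsonRDD logic handles many more cases):

import org.apache.spark.sql._

// Hypothetical per-record inference (the piece I'd like exposed from JsonRDD):
// Map[String, Any] => StructType. Only a few primitive types are handled here.
def inferRecordSchema(record: Map[String, Any]): StructType =
  StructType(record.toSeq.map { case (name, value) =>
    val dt: DataType = value match {
      case _: Int | _: Long     => LongType
      case _: Double | _: Float => DoubleType
      case _: Boolean           => BooleanType
      case _                    => StringType
    }
    StructField(name, dt, nullable = true)
  })

// Hypothetical merge of two inferred schemas: keep the first type seen for a
// field and mark fields that are missing from one side as nullable. (Field
// order is not preserved; a real version would also reconcile conflicting types.)
def mergeStructTypes(a: StructType, b: StructType): StructType =
  StructType((a.fields ++ b.fields).groupBy(_.name).map { case (name, fs) =>
    StructField(name, fs.head.dataType,
      nullable = fs.exists(_.nullable) || fs.size < 2)
  }.toSeq)

// The goal: fold the schema up per record or per partition (or feed it into an
// accumulator) instead of a dedicated map/reduce pass over the RDD, e.g.
//   val schema = jsonRecords.map(inferRecordSchema).reduce(mergeStructTypes)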

On Thu, Jan 15, 2015 at 1:33 PM, Reynold Xin  wrote:

> Alex,
>
> I didn't communicate properly. By "private", I simply meant the expectation
> that it is not a public API. The plan is still to omit it from the
> scaladoc/javadoc generation, but no language visibility modifiers will be
> applied.
>
> After 1.3, you will likely no longer need to use things in the sql.catalyst
> package directly. Programmatically constructing SchemaRDDs is going to be a
> first-class public API. Data types have already been moved out of the
> sql.catalyst package and now live in sql.types. They are becoming stable
> public APIs. When the "data frame" patch is submitted, you will see a
> public expression library as well. There will be little reason for end users
> or library developers to hook into things in sql.catalyst. The bravest and
> most advanced can still use them, with the expectation that they are
> subject to change.
>
>
>
>
>
> On Thu, Jan 15, 2015 at 7:53 AM, Alessandro Baretta  >
> wrote:
>
> > Reynold,
> >
> > Thanks for the heads up. In general, I strongly oppose the use of
> > "private" to restrict access to certain parts of the API, the reason
> being
> > that I might find the need to use some of the internals of a library from
> > my own project. I find that a @DeveloperAPI annotation serves the same
> > purpose as "private" without imposing unnecessary restrictions: it
> > discourages people from using the annotated API and reserves the right
> for
> > the core developers to change it suddenly in backwards incompatible ways.
> >
> > In particular, I would like to express the desire that the APIs to
> > programmatically construct SchemaRDDs from an RDD[Row] and a StructType
> > remain public. All the SparkSQL data type objects should be exposed by
> the
> > API, and the jekyll build should not hide the docs as it does now.
> >
> > Thanks.
> >
> > Alex
> >
> > On Wed, Jan 14, 2015 at 9:45 PM, Reynold Xin 
> wrote:
> >
> >> Hi Spark devs,
> >>
> >> Given the growing number of developers that are building on Spark SQL,
> we
> >> would like to stabilize the API in 1.3 so users and developers can be
> >> confident to build on it. This also gives us a chance to improve the
> API.
> >>
> >> In particular, we are proposing the following major changes. This should
> >> have no impact for most users (i.e. those running SQL through the JDBC
> >> client or SQLContext.sql method).
> >>
> >> 1. Everything in sql.catalyst package is private to the project.
> >>
> >> 2. Redesign SchemaRDD DSL (SPARK-5097): We initially added the DSL for
> >> SchemaRDD and logical plans in order to construct test cases. We have
> >> received feedback from a lot of users that the DSL can be incredibly
> >> powerful. In 1.3, we’d like to refactor the DSL to make it suitable not
> >> only for constructing test cases but also for everyday data pipelines. The new
> >> SchemaRDD API is inspired by the data frame concept in Pandas and R.
> >>
> >> 3. Reconcile Java and Scala APIs (SPARK-5193): We would like to expose
> one
> >> set of APIs that will work for both Java and Scala. The current Java API
> >> (sql.api.java) does not share any common ancestor with the Scala API.
> This
> >> led to high maintenance burden for us as Spark developers and for
> library
> >> developers. We propose to eliminate the Java specific API, and simply
> work
> >> on the existing Scala API to make it also usable for Java. This will
> make
> >> Java a first-class citizen alongside Scala. This effectively means that all
> >> public
> >> classes should be usable for both Scala and Java, including SQLContext,
> >> HiveContext, SchemaRDD, data types, and the aforementioned DSL.
> >>
> >>
> >> Again, this should have no impact on most users since the existing DSL
> is
> >> rarely used by end users. However, library developers might need to
> change
> >> the import statements because we are moving certain classes around. We
> >> will
> >> keep you posted as patches are merged.
> >>
> >
> >
>


Re: LinearRegressionWithSGD accuracy

2015-01-15 Thread Devl Devel
Thanks, that helps a bit at least with the NaN but the MSE is still very
high even with that step size and 10k iterations:

training Mean Squared Error = 3.3322561285919316E7

Does this method need say 100k iterations?






On Thu, Jan 15, 2015 at 5:42 PM, Robin East  wrote:

> -dev, +user
>
> You’ll need to set the gradient descent step size to something small - a
> bit of trial and error shows that 0.0001 works.
>
> You’ll need to create a LinearRegressionWithSGD instance and set the step
> size explicitly:
>
> val lr = new LinearRegressionWithSGD()
> lr.optimizer.setStepSize(0.0001)
> lr.optimizer.setNumIterations(100)
> val model = lr.run(parsedData)
>
> On 15 Jan 2015, at 16:46, devl.development 
> wrote:
>
> From what I gather, you use LinearRegressionWithSGD to predict y or the
> response variable given a feature vector x.
>
> In a simple example I used a perfectly linear dataset such that x=y
> y,x
> 1,1
> 2,2
> ...
>
> 1,1
>
> Using the out-of-box example from the website (with and without scaling):
>
> val data = sc.textFile(file)
>
>val parsedData = data.map { line =>
>  val parts = line.split(',')
> LabeledPoint(parts(1).toDouble, Vectors.dense(parts(0).toDouble)) //y
> and x
>
>}
>val scaler = new StandardScaler(withMean = true, withStd = true)
>  .fit(parsedData.map(x => x.features))
>val scaledData = parsedData
>  .map(x =>
>  LabeledPoint(x.label,
>scaler.transform(Vectors.dense(x.features.toArray))))
>
>// Building the model
>val numIterations = 100
>val model = LinearRegressionWithSGD.train(parsedData, numIterations)
>
>// Evaluate model on training examples and compute training error *
> tried using both scaledData and parsedData
>val valuesAndPreds = scaledData.map { point =>
>  val prediction = model.predict(point.features)
>  (point.label, prediction)
>}
>val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
>println("training Mean Squared Error = " + MSE)
>
> Both scaled and unscaled attempts give:
>
> training Mean Squared Error = NaN
>
> I've even tried x, y+(sample noise from normal with mean 0 and stddev 1)
> still comes up with the same thing.
>
> Is this not supposed to work for x and y or 2 dimensional plots? Is there
> something I'm missing or wrong in the code above? Or is there a limitation
> in the method?
>
> Thanks for any advice.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/LinearRegressionWithSGD-accuracy-tp10127.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
>


Re: Graphx TripletFields written in Java?

2015-01-15 Thread Reynold Xin
The static fields - Scala can't express JVM static fields, unfortunately.
Those will be important once we provide the Java API.
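
For context, those statics are the predefined constants (None, EdgeOnly, Src,
Dst, All) that callers hand to aggregateMessages to declare which triplet
fields they actually read. A small sketch of how they are used from Scala (the
graph here is made up):

import org.apache.spark.graphx.{Edge, Graph, TripletFields}

// Assumes an existing SparkContext `sc`; a tiny made-up graph.
val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(1L, 3L, 1), Edge(2L, 3L, 1)))
val graph = Graph(vertices, edges)

// The third argument is one of the TripletFields constants; None tells GraphX
// it does not need to ship any vertex attributes for this aggregation.
val inDegrees = graph.aggregateMessages[Int](
  ctx => ctx.sendToDst(1),  // only the edge direction is used
  _ + _,
  TripletFields.None)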



On Thu, Jan 15, 2015 at 8:58 AM, Jay Hutfles  wrote:

> Hi all,
>   Does anyone know the reasoning behind implementing
> org.apache.spark.graphx.TripletFields in Java instead of Scala?  It doesn't
> look like there's anything in there that couldn't be done in Scala.
> Nothing serious, just curious.  Thanks!
>-Jay
>


Graphx TripletFields written in Java?

2015-01-15 Thread Jay Hutfles
Hi all,
  Does anyone know the reasoning behind implementing
org.apache.spark.graphx.TripletFields in Java instead of Scala?  It doesn't
look like there's anything in there that couldn't be done in Scala.
Nothing serious, just curious.  Thanks!
   -Jay


Re: Join implementation in SparkSQL

2015-01-15 Thread Reynold Xin
It's a bunch of strategies defined here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala

In most common use cases (e.g. inner equi join), filters are pushed below
the join or into the join. Doing a cartesian product followed by a filter
is too expensive.


On Thu, Jan 15, 2015 at 7:39 AM, Alessandro Baretta 
wrote:

> Hello,
>
> Where can I find docs about how joins are implemented in SparkSQL? In
> particular, I'd like to know whether they are implemented according to
> their relational algebra definition as filters on top of a cartesian
> product.
>
> Thanks,
>
> Alex
>


Re: Spark SQL API changes and stabilization

2015-01-15 Thread Reynold Xin
Alex,

I didn't communicate properly. By "private", I simply meant the expectation
that it is not a public API. The plan is still to omit it from the
scaladoc/javadoc generation, but no language visibility modifiers will be
applied.

After 1.3, you will likely no longer need to use things in the sql.catalyst
package directly. Programmatically constructing SchemaRDDs is going to be a
first-class public API. Data types have already been moved out of the
sql.catalyst package and now live in sql.types. They are becoming stable
public APIs. When the "data frame" patch is submitted, you will see a
public expression library as well. There will be little reason for end users
or library developers to hook into things in sql.catalyst. The bravest and
most advanced can still use them, with the expectation that they are
subject to change.
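
For library developers this mostly shows up as an import-level change, roughly
(a sketch; the exact class list depends on what you use):

// Spark 1.2 and earlier (internal Catalyst location):
//   import org.apache.spark.sql.catalyst.types.{StructType, StructField, StringType}

// Spark 1.3 onwards (the stable public location):
import org.apache.spark.sql.types.{StructType, StructField, StringType}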





On Thu, Jan 15, 2015 at 7:53 AM, Alessandro Baretta 
wrote:

> Reynold,
>
> Thanks for the heads up. In general, I strongly oppose the use of
> "private" to restrict access to certain parts of the API, the reason being
> that I might find the need to use some of the internals of a library from
> my own project. I find that a @DeveloperAPI annotation serves the same
> purpose as "private" without imposing unnecessary restrictions: it
> discourages people from using the annotated API and reserves the right for
> the core developers to change it suddenly in backwards incompatible ways.
>
> In particular, I would like to express the desire that the APIs to
> programmatically construct SchemaRDDs from an RDD[Row] and a StructType
> remain public. All the SparkSQL data type objects should be exposed by the
> API, and the jekyll build should not hide the docs as it does now.
>
> Thanks.
>
> Alex
>
> On Wed, Jan 14, 2015 at 9:45 PM, Reynold Xin  wrote:
>
>> Hi Spark devs,
>>
>> Given the growing number of developers that are building on Spark SQL, we
>> would like to stabilize the API in 1.3 so users and developers can be
>> confident to build on it. This also gives us a chance to improve the API.
>>
>> In particular, we are proposing the following major changes. This should
>> have no impact for most users (i.e. those running SQL through the JDBC
>> client or SQLContext.sql method).
>>
>> 1. Everything in sql.catalyst package is private to the project.
>>
>> 2. Redesign SchemaRDD DSL (SPARK-5097): We initially added the DSL for
>> SchemaRDD and logical plans in order to construct test cases. We have
>> received feedback from a lot of users that the DSL can be incredibly
>> powerful. In 1.3, we’d like to refactor the DSL to make it suitable not
>> only for constructing test cases but also for everyday data pipelines. The new
>> SchemaRDD API is inspired by the data frame concept in Pandas and R.
>>
>> 3. Reconcile Java and Scala APIs (SPARK-5193): We would like to expose one
>> set of APIs that will work for both Java and Scala. The current Java API
>> (sql.api.java) does not share any common ancestor with the Scala API. This
>> led to high maintenance burden for us as Spark developers and for library
>> developers. We propose to eliminate the Java specific API, and simply work
>> on the existing Scala API to make it also usable for Java. This will make
>> Java a first-class citizen alongside Scala. This effectively means that all
>> public
>> classes should be usable for both Scala and Java, including SQLContext,
>> HiveContext, SchemaRDD, data types, and the aforementioned DSL.
>>
>>
>> Again, this should have no impact on most users since the existing DSL is
>> rarely used by end users. However, library developers might need to change
>> the import statements because we are moving certain classes around. We
>> will
>> keep you posted as patches are merged.
>>
>
>


Re: LinearRegressionWithSGD accuracy

2015-01-15 Thread Robin East
-dev, +user

You’ll need to set the gradient descent step size to something small - a bit of 
trial and error shows that 0.0001 works.

You’ll need to create a LinearRegressionWithSGD instance and set the step size 
explicitly:

val lr = new LinearRegressionWithSGD()
lr.optimizer.setStepSize(0.0001)
lr.optimizer.setNumIterations(100)
val model = lr.run(parsedData)

On 15 Jan 2015, at 16:46, devl.development  wrote:

> From what I gather, you use LinearRegressionWithSGD to predict y or the
> response variable given a feature vector x.
> 
> In a simple example I used a perfectly linear dataset such that x=y
> y,x
> 1,1
> 2,2
> ...
> 
> 1,1
> 
> Using the out-of-box example from the website (with and without scaling):
> 
> val data = sc.textFile(file)
> 
>val parsedData = data.map { line =>
>  val parts = line.split(',')
> LabeledPoint(parts(1).toDouble, Vectors.dense(parts(0).toDouble)) //y
> and x
> 
>}
>val scaler = new StandardScaler(withMean = true, withStd = true)
>  .fit(parsedData.map(x => x.features))
>val scaledData = parsedData
>  .map(x =>
>  LabeledPoint(x.label,
>scaler.transform(Vectors.dense(x.features.toArray))))
> 
>// Building the model
>val numIterations = 100
>val model = LinearRegressionWithSGD.train(parsedData, numIterations)
> 
>// Evaluate model on training examples and compute training error *
> tried using both scaledData and parsedData
>val valuesAndPreds = scaledData.map { point =>
>  val prediction = model.predict(point.features)
>  (point.label, prediction)
>}
>val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
>println("training Mean Squared Error = " + MSE)
> 
> Both scaled and unscaled attempts give:
> 
> training Mean Squared Error = NaN
> 
> I've even tried x, y+(sample noise from normal with mean 0 and stddev 1)
> still comes up with the same thing.
> 
> Is this not supposed to work for x and y or 2 dimensional plots? Is there
> something I'm missing or wrong in the code above? Or is there a limitation
> in the method?
> 
> Thanks for any advice.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/LinearRegressionWithSGD-accuracy-tp10127.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 



LinearRegressionWithSGD accuracy

2015-01-15 Thread devl.development
From what I gather, you use LinearRegressionWithSGD to predict y or the
response variable given a feature vector x.

In a simple example I used a perfectly linear dataset such that x=y
y,x
1,1
2,2
...

1,1

Using the out-of-box example from the website (with and without scaling):

val data = sc.textFile(file)

val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(1).toDouble, Vectors.dense(parts(0).toDouble)) // y and x
}

val scaler = new StandardScaler(withMean = true, withStd = true)
  .fit(parsedData.map(x => x.features))
val scaledData = parsedData
  .map(x =>
    LabeledPoint(x.label,
      scaler.transform(Vectors.dense(x.features.toArray))))

// Building the model
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

// Evaluate model on training examples and compute training error
// * tried using both scaledData and parsedData
val valuesAndPreds = scaledData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
println("training Mean Squared Error = " + MSE)

Both scaled and unscaled attempts give:

training Mean Squared Error = NaN

I've even tried x, y+(sample noise from normal with mean 0 and stddev 1)
still comes up with the same thing.

Is this not supposed to work for x and y or 2 dimensional plots? Is there
something I'm missing or wrong in the code above? Or is there a limitation
in the method?

Thanks for any advice.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/LinearRegressionWithSGD-accuracy-tp10127.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Implementing TinkerPop on top of GraphX

2015-01-15 Thread David Robinson
I am new to Spark and GraphX; however, I use Tinkerpop-backed graphs and I
think using Tinkerpop as the API for GraphX is a great idea, and I hope you
are still headed in that direction.  I noticed that Tinkerpop 3 is
moving into the Apache family:
http://wiki.apache.org/incubator/TinkerPopProposal  which might alleviate
concerns about having an API definition "outside" of Spark.

Thanks,




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Implementing-TinkerPop-on-top-of-GraphX-tp9169p10126.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark SQL API changes and stabilization

2015-01-15 Thread Alessandro Baretta
Reynold,

Thanks for the heads up. In general, I strongly oppose the use of "private"
to restrict access to certain parts of the API, the reason being that I
might find the need to use some of the internals of a library from my own
project. I find that a @DeveloperAPI annotation serves the same purpose as
"private" without imposing unnecessary restrictions: it discourages people
from using the annotated API and reserves the right for the core developers
to change it suddenly in backwards incompatible ways.

In particular, I would like to express the desire that the APIs to
programmatically construct SchemaRDDs from an RDD[Row] and a StructType
remain public. All the SparkSQL data type objects should be exposed by the
API, and the jekyll build should not hide the docs as it does now.
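
Concretely, the pattern I want to keep working is the programmatic-schema
construction from the current docs, roughly (a sketch; the names and data are
made up):

import org.apache.spark.sql._

// Assumes an existing SparkContext `sc`.
val sqlContext = new SQLContext(sc)

// Build the schema and the rows by hand, with no case classes involved.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("city", StringType, nullable = true)))
val rowRDD = sc.parallelize(Seq(Row("alice", "berlin"), Row("bob", "paris")))

// Programmatic construction of a SchemaRDD from an RDD[Row] and a StructType.
val people = sqlContext.applySchema(rowRDD, schema)
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE city = 'berlin'")
  .collect().foreach(println)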

Thanks.

Alex

On Wed, Jan 14, 2015 at 9:45 PM, Reynold Xin  wrote:

> Hi Spark devs,
>
> Given the growing number of developers that are building on Spark SQL, we
> would like to stabilize the API in 1.3 so users and developers can be
> confident to build on it. This also gives us a chance to improve the API.
>
> In particular, we are proposing the following major changes. This should
> have no impact for most users (i.e. those running SQL through the JDBC
> client or SQLContext.sql method).
>
> 1. Everything in sql.catalyst package is private to the project.
>
> 2. Redesign SchemaRDD DSL (SPARK-5097): We initially added the DSL for
> SchemaRDD and logical plans in order to construct test cases. We have
> received feedback from a lot of users that the DSL can be incredibly
> powerful. In 1.3, we’d like to refactor the DSL to make it suitable not
> only for constructing test cases but also for everyday data pipelines. The new
> SchemaRDD API is inspired by the data frame concept in Pandas and R.
>
> 3. Reconcile Java and Scala APIs (SPARK-5193): We would like to expose one
> set of APIs that will work for both Java and Scala. The current Java API
> (sql.api.java) does not share any common ancestor with the Scala API. This
> led to high maintenance burden for us as Spark developers and for library
> developers. We propose to eliminate the Java specific API, and simply work
> on the existing Scala API to make it also usable for Java. This will make
> Java a first-class citizen alongside Scala. This effectively means that all public
> classes should be usable for both Scala and Java, including SQLContext,
> HiveContext, SchemaRDD, data types, and the aforementioned DSL.
>
>
> Again, this should have no impact on most users since the existing DSL is
> rarely used by end users. However, library developers might need to change
> the import statements because we are moving certain classes around. We will
> keep you posted as patches are merged.
>


Join implementation in SparkSQL

2015-01-15 Thread Alessandro Baretta
Hello,

Where can I find docs about how joins are implemented in SparkSQL? In
particular, I'd like to know whether they are implemented according to
their relational algebra definition as filters on top of a cartesian
product.

Thanks,

Alex


Spark 1.2.0: MissingRequirementError

2015-01-15 Thread PierreB
Hi guys,

A few people seem to have the same problem with Spark 1.2.0, so I figured I
would raise it here.

see:
http://apache-spark-user-list.1001560.n3.nabble.com/MissingRequirementError-with-spark-td21149.html

In a nutshell, for sbt test to work we now need to fork a JVM and also give
it more memory to be able to run the tests.

See also:
https://github.com/deanwampler/spark-workshop/blob/master/project/Build.scala
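
For reference, the workaround people are using boils down to something like
this in the sbt build (a sketch in sbt 0.13 syntax; the exact memory settings
are whatever your tests need):

// build.sbt: run tests in a forked JVM and give it more headroom
fork in Test := true
javaOptions in Test ++= Seq("-Xmx2G", "-XX:MaxPermSize=256M")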

This all used to work fine until 1.2.0.

Could you have a look, please?
Thanks

P.




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-1-2-0-MissingRequirementError-tp10123.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Spark client reconnect to driver in yarn-cluster deployment mode

2015-01-15 Thread preeze
From the official Spark documentation
(http://spark.apache.org/docs/1.2.0/running-on-yarn.html):

"In yarn-cluster mode, the Spark driver runs inside an application master
process which is managed by YARN on the cluster, and the client can go away
after initiating the application."

Is there any designed way for the client to connect back to the driver (still
running in YARN) to collect results at a later stage?



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-client-reconnect-to-driver-in-yarn-cluster-deployment-mode-tp10122.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SciSpark: NASA AIST14 proposal

2015-01-15 Thread andy petrella
Hey Chris,

This sounds amazing!
You might also want to check with the Geotrellis team (Rob and Eugene, for
instance), who have already covered quite interesting ground dealing with
tiles as RDD elements.
Some algebra operations are there, but also thingies like Shortest Path
(within rasters).

A small aside: I have a student who is working on an implementation of LU/LC
using Spark (first using CA, then, I hope, an extension using stochastic
methods like local random forests).

If you consider implementing an R-Tree (or perhaps the SD version) for OGIs
operations, I thought that IndexedRDD could be interesting to consider (I've
been asked to look at options to implement this kind of distributed and
resilient R-Tree, so I'll be happy to see how it'd perform ^^).

cheers and have fun!
andy


On Thu Jan 15 2015 at 5:53:27 AM RJ Nowling  wrote:

> Congratulations, Chris!
>
> I created a JIRA for "dimensional" RDDs that might be relevant:
> https://issues.apache.org/jira/browse/SPARK-4727
>
> Jeremy Freeman pointed me to his lab's work on for neuroscience that have
> some related functionality :
> http://thefreemanlab.com/thunder/
>
> On Wed, Jan 14, 2015 at 11:07 PM, Aniket 
> wrote:
>
> > Hi Chris
> >
> > This is super cool. I was wondering if this would be an open source
> project
> > so that people can contribute or reuse?
> >
> > Thanks,
> > Aniket
> >
> > On Thu Jan 15 2015 at 07:39:29 Mattmann, Chris A (3980) [via Apache Spark
> > Developers List]  wrote:
> >
> > > Hi Spark Devs,
> > >
> > > Just wanted to FYI that I was funded on a 2 year NASA proposal
> > > to build out the concept of a scientific RDD (create by space/time,
> > > and other operations) for use in some neat climate related NASA
> > > use cases.
> > >
> > >
> > > http://esto.nasa.gov/files/solicitations/AIST_14/ROSES2014_AIST_A41_awards.html
> > >
> > >
> > > I will keep everyone posted and plan on interacting with the list
> > > over here to get it done. I expect that we’ll start work in March.
> > > In the meanwhile you guys can scope the abstract at the link provided.
> > > Happy
> > > to chat about it if you have any questions too.
> > >
> > > Cheers!
> > >
> > > Chris
> > >
> > > ++
> > > Chris Mattmann, Ph.D.
> > > Chief Architect
> > > Instrument Software and Science Data Systems Section (398)
> > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > Office: 168-519, Mailstop: 168-527
> > > Email: [hidden email]
> > > 
> > > WWW:  http://sunset.usc.edu/~mattmann/
> > > ++
> > > Adjunct Associate Professor, Computer Science Department
> > > University of Southern California, Los Angeles, CA 90089 USA
> > > ++
> > >
> > >
> > >
> > >
> > >
> > > -
> > > To unsubscribe, e-mail: [hidden email]
> > > 
> > > For additional commands, e-mail: [hidden email]
> > > 
> > >
> > >
> >
> >
> >
> >
> > --
> > View this message in context:
> > http://apache-spark-developers-list.1001551.n3.nabble.com/SciSpark-NASA-AIST14-proposal-tp10115p10118.html
> > Sent from the Apache Spark Developers List mailing list archive at
> > Nabble.com.
>