[GitHub] spark issue #21944: [SPARK-24988][SQL]Add a castBySchema method which casts ...

2018-08-02 Thread mahmoudmahdi24
Github user mahmoudmahdi24 commented on the issue:

https://github.com/apache/spark/pull/21944
  
Closed the PR. This might be a useful trick, but we want to avoid adding 
many methods to the API. We'll reopen this if many users ask for it.


---




[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...

2018-08-02 Thread mahmoudmahdi24
Github user mahmoudmahdi24 closed the pull request at:

https://github.com/apache/spark/pull/21944


---




[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...

2018-08-02 Thread mahmoudmahdi24
Github user mahmoudmahdi24 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21944#discussion_r207250060
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1367,6 +1367,22 @@ class Dataset[T] private[sql](
 }: _*)
   }
 
+  /**
+   * Casts all the values of the current Dataset following the types of a specific StructType.
+   * This method works also with nested structTypes.
+   *
+   *  @group typedrel
+   *  @since 2.4.0
+   */
+  def castBySchema(schema: StructType): DataFrame = {
+    assert(schema.fields.map(_.name).toList.sameElements(this.schema.fields.map(_.name).toList),
+      "schema should have the same fields as the original schema")
+
+    selectExpr(schema.map(
--- End diff --

OK, I understand. Thanks.


---




[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...

2018-08-02 Thread mahmoudmahdi24
Github user mahmoudmahdi24 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21944#discussion_r207209633
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1367,6 +1367,22 @@ class Dataset[T] private[sql](
 }: _*)
   }
 
+  /**
+   * Casts all the values of the current Dataset following the types of a specific StructType.
+   * This method works also with nested structTypes.
+   *
+   *  @group typedrel
+   *  @since 2.4.0
+   */
+  def castBySchema(schema: StructType): DataFrame = {
+    assert(schema.fields.map(_.name).toList.sameElements(this.schema.fields.map(_.name).toList),
+      "schema should have the same fields as the original schema")
+
+    selectExpr(schema.map(
--- End diff --

I am still not convinced we should drop this, since it might be helpful 
for Spark users.
I totally agree about avoiding adding it to the Dataset API, but could we 
add it to the functions object in the org.apache.spark.sql package, for 
example? Thanks


---




[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...

2018-08-02 Thread mahmoudmahdi24
Github user mahmoudmahdi24 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21944#discussion_r207164547
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1367,6 +1367,22 @@ class Dataset[T] private[sql](
 }: _*)
   }
 
+  /**
+   * Casts all the values of the current Dataset following the types of a specific StructType.
+   * This method works also with nested structTypes.
+   *
+   *  @group typedrel
+   *  @since 2.4.0
+   */
+  def castBySchema(schema: StructType): DataFrame = {
+    assert(schema.fields.map(_.name).toList.sameElements(this.schema.fields.map(_.name).toList),
+      "schema should have the same fields as the original schema")
+
+    selectExpr(schema.map(
--- End diff --

@maropu shall we add the @Experimental annotation in that case? What do 
you suggest to make this method less sensitive? Thanks
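For context, Spark marks unstable public APIs with the 
org.apache.spark.annotation.Experimental annotation plus an `:: Experimental ::` 
scaladoc tag. A sketch of how that would look on this method, assuming the 
method itself were accepted:

```
import org.apache.spark.annotation.Experimental

/**
 * :: Experimental ::
 * Casts all the values of the current Dataset following the types of a specific StructType.
 *
 * @group typedrel
 * @since 2.4.0
 */
@Experimental
def castBySchema(schema: StructType): DataFrame = ??? // body as in the diff above
```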


---




[GitHub] spark issue #21961: Spark 20597

2018-08-02 Thread mahmoudmahdi24
Github user mahmoudmahdi24 commented on the issue:

https://github.com/apache/spark/pull/21961
  
Hello @Satyajitv, please rename the title of this PR properly. 
The PR title should be of the form [SPARK-xxxx][COMPONENT] Title, where 
SPARK-xxxx is the relevant JIRA number, COMPONENT is one of the PR categories 
shown at spark-prs.appspot.com, and Title may be the JIRA's title or a more 
specific title describing the PR itself.
Take a look at this helpful document: 
https://spark.apache.org/contributing.html


---




[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...

2018-08-02 Thread mahmoudmahdi24
Github user mahmoudmahdi24 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21944#discussion_r207156078
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1367,6 +1367,22 @@ class Dataset[T] private[sql](
 }: _*)
   }
 
+  /**
+   * Casts all the values of the current Dataset following the types of a specific StructType.
+   * This method works also with nested structTypes.
+   *
+   *  @group typedrel
+   *  @since 2.4.0
+   */
+  def castBySchema(schema: StructType): DataFrame = {
+    assert(schema.fields.map(_.name).toList.sameElements(this.schema.fields.map(_.name).toList),
+      "schema should have the same fields as the original schema")
+
+    selectExpr(schema.map(
--- End diff --

Hello @HyukjinKwon, thanks for your feedback.
Actually, some methods in the current API are one-liners, but they help 
users by showing them what they can do via the API (printSchema, dtypes, 
and columns are perfect examples of that).
Even if it's a one-liner, this method can be helpful since it tells Spark 
users that they can cast a DataFrame by passing a schema.
This method can be very useful in the big data world: we often parse files 
whose values all arrive as strings, and we then have to cast those values 
to match a schema. In my case, my client provides Avro schemas that define 
the column names and types.
I searched a lot for a method that applies a schema to a DataFrame but 
couldn't find one 
(https://stackoverflow.com/questions/51561715/cast-values-of-a-spark-dataframe-using-a-defined-structtype/51562763#51562763).
Even searching on GitHub, I always see examples where people cast each 
column separately (see the sketch below). 
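A minimal sketch of that contrast, for illustration only: the per-column 
style commonly seen versus a single schema-driven select. The DataFrame and 
column names here are hypothetical, and `spark` is assumed to be an active 
SparkSession.

```
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}
import spark.implicits._ // assumes `spark` is an active SparkSession

val df = Seq(("1", "9.99"), ("2", "12.50")).toDF("id", "price")

// The common pattern today: cast each column one by one.
val castedManually = df
  .withColumn("id", col("id").cast(IntegerType))
  .withColumn("price", col("price").cast(DoubleType))

// What a schema-driven cast boils down to: one select built from the target schema.
val targetSchema = StructType(Seq(
  StructField("id", IntegerType, true),
  StructField("price", DoubleType, true)))
val castedBySchema = df.select(targetSchema.fields.map(f =>
  col(f.name).cast(f.dataType)): _*)
```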


---




[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...

2018-08-01 Thread mahmoudmahdi24
GitHub user mahmoudmahdi24 opened a pull request:

https://github.com/apache/spark/pull/21944

[SPARK-24988][SQL]Add a castBySchema method which casts all the values of a 
DataFrame based on the DataTypes of a StructType

## What changes were proposed in this pull request?

The main goal of this pull request is to extend the DataFrame methods in 
order to add a method which casts all the values of a DataFrame based on 
the DataTypes of a StructType.

This feature can be useful when we have a large DataFrame and need to make 
multiple casts. In that case, we don't have to cast each column 
independently; all we have to do is pass a StructType with the types we 
need to the castBySchema method (in real-world use, this schema is 
generally provided by the client, as it was in my case).

Here's an example of how to apply this method (say we have a DataFrame of 
strings and want to cast the values of the second column to Int):
```
// we start by creating the DataFrame
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import spark.implicits._ // needed for toDF

val df = Seq(("test1", "0"), ("test2", "1")).toDF("name", "id")

// we prepare the StructType of the casted DataFrame that we'll obtain
val schema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("id", IntegerType, true)))

// and then we simply use the castBySchema method
val castedDf = df.castBySchema(schema)
```
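The diff excerpts in the review comments above truncate the method body at 
`selectExpr(schema.map(`. A plausible completion, consistent with the 
assertion shown in the diff; the expression-building part is a 
reconstruction, not necessarily the PR's exact code:

```
def castBySchema(schema: StructType): DataFrame = {
  assert(schema.fields.map(_.name).toList.sameElements(this.schema.fields.map(_.name).toList),
    "schema should have the same fields as the original schema")

  // One CAST expression per field; DataType.sql renders nested StructTypes
  // as STRUCT<...>, which is why nested schemas work as well.
  selectExpr(schema.map(field =>
    s"CAST(`${field.name}` AS ${field.dataType.sql}) AS `${field.name}`"): _*)
}
```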


## How was this patch tested?

I modified DataFrameSuite in order to test the newly added method 
(castBySchema).
I first tested the method on a simple DataFrame with a simple schema, then 
on DataFrames with complex schemas (nested StructTypes, for example).




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mahmoudmahdi24/spark SPARK-24988

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21944.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21944


commit b48819e3894e4d2f246fc2dba7db73ad5714757d
Author: mahmoud_mahdi 
Date:   2018-08-01T14:00:22Z

Add a castBySchema method which casts all the values of a DataFrame based 
on the DataTypes of a StructType




---




[GitHub] spark pull request #21120: [SPARK-22448][ML] Added sum function to Summerize...

2018-06-29 Thread mahmoudmahdi24
Github user mahmoudmahdi24 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21120#discussion_r199119223
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -562,6 +573,23 @@ private[ml] object SummaryBuilderImpl extends Logging {
 
   Vectors.dense(currL1)
 }
+
+/**
+ * Sum of each dimension
+ */
+def sum: Vector = {
+  require(requestedMetrics.contains(Sum))
+  require(totalWeightSum > 0, s"Nothing has been added to this summarizer.")
+
+  val realSum = Array.ofDim[Double](n)
+  var i = 0
+  val len = currMean.length
+  while (i < len) {
+    realSum(i) = currMean(i) * weightSum(i)
+    i += 1
--- End diff --

Please avoid using mutable values; use foldLeft, for example, to solve this (see the sketch below).
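A minimal sketch of an immutable alternative to the while loop above, shown 
with zip/map rather than foldLeft since an element-wise product reads more 
naturally that way; assumes currMean and weightSum are equal-length 
Array[Double]:

```
// Element-wise product without mutable state; equivalent to the while loop above.
val realSum: Array[Double] =
  currMean.zip(weightSum).map { case (mean, weight) => mean * weight }
```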


---
