[GitHub] spark issue #21944: [SPARK-24988][SQL]Add a castBySchema method which casts ...
Github user mahmoudmahdi24 commented on the issue: https://github.com/apache/spark/pull/21944

Closed the PR. This might be a useful trick, but we want to avoid adding many methods to the API. We'll reopen this in case many users ask for it.

---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...
Github user mahmoudmahdi24 closed the pull request at: https://github.com/apache/spark/pull/21944
[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...
Github user mahmoudmahdi24 commented on a diff in the pull request: https://github.com/apache/spark/pull/21944#discussion_r207250060

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
```
@@ -1367,6 +1367,22 @@ class Dataset[T] private[sql](
     }: _*)
   }

+  /**
+   * Casts all the values of the current Dataset following the types of a specific StructType.
+   * This method works also with nested structTypes.
+   *
+   * @group typedrel
+   * @since 2.4.0
+   */
+  def castBySchema(schema: StructType): DataFrame = {
+    assert(schema.fields.map(_.name).toList.sameElements(this.schema.fields.map(_.name).toList),
+      "schema should have the same fields as the original schema")
+
+    selectExpr(schema.map(
```
--- End diff --

Ok, I understand. Thanks.
[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...
Github user mahmoudmahdi24 commented on a diff in the pull request: https://github.com/apache/spark/pull/21944#discussion_r207209633

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
```
+    selectExpr(schema.map(
```
--- End diff --

I am still not convinced about not adding this, since it might be helpful for Spark users. I totally agree about avoiding adding it to the Dataset API, but can we add it to the functions object in the org.apache.spark.sql package, for example? Thanks
[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...
Github user mahmoudmahdi24 commented on a diff in the pull request: https://github.com/apache/spark/pull/21944#discussion_r207164547

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
```
+    selectExpr(schema.map(
```
--- End diff --

@maropu shall we add the @Experimental annotation in that case? What do you suggest to make this method less sensitive? Thanks
[GitHub] spark issue #21961: Spark 20597
Github user mahmoudmahdi24 commented on the issue: https://github.com/apache/spark/pull/21961

Hello @Satyajitv, please rename the title of this PR properly. The PR title should be of the form [SPARK-][COMPONENT] Title, where SPARK- is the relevant JIRA number, COMPONENT is one of the PR categories shown at spark-prs.appspot.com, and Title may be the JIRA's title or a more specific title describing the PR itself. Take a look at this helpful document: https://spark.apache.org/contributing.html
[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...
Github user mahmoudmahdi24 commented on a diff in the pull request: https://github.com/apache/spark/pull/21944#discussion_r207156078

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
```
+    selectExpr(schema.map(
```
--- End diff --

Hello @HyukjinKwon, thanks for your feedback. Actually, some methods in the current API are one-liners, but they help users by letting them know what they can do via the API (printSchema, dtypes, and columns are perfect examples of that). Even if it's a one-liner, this method can be helpful since it tells Spark users that they can cast a DataFrame by passing a schema. This method can be very useful in the Big Data world; generally, we parse different files whose values are strings, and we have to cast all these values following a schema. In my case, my client provides Avro schemas to define the columns and names. I searched a lot for a method which applies a schema to a DataFrame, but I couldn't find one (https://stackoverflow.com/questions/51561715/cast-values-of-a-spark-dataframe-using-a-defined-structtype/51562763#51562763). Even when searching on GitHub, I always see examples where people cast each value separately.
[GitHub] spark pull request #21944: [SPARK-24988][SQL]Add a castBySchema method which...
GitHub user mahmoudmahdi24 opened a pull request: https://github.com/apache/spark/pull/21944

[SPARK-24988][SQL] Add a castBySchema method which casts all the values of a DataFrame based on the DataTypes of a StructType

## What changes were proposed in this pull request?

The main goal of this pull request is to extend the DataFrame methods by adding a method which casts all the values of a DataFrame based on the DataTypes of a StructType. This feature can be useful when we have a large DataFrame and need to make multiple casts. In that case, we won't have to cast each value independently; all we have to do is pass a StructType with the types we need to the method castBySchema (in real-world examples, this schema is generally provided by the client, as it was in my case).

Here's an example of how to apply this method (let's say that we have a DataFrame of strings, and that we want to cast the values of the second column to Int):

```
// We start by creating the dataframe
val df = Seq(("test1", "0"), ("test2", "1")).toDF("name", "id")

// we prepare the StructType of the casted dataframe that we'll obtain:
val schema = StructType(
  Seq(
    StructField("name", StringType, true),
    StructField("id", IntegerType, true)))

// and then, we simply use the method castBySchema:
val castedDf = df.castBySchema(schema)
```

## How was this patch tested?

I modified DataFrameSuite in order to test the newly added method (castBySchema). I first tested the method on a simple DataFrame with a simple schema, then I tested it on DataFrames with complex schemas (nested StructTypes, for example).
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mahmoudmahdi24/spark SPARK-24988

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21944.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21944

commit b48819e3894e4d2f246fc2dba7db73ad5714757d
Author: mahmoud_mahdi
Date: 2018-08-01T14:00:22Z

    Add a castBySchema method which casts all the values of a DataFrame based on the DataTypes of a StructType
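Since the PR was ultimately closed, the same effect can be obtained with the existing selectExpr API. As a minimal sketch (an assumption about what the truncated `selectExpr(schema.map(` line in the diff goes on to do), the core step is building one SQL CAST expression per schema field; here that step is shown as a pure function over (column name, type name) pairs so it can be read without a SparkSession:

```scala
// Hypothetical helper (not part of the Spark API): build one SQL CAST
// expression per (column name, target type) pair. With Spark available,
// the result would be passed to Dataset.selectExpr to cast every column:
//   df.selectExpr(castExprs(Seq("name" -> "string", "id" -> "int")): _*)
def castExprs(schema: Seq[(String, String)]): Seq[String] =
  schema.map { case (name, tpe) => s"CAST(`$name` AS $tpe) AS `$name`" }
```

The backtick-quoting keeps column names with spaces or dots intact, and re-aliasing each expression back to the original name preserves the DataFrame's column labels.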
[GitHub] spark pull request #21120: [SPARK-22448][ML] Added sum function to Summerize...
Github user mahmoudmahdi24 commented on a diff in the pull request: https://github.com/apache/spark/pull/21120#discussion_r199119223

--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
```
@@ -562,6 +573,23 @@ private[ml] object SummaryBuilderImpl extends Logging {
       Vectors.dense(currL1)
     }

+    /**
+     * Sum of each dimension
+     */
+    def sum: Vector = {
+      require(requestedMetrics.contains(Sum))
+      require(totalWeightSum > 0, s"Nothing has been added to this summarizer.")
+
+      val realSum = Array.ofDim[Double](n)
+      var i = 0
+      val len = currMean.length
+      while (i < len) {
+        realSum(i) = currMean(i) * weightSum(i)
+        i += 1
```
--- End diff --

Please avoid using mutable values; use foldLeft, for example, to solve this.
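The review comment above asks for an immutable alternative to the quoted while-loop. One minimal sketch (an interpretation of that suggestion, not the merged code): the loop computes the element-wise product of currMean and weightSum, which Array.tabulate can express in a single expression with no vars; a foldLeft over indices would work equally well.

```scala
// Immutable rewrite of the quoted loop: realSum(i) = currMean(i) * weightSum(i).
// Array.tabulate builds the result array in one expression, so no mutable
// index variable or pre-allocated buffer is needed.
def weightedSum(currMean: Array[Double], weightSum: Array[Double]): Array[Double] =
  Array.tabulate(currMean.length)(i => currMean(i) * weightSum(i))
```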