GitHub user mahmoudmahdi24 opened a pull request:
https://github.com/apache/spark/pull/21944
[SPARK-24988][SQL] Add a castBySchema method which casts all the values of a
DataFrame based on the DataTypes of a StructType
## What changes were proposed in this pull request?
The main goal of this pull request is to extend the DataFrame API with a
method that casts all the values of a DataFrame based on the DataTypes of a
StructType.
This feature is useful when a large DataFrame needs many casts at once.
Instead of casting each column individually, we pass a StructType holding the
target types to castBySchema (in real-world cases this schema is often
provided by the client, which was my case).
Here is an example of how to apply this method. Say we have a DataFrame of
strings and we want to cast the values of the second column to Int:
```
// Schema types used below; toDF additionally needs `import spark.implicits._`
// (already in scope in spark-shell)
import org.apache.spark.sql.types._

// Start by creating the DataFrame
val df = Seq(("test1", "0"), ("test2", "1")).toDF("name", "id")
// Prepare the StructType describing the casted DataFrame we want to obtain
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("id", IntegerType, nullable = true)))
// Then simply call the proposed castBySchema method
val castedDf = df.castBySchema(schema)
```
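For comparison, roughly the same result can be obtained today with the existing public API. The following is only a minimal sketch under stated assumptions, not the implementation in this patch: the object name `CastBySchemaSketch` and the standalone helper `castBySchema(df, schema)` are hypothetical, and the helper only selects the top-level columns named in the schema (columns absent from the schema are dropped, and nested fields rely on Spark's own struct-to-struct cast rules).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

object CastBySchemaSketch {
  // Hypothetical helper (not the method added by the PR): cast every
  // top-level column listed in `schema` to that field's DataType,
  // keeping the original column name.
  def castBySchema(df: DataFrame, schema: StructType): DataFrame =
    df.select(schema.fields.map(f => col(f.name).cast(f.dataType).as(f.name)): _*)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("castBySchema-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("test1", "0"), ("test2", "1")).toDF("name", "id")
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("id", IntegerType, nullable = true)))

    val castedDf = castBySchema(df, schema)
    castedDf.printSchema()

    spark.stop()
  }
}
```

As with any Spark cast outside ANSI mode, values that cannot be converted to the target type become null rather than raising an error.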
## How was this patch tested?
I modified DataFrameSuite to test the newly added method (castBySchema).
I first tested the method on a simple DataFrame with a simple schema, then
on DataFrames with complex schemas (nested StructTypes, for example).
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mahmoudmahdi24/spark SPARK-24988
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21944.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21944
----
commit b48819e3894e4d2f246fc2dba7db73ad5714757d
Author: mahmoud_mahdi <mahmoudmahdi24@...>
Date: 2018-08-01T14:00:22Z
Add a castBySchema method which casts all the values of a DataFrame based
on the DataTypes of a StructType
----