I think Hossein does want to implement schema inference for CSV -- then it'd be easy.

Another way you can do this is to use an R data.frame or data.table to read the CSV files in, and then convert that into a Spark DataFrame. It won't be scalable, but it could work.
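A minimal sketch of that local-read route, assuming a SparkR 1.4-style sqlContext and a made-up file name flights.csv:

  # Base R reads the whole file into driver memory, but read.csv
  # infers the column types for us.
  local <- read.csv("flights.csv", stringsAsFactors = FALSE)

  # Ship the typed local data.frame to Spark as a distributed DataFrame;
  # the inferred types carry over into the Spark schema.
  flights <- createDataFrame(sqlContext, local)
  printSchema(flights)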
On Wed, Jun 3, 2015 at 10:49 AM, Eskilson,Aleksander <alek.eskil...@cerner.com> wrote:

> Hi Shivaram,
>
> As far as databricks’ spark-csv API shows, it seems there’s currently
> only support for explicit definition of column types. In JSON we have
> nicely typed fields, but in CSVs, all bets are off. In the SQL version of
> the API, it appears you specify the column types when you create the
> table you’re populating with CSV data.
>
> Thanks for the clarification on individual column casting; I was missing
> the more obvious syntax.
>
> I’ll file a JIRA for resetting the schema after loading a DF.
>
> Thanks,
> Alek
>
> From: Shivaram Venkataraman <shiva...@eecs.berkeley.edu>
> Reply-To: "shiva...@eecs.berkeley.edu" <shiva...@eecs.berkeley.edu>
> Date: Wednesday, June 3, 2015 at 12:29 PM
> To: Aleksander Eskilson <alek.eskil...@cerner.com>
> Cc: "dev@spark.apache.org" <dev@spark.apache.org>, "hoss...@databricks.com" <hoss...@databricks.com>
> Subject: Re: SparkR DataFrame Column Casts esp. from CSV Files
>
> cc Hossein, who knows more about the spark-csv options
>
> You are right that the default CSV reader options end up creating all
> columns as strings. I know that the JSON reader infers the schema [1], but
> I don't know if the CSV reader has any options to do that. Regarding the
> SparkR syntax to cast columns, I think there is a simpler way to do it by
> just assigning to the same column name. For example, I have a flights
> DataFrame with the `year` column typed as string. To cast it to int I just
> use
>
> flights$year <- cast(flights$year, "int")
>
> Now the DataFrame has the same number of columns as before and you don't
> need a selection.
>
> However, this still doesn't address the part about casting multiple
> columns -- could you file a new JIRA to track the need for casting
> multiple columns, or rather being able to set the schema after loading a
> DF?
>
> Thanks
> Shivaram
>
> [1] http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
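Until a schema-reset API exists, one workaround for the multiple-column case is a select() that aliases each cast back to its original name -- just a sketch against the column.R API discussed above, with made-up column names:

  # Rebuild the frame, casting the string columns we care about and
  # keeping their original names via alias().
  flights <- select(flights,
    alias(cast(flights$year,  "int"), "year"),
    alias(cast(flights$month, "int"), "month"),
    flights$carrier)  # untouched columns pass through as-is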
> On Wed, Jun 3, 2015 at 7:51 AM, Eskilson,Aleksander <alek.eskil...@cerner.com> wrote:
>
>> It appears that casting columns remains a bit of a trick in Spark’s
>> DataFrames. This is an issue because tools like spark-csv will set column
>> types to String by default and will not attempt to infer types. Although
>> spark-csv supports specifying types for columns in its options, it’s not
>> clear how that might be integrated into SparkR (when loading the spark-csv
>> package into the R session).
>>
>> Looking at the column.R spec, we can cast a column to a different data
>> type with the cast function [1], but it’s notable that this is not a
>> mutator: it returns a Column object as opposed to a DataFrame. It appears
>> the column cast can only be ‘applied’ by using withColumn() or mutate()
>> (an alias for withColumn).
>>
>> The other way to cast with Spark DataFrames is to write UDFs that
>> operate on a column value and return a coerced value. It looks like SparkR
>> doesn’t have UDFs just yet [2], but it seems they’d be necessary to do
>> a natural one-off column cast in R, something like
>>
>> df.col1toInt <- withColumn(df, "intCol1", udf(df$col1, function(x) as.numeric(x)))
>>
>> (where col1 was originally of ‘character’ type)
>>
>> Currently it seems one has to
>>
>> df.col1cast <- cast(df$col1, "int")
>> df.col1toInt <- withColumn(df, "intCol1", df.col1cast)
>>
>> If we wanted just our casted columns and not the original columns from
>> the data frame, we’d still have to do a select. There was a conversation
>> about CSV files just yesterday. Types are already problematic, but CSVs are
>> a very common data source in R, even at scale.
>>
>> But only being able to coerce one column at a time is really unwieldy.
>> Can the current spark-csv SQL API for specifying types [3] be extended to
>> SparkR? And are there any thoughts on implementing some kind of type
>> inference, perhaps based on a sampling of some number of rows (an
>> implementation I’ve seen before)? R’s read.csv() and read.delim() get
>> types by inferring from the whole file. Something that achieves that
>> functionality, via explicit definition of types or via sampling, will
>> probably be necessary to work with CSV files that have enough columns to
>> merit R at Spark’s scale.
>>
>> Regards,
>> Alek Eskilson
>>
>> [1] https://github.com/apache/spark/blob/master/R/pkg/R/column.R#L190
>> [2] https://issues.apache.org/jira/browse/SPARK-6817
>> [3] https://github.com/databricks/spark-csv#sql-api
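For what it's worth, the SQL flavor of [3] may already be reachable from SparkR by declaring the types in the DDL -- a sketch, assuming the spark-csv package is on the classpath (e.g. via --packages) and a made-up cars.csv:

  # Column names and types here are illustrative, not from a real dataset.
  # spark-csv fills the table from the file using the declared schema.
  sql(sqlContext, "CREATE TEMPORARY TABLE cars
                   (yearMade double, carMake string, carModel string)
                   USING com.databricks.spark.csv
                   OPTIONS (path 'cars.csv', header 'true')")

  cars <- sql(sqlContext, "SELECT * FROM cars")
  printSchema(cars)  # columns come back with the declared types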