Re: NA value handling in sparkR

Devesh Raj Singh Mon, 25 Jan 2016 03:48:07 -0800

Hi,

Yes you are right.


I think the problem is in how the CSV file is read: read.df does not seem to
treat the NA entries in the file as missing values.

So what would be a workable way to deal with NAs in CSV files?
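
One thing that might be worth trying, as an untested sketch: if the "NA" entries really are read as plain strings, casting the affected columns to a numeric type should turn every non-numeric entry into a null, and dropna() could then remove those rows. The sketch assumes the same setup as in my original post, that SPARK_HOME is set (so sparkHome is omitted), and that Ozone and Solar_R are the only columns containing NAs. The column names Ozone_int and Solar_R_int are made up for illustration.

```r
library(SparkR)

# Assumption: same Spark 1.5.x / spark-csv setup as in the original post.
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
sc <- sparkR.init("local")  # assumes SPARK_HOME is set in the environment
sqlContext <- sparkRSQL.init(sc)

path <- "/Users/devesh/work/airquality/"
aq <- read.df(sqlContext, path, source = "com.databricks.spark.csv",
              header = "true", inferSchema = "true")

# If Ozone and Solar_R are listed as string, the "NA" entries came in as text.
printSchema(aq)

# Casting a non-numeric string such as "NA" to integer should produce null.
# New (hypothetical) column names are used so the originals stay untouched.
aq_num <- withColumn(aq, "Ozone_int", cast(aq$Ozone, "integer"))
aq_num <- withColumn(aq_num, "Solar_R_int", cast(aq_num$Solar_R, "integer"))

# With real nulls in place, dropna() should drop the incomplete rows.
head(dropna(aq_num, how = "any"))
```

If printSchema() shows those columns as integer rather than string, this guess is wrong and the cause lies elsewhere.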



On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel <deborah.sie...@gmail.com>
wrote:

> Hi Devesh,
>
> I'm not certain why that's happening, and it looks like it doesn't happen
> if you use createDataFrame directly:
> aq <- createDataFrame(sqlContext,airquality)
> head(dropna(aq,how="any"))
>
> If I had to guess: dropna(), I believe, drops null values. I suppose it's
> possible that createDataFrame converts R's <NA> values to null, so dropna()
> works with that. But perhaps read.df() does not convert the NAs to null, as
> those are most likely interpreted as strings when they come in from the
> CSV. Just a guess; can anyone confirm?
>
> Deb
>
> On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh <raj.deves...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have applied the following code to the airquality dataset available in R,
>> which has some missing values. I want to omit the rows that have NAs.
>>
>> library(SparkR)
>> Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
>>
>> sc <- sparkR.init("local",sparkHome =
>> "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")
>>
>> sqlContext <- sparkRSQL.init(sc)
>>
>> path<-"/Users/devesh/work/airquality/"
>>
>> aq <- read.df(sqlContext,path,source = "com.databricks.spark.csv",
>> header="true", inferSchema="true")
>>
>> head(dropna(aq,how="any"))
>>
>> I am getting the output as
>>
>>   Ozone Solar_R Wind Temp Month Day
>> 1    41     190  7.4   67     5   1
>> 2    36     118  8.0   72     5   2
>> 3    12     149 12.6   74     5   3
>> 4    18     313 11.5   62     5   4
>> 5    NA      NA 14.3   56     5   5
>> 6    28      NA 14.9   66     5   6
>>
>> The NAs still exist in the output. Am I missing something here?
>>
>> --
>> Warm regards,
>> Devesh.
>>
>
>


-- 
Warm regards,
Devesh.
