Maybe not ideal, but since read.df infers every CSV column that contains
"NA" as a string column, one could filter those rows out rather than using
dropna():

filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")

Perhaps it would be better to have an option for read.df to convert any
"NA" it encounters into null types, like createDataFrame does for <NA>, and
then one would be able to use dropna() etc.
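
Another possible workaround, as a rough sketch rather than a tested recipe:
cast the affected columns to a numeric type. Spark SQL turns a string it
cannot parse as a number (such as "NA") into null when casting, after which
dropna() should behave as expected. The small local_df below is a made-up
stand-in for the frame that read.df returned, and the sketch assumes
SPARK_HOME points at a Spark 1.5.x install like the one in the original post.

```r
library(SparkR)

# Assumption: SPARK_HOME is set and points at a Spark 1.5.x installation.
sc <- sparkR.init("local")
sqlContext <- sparkRSQL.init(sc)

# Hypothetical stand-in for the read.df result: numeric columns that arrived
# as strings, with missing values spelled as the literal text "NA".
local_df <- data.frame(Ozone = c("41", "NA", "28"),
                       Solar_R = c("190", "NA", "NA"),
                       Wind = c(7.4, 14.3, 14.9),
                       stringsAsFactors = FALSE)
aq <- createDataFrame(sqlContext, local_df)

# Casting a non-numeric string such as "NA" to integer yields null.
aq$Ozone <- cast(aq$Ozone, "integer")
aq$Solar_R <- cast(aq$Solar_R, "integer")

# The former "NA" strings are now real nulls, so dropna() can remove the rows.
cleaned <- dropna(aq, how = "any")
head(cleaned)
```

Unlike the filter above, this also leaves Ozone and Solar_R as integer
columns, which is probably what one wants for later analysis.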

On Mon, Jan 25, 2016 at 3:24 AM, Devesh Raj Singh wrote:

> Hi,
> Yes you are right.
> I think the problem is with how the CSV file is read: read.df is not
> treating the NAs in the CSV file as missing values.
> So what would be a workable solution for dealing with NAs in CSV files?
> On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel wrote:
>> Hi Devesh,
>> I'm not certain why that's happening, and it looks like it doesn't happen
>> if you use createDataFrame directly:
>> aq <- createDataFrame(sqlContext,airquality)
>> head(dropna(aq,how="any"))
>> If I had to guess: dropna(), I believe, drops null values. I suppose it's
>> possible that createDataFrame converts R's <NA> values to null, so dropna()
>> works with that. But perhaps read.df() does not convert R <NA>s to null, as
>> those are most likely interpreted as strings when they come in from the
>> csv. Just a guess, can anyone confirm?
>> Deb
>> On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh wrote:
>>> Hi,
>>> I have applied the following code to the airquality dataset available in
>>> R, which has some missing values. I want to omit the rows that have NAs.
>>> library(SparkR)
>>> Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
>>> sc <- sparkR.init("local",sparkHome =
>>> "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")
>>> sqlContext <- sparkRSQL.init(sc)
>>> path<-"/Users/devesh/work/airquality/"
>>> aq <- read.df(sqlContext,path,source = "com.databricks.spark.csv",
>>> header="true", inferSchema="true")
>>> head(dropna(aq,how="any"))
>>> I am getting the output as
>>>   Ozone Solar_R Wind Temp Month Day
>>> 1    41     190  7.4   67     5   1
>>> 2    36     118  8.0   72     5   2
>>> 3    12     149 12.6   74     5   3
>>> 4    18     313 11.5   62     5   4
>>> 5    NA      NA 14.3   56     5   5
>>> 6    28      NA 14.9   66     5   6
>>> The NAs still exist in the output. Am I missing something here?
>>> --
>>> Warm regards,
>>> Devesh.
> --
> Warm regards,
> Devesh.
