RE: different behavior while using createDataFrame and read.df in SparkR
I guess the problem is:

    dummy.df<-withColumn(dataframe,paste0(colnames(cat.column),j),ifelse(column[[1]]==levels(as.factor(unlist(cat.column)))[j],1,0) )
    dataframe<-dummy.df

Once dataframe is re-assigned to reference a new DataFrame in each iteration, the column variable also has to be re-assigned to reference a column in the new DataFrame. Otherwise column[[1]] still points at a column of the original DataFrame.

From: Devesh Raj Singh [mailto:raj.deves...@gmail.com]
Sent: Saturday, February 6, 2016 8:31 PM
To: Sun, Rui <rui@intel.com>
Cc: user@spark.apache.org
Subject: Re: different behavior while using createDataFrame and read.df in SparkR
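The re-assignment described in this reply could look roughly like the sketch below. It is untested against Spark 1.5.1 and is not the poster's original function: it assumes SparkR is attached, takes the column name as a string (colName is a hypothetical parameter introduced here), and looks the column up again on the current DataFrame in every iteration.

```r
# Assumes library(SparkR) is attached and `dataframe` is a SparkR DataFrame.
# colName is a hypothetical parameter: the categorical column's name as a string.
BDADummies <- function(dataframe, colName) {
  # Bring the category values to the driver once and derive the levels locally.
  values <- collect(select(dataframe, colName))[[1]]
  lev <- levels(as.factor(values))

  for (j in seq_along(lev)) {
    # Re-resolve the column against the DataFrame of this iteration,
    # so the Column never refers to an earlier, replaced DataFrame.
    current <- dataframe[[colName]]
    dataframe <- withColumn(dataframe, paste0(colName, j),
                            ifelse(current == lev[j], 1, 0))
  }
  dataframe
}
```

Under these assumptions the call would become something like newdummy.df <- BDADummies(df1, "Species"), passing the column name instead of a selected DataFrame.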
Re: different behavior while using createDataFrame and read.df in SparkR
Thank you, Rui Sun, for the observation! It helped.

I have a new problem. When I create a small function that generates dummy variables for a categorical column:

    BDADummies<-function(dataframe,column){
      cat.column<-vector(mode="character",length=nrow(dataframe))
      cat.column<-collect(column)
      lev<-length(levels(as.factor(unlist(cat.column))))
      for (j in 1:lev){
        dummy.df<-withColumn(dataframe,paste0(colnames(cat.column),j),ifelse(column[[1]]==levels(as.factor(unlist(cat.column)))[j],1,0) )
        dataframe<-dummy.df
      }
      return(dataframe)
    }

and call it with

    newdummy.df<-BDADummies(df1,column=select(df1,df1$Species))

I get the error below:

    Error in withColumn(dataframe, paste0(colnames(cat.column), j), ifelse(column[[1]] == :
      error in evaluating the argument 'col' in selecting a method for function 'withColumn': Error in if (le > 0) paste0("[1:", paste(le), "]") else "(0)" :
      argument is not interpretable as logical

But when I run the statement on its own, without wrapping it in a function,

    dummy.df<-withColumn(dataframe,paste0(colnames(cat.column),j),ifelse(column[[1]]==levels(as.factor(unlist(cat.column)))[j],1,0) )

it gives me the new columns, with the column names generated as desired.

Warm regards,
Devesh.

On Sat, Feb 6, 2016 at 7:09 AM, Sun, Rui <rui@intel.com> wrote:
> I guess this is related to https://issues.apache.org/jira/browse/SPARK-11976
>
> When calling createDataFrame on iris, the “.” character in column names will be replaced with “_”.
>
> It seems that when you create a DataFrame from the CSV file, the “.” characters in column names are still there.
RE: different behavior while using createDataFrame and read.df in SparkR
I guess this is related to https://issues.apache.org/jira/browse/SPARK-11976

When calling createDataFrame on iris, the “.” character in column names will be replaced with “_”.

It seems that when you create a DataFrame from the CSV file, the “.” characters in column names are still there.

From: Devesh Raj Singh [mailto:raj.deves...@gmail.com]
Sent: Friday, February 5, 2016 2:44 PM
To: user@spark.apache.org
Cc: Sun, Rui
Subject: different behavior while using createDataFrame and read.df in SparkR
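One possible way to sidestep the mismatch, sketched here in plain R and not verified against Spark 1.5.1, is to rename the columns before the CSV is written, so the header no longer contains dots. This mirrors the replacement that createDataFrame is described as doing above; local.df and the output file name are illustrative assumptions.

```r
# iris ships with base R (datasets package).
local.df <- iris

# Replace every literal "." in the column names with "_".
# fixed = TRUE matters: without it "." is a regex that matches any character.
names(local.df) <- gsub(".", "_", names(local.df), fixed = TRUE)

print(names(local.df))

# Hypothetical output path; the resulting file would have underscore-style headers.
# write.csv(local.df, "iris_renamed.csv", row.names = FALSE)
```

Whether a CSV with such headers then behaves like the createDataFrame case when read through read.df has not been checked here.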
different behavior while using createDataFrame and read.df in SparkR
Hi,

I am using Spark 1.5.1.

When I do this:

    df <- createDataFrame(sqlContext, iris)

    #creating a new column for category "Setosa"
    df$Species1<-ifelse((df)[[5]]=="setosa",1,0)

    head(df)

output: new column created

      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1          5.1         3.5          1.4         0.2  setosa
    2          4.9         3.0          1.4         0.2  setosa
    3          4.7         3.2          1.3         0.2  setosa
    4          4.6         3.1          1.5         0.2  setosa
    5          5.0         3.6          1.4         0.2  setosa
    6          5.4         3.9          1.7         0.4  setosa

But when I save the iris dataset as a CSV file and try to read it and convert it to a SparkR DataFrame:

    df <- read.df(sqlContext,"/Users/devesh/Github/deveshgit2/bdaml/data/iris/",
                  source = "com.databricks.spark.csv",header = "true",inferSchema = "true")

and then try to create the new column:

    df$Species1<-ifelse((df)[[5]]=="setosa",1,0)

I get the error below:

    16/02/05 12:11:01 ERROR RBackendHandler: col on 922 failed
    Error in select(x, x$"*", alias(col, colName)) :
      error in evaluating the argument 'col' in selecting a method for function 'select': Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
      org.apache.spark.sql.AnalysisException: Cannot resolve column name "Sepal.Length" among (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species);
        at org.apache.spark.s

--
Warm regards,
Devesh.