Re: question on transforms for spark 2.0 dataset
Subhash,

Yea, that did the trick, thanks!
Re: question on transforms for spark 2.0 dataset
If I am understanding your problem correctly, I think you can just create a new DataFrame that is a transformation of sample_data by first registering sample_data as a temp table.

// Register temp table
sample_data.createOrReplaceTempView("sql_sample_data")

// Create a new Dataset with transformed values
val transformed = spark.sql("select trim(field1) as field1, trim(field2) as field2.. from sql_sample_data")

// Test
transformed.show(10)

I hope that helps!
Subhash
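For comparison, the same transform can be written with the DataFrame API instead of SQL. A minimal sketch along the same lines, assuming the four field names from the original post; the built-in trim here is org.apache.spark.sql.functions.trim:

import org.apache.spark.sql.functions.{col, trim}

// Columns to clean up (names taken from the original post).
val fieldsToTrim = Seq("field1", "field2", "field3", "field4")

// Fold over the column names, replacing each column with its trimmed value.
val transformed = fieldsToTrim.foldLeft(sample_data) { (df, name) =>
  df.withColumn(name, trim(col(name)))
}

transformed.show(10)

Any columns not in the list pass through unchanged, which avoids having to spell out every field in the select.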
Re: question on transforms for spark 2.0 dataset
Hi, I think you need a UDF if you want to transform a column.
Hope that helps!
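A rough sketch of what that UDF route could look like — the UDF name and the null handling are my own additions, and for a plain trim the built-in org.apache.spark.sql.functions.trim is usually simpler:

import org.apache.spark.sql.functions.{col, udf}

// A null-safe trim wrapped as a UDF (trimUdf is an illustrative name).
val trimUdf = udf((s: String) => if (s == null) null else s.trim)

val transformed = sample_data
  .withColumn("field1", trimUdf(col("field1")))
  .withColumn("field2", trimUdf(col("field2")))

A UDF is opaque to the Catalyst optimizer, so the built-in column function is generally preferable when one exists.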
question on transforms for spark 2.0 dataset
Hi all,

I'm fairly new to Spark and Scala, so bear with me.

I'm working with a dataset containing a set of columns/fields. The data is stored in HDFS as Parquet and is sourced from a Postgres box, so fields and values are reasonably well formed. We are in the process of trying out a switch from Pentaho and various SQL databases to pulling data into HDFS and applying transforms / building new datasets, with the processing done in Spark (and other tools, under evaluation).

A rough version of the code I'm running so far:

val sample_data = spark.read.parquet("my_data_input")

val example_row = spark.sql("select * from parquet.my_data_input where id = 123").head

I want to apply a trim operation to a set of fields - let's call them field1, field2, field3, and field4.

What is the best way to go about applying those trims and creating a new dataset? Can I apply the trim to all fields in a single map, or do I need to apply multiple map functions?

When I try the map (even with a single field):

scala> val transformed_data = sample_data.map(
     |   _.trim(col("field1"))
     |    .trim(col("field2"))
     |    .trim(col("field3"))
     |    .trim(col("field4"))
     | )

I end up with the following error:

<console>:26: error: value trim is not a member of org.apache.spark.sql.Row
         _.trim(col("field1"))
           ^

Any ideas / guidance would be appreciated!
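The error happens because the function passed to map receives an org.apache.spark.sql.Row, and Row has no trim method; trim is a Column function in org.apache.spark.sql.functions. If a typed map is really wanted, the Row has to be unpacked and rebuilt by hand. A rough Spark 2.0 sketch, assuming every string column in the schema should be trimmed (note that RowEncoder lives in a semi-internal catalyst package):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder

// A map over Rows needs an explicit encoder; reuse the input schema.
implicit val enc = RowEncoder(sample_data.schema)

val transformed_data = sample_data.map { row =>
  Row.fromSeq(row.toSeq.map {
    case s: String => s.trim // trim every string field
    case other     => other  // leave non-string values untouched
  })
}

In practice, the SQL and withColumn approaches earlier in the thread are simpler and leave the trims visible to the optimizer.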