Re: specifing schema on dataframe

2017-02-06 Thread Sam Elamin
Ah OK, thanks for clearing it up, Ayan! I will give that a go. Thank you all for your help, this mailing list is awesome! On Mon, Feb 6, 2017 at 9:07 AM, ayan guha wrote: > If I am not missing anything here, "So I know which columns are numeric > and which aren't because I

Re: specifing schema on dataframe

2017-02-06 Thread ayan guha
If I am not missing anything here, "So I know which columns are numeric and which aren't because I have a StructType and all the internal StructFields will tell me which ones have a DataType which is numeric and which aren't" will get you a list of the fields which should be numeric.
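
A minimal sketch of deriving that list, assuming the expected schema is available in the JSON form that `StructType.jsonValue()` produces (a dict with a `fields` list whose entries carry `name` and `type`). The helper name `numeric_field_names` and the sample schema are hypothetical and not from the thread; the code works on the plain dict so it runs without a Spark session.

```python
# Type names Spark uses for numeric types in a schema's JSON form.
NUMERIC_TYPE_NAMES = {"byte", "short", "integer", "long", "float", "double"}


def numeric_field_names(schema_json):
    """Return the names of top-level fields whose declared type is numeric."""
    names = []
    for field in schema_json["fields"]:
        field_type = field["type"]
        # Nested types (struct, array, map) appear as dicts, not strings.
        if not isinstance(field_type, str):
            continue
        # Decimals are rendered with precision and scale, e.g. "decimal(10,2)".
        if field_type in NUMERIC_TYPE_NAMES or field_type.startswith("decimal"):
            names.append(field["name"])
    return names


# Hypothetical expected schema: customerid should be numeric, foo stays a string.
expected_schema = {
    "type": "struct",
    "fields": [
        {"name": "customerid", "type": "double", "nullable": True, "metadata": {}},
        {"name": "foo", "type": "string", "nullable": True, "metadata": {}},
    ],
}

print(numeric_field_names(expected_schema))  # ['customerid']
```

The resulting list plays the role of `numeric_field_list` in the pyspark example elsewhere in this thread, except that it is computed from the schema rather than typed by hand.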

Re: specifing schema on dataframe

2017-02-06 Thread Sam Elamin
Yup, sorry, I should have explained myself better. So I know which columns are numeric and which aren't because I have a StructType and all the internal StructFields will tell me which ones have a DataType which is numeric and which aren't. So assuming I have a JSON string which has double quotes on

Re: specifing schema on dataframe

2017-02-06 Thread ayan guha
Umm, I think the premise is you need to "know" beforehand which columns are numeric. Unless you know it, how would you apply the schema? On Mon, Feb 6, 2017 at 7:54 PM, Sam Elamin wrote: > Thanks ayan but I meant how to derive the list automatically > > In your

Re: specifing schema on dataframe

2017-02-06 Thread Sam Elamin
Thanks ayan, but I meant how to derive the list automatically. In your example you are specifying the numeric columns, and I would like it to be applied to any schema, if that makes sense. On Mon, 6 Feb 2017 at 08:49, ayan guha wrote: > Simple (pyspark) example: > > >>> df =

Re: specifing schema on dataframe

2017-02-06 Thread ayan guha
Simple (pyspark) example:
>>> df = sqlContext.read.json("/user/l_aguha/spark_qs.json")
>>> df.printSchema()
root
 |-- customerid: string (nullable = true)
 |-- foo: string (nullable = true)
>>> numeric_field_list = ['customerid']
>>> for k in numeric_field_list:
...     df =

Re: specifing schema on dataframe

2017-02-05 Thread Sam Elamin
OK, thanks Michael! Can I get an idea of where to start? Assuming I have the end schema and the current dataframe, how can I loop through it and create a new dataframe using withColumn? Am I iterating through the dataframe or the schema? I'm assuming it's easier to iterate through the

Re: specifing schema on dataframe

2017-02-05 Thread Michael Armbrust
If you already have the expected schema, and you know that all numbers will always be formatted as strings in the input JSON, you could probably derive this list automatically. > Wouldn't it be simpler to just regex replace the numbers to remove the > quotes? I think this is likely to be a slower

Re: specifing schema on dataframe

2017-02-05 Thread Sam Elamin
I see, so for the connector I need to pass in an array/list of numerical columns? Wouldn't it be simpler to just regex replace the numbers to remove the quotes? Regards Sam On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust wrote: > Specifying the schema when parsing
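
A rough sketch of the regex idea, in plain Python and outside Spark. It assumes the raw JSON text is available as a string and that only the values of named keys should be unquoted; the function name `unquote_numbers` and the sample record are made up for illustration. A blanket pattern would also unquote strings that merely look numeric, so this version is restricted to the listed keys.

```python
import json
import re


def unquote_numbers(raw_json, numeric_keys):
    """Strip the double quotes around numeric values of the given keys."""
    for key in numeric_keys:
        # Match "key": "123.45" and keep everything except the value's quotes.
        pattern = r'("' + re.escape(key) + r'"\s*:\s*)"(-?\d+(?:\.\d+)?)"'
        raw_json = re.sub(pattern, r"\1\2", raw_json)
    return raw_json


# Hypothetical record with a quoted number, as described in the thread.
record = '{"customerid": "535137", "foo": "bar"}'
cleaned = unquote_numbers(record, ["customerid"])
print(cleaned)              # {"customerid": 535137, "foo": "bar"}
print(json.loads(cleaned))  # {'customerid': 535137, 'foo': 'bar'}
```

Note that the list of numeric keys is still needed, and that once the quotes are gone a JSON parser may read large or high-precision values as floating point, which is the precision concern raised elsewhere in this thread.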

Re: specifing schema on dataframe

2017-02-05 Thread Michael Armbrust
Specifying the schema when parsing JSON will only let you pick between similar datatypes (i.e., should this be a short, long, float, double, etc.). It will not let you perform conversions like string <-> number. This has to be done with explicit casts after the data has been loaded. I think you can
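
One way the explicit casts might be expressed, as a sketch only: build one SQL expression per column and hand them to `DataFrame.selectExpr`. The helper `cast_expressions` and the column-to-type mapping are assumptions for illustration; the block only builds the expression strings, so it runs without a Spark session.

```python
def cast_expressions(columns, target_types):
    """Build one SQL expression per column, casting where a target type is given."""
    exprs = []
    for name in columns:
        if name in target_types:
            # Cast the string column and keep its original name.
            exprs.append(
                "CAST(`{0}` AS {1}) AS `{0}`".format(name, target_types[name])
            )
        else:
            # Columns without a target type pass through unchanged.
            exprs.append("`{0}`".format(name))
    return exprs


# Hypothetical input: the loaded columns and the types they should end up with.
exprs = cast_expressions(["customerid", "foo"], {"customerid": "double"})
print(exprs)

# With a loaded DataFrame `df` (not created here), the casts could be applied as:
#   df = df.selectExpr(*exprs)
# or column by column:
#   df = df.withColumn("customerid", df["customerid"].cast("double"))
```

Building all expressions first and applying them in a single `selectExpr` keeps the conversion in one projection instead of one `withColumn` call per field.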

Re: specifing schema on dataframe

2017-02-05 Thread Sam Elamin
Thanks Michael. I've been spending the past few days researching this. The problem is that the generated JSON has double quotes on fields that are numbers because the producing datastore doesn't want to lose precision. I can change the data type, true, but that would be specific to a job rather than a

Re: specifing schema on dataframe

2017-02-05 Thread Michael Armbrust
-dev You can use withColumn to change the type after the data has been loaded. On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin

Re: specifing schema on dataframe

2017-02-04 Thread Sam Elamin
Hi Dirceu, thanks, you're right! That did work. But now I'm facing an even bigger problem: since I don't have access to change the underlying data, I just want to apply a schema over something that was written via sparkContext.newAPIHadoopRDD. Basically I am reading in an RDD[JsonObject] and would

Re: specifing schema on dataframe

2017-02-04 Thread Dirceu Semighini Filho
Hi Sam, remove the " from the number and it will work. On Feb 4, 2017 11:46 AM, "Sam Elamin" wrote: > Hi All > > I would like to specify a schema when reading from a json but when trying > to map a number to a Double it fails, I tried FloatType and IntType with

specifing schema on dataframe

2017-02-04 Thread Sam Elamin
Hi All, I would like to specify a schema when reading from JSON, but when trying to map a number to a Double it fails. I tried FloatType and IntType with no joy! When inferring the schema, customer id is set to String, and I would like to cast it as Double, so df1 is corrupted while df2 shows