Re: Treating NaN fields in Spark

2016-09-29 Thread Mich Talebzadeh
Thanks Michael. I realised that just checking for Volume > 0 should do it: val rs = df2.filter($"Volume".cast("Integer") > 0). On your point "Again, why not remove the rows where the volume of trades is 0?", are you referring to the below? scala> val rs = df2.filter($"Volume".cast("Integer") === ...
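A minimal sketch of the filter being discussed, assuming the all-string df2 built earlier in the thread; casting a rogue value such as "-" to Integer yields null, and null > 0 does not evaluate to true, so those rows drop out along with the zero-volume trades:

    scala> val rs = df2.filter($"Volume".cast("Integer") > 0)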

Fwd: tod...@yahoo-inc.com is no longer with Yahoo! (was: Re: Treating NaN fields in Spark)

2016-09-29 Thread Michael Segel
From: <postmas...@yahoo-inc.com> Subject: tod...@yahoo-inc.com is no longer with Yahoo! (was: Re: Treating NaN fields in Spark) Date: September 29, 2016 at 10:56:10 AM CDT To: <msegel_had...@hotmail.com> This is an automat...

Re: Treating NaN fields in Spark

2016-09-29 Thread Michael Segel
On Sep 29, 2016, at 10:29 AM, Mich Talebzadeh wrote: Good points :) would it take "-" as a negative number, -123456? Yeah... you have to go down a level and start to remember that you're dealing with a stream or buffer of bytes below...
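For what it's worth, a quick Scala REPL check suggests a bare "-" does not parse as a number at all; it throws rather than being read as some arbitrary negative value:

    scala> "-".toDouble
    java.lang.NumberFormatException: For input string: "-"

    scala> "-123456".toDouble
    res1: Double = -123456.0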

Re: Treating NaN fields in Spark

2016-09-29 Thread Mich Talebzadeh
Good points :) would it take "-" as a negative number, -123456? At this moment in time this is what the code does (see the sketch below):
1. The csv is imported into HDFS as is; no cleaning of rogue columns is done at shell level.
2. The Spark program does the following filtration:
3. val rs = df2.filter($"Open"...
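A sketch of steps 2 and 3 under stated assumptions: the exact predicate on "Open" is truncated above, so the comparison used here is a guess, and the HDFS path is made up. Casting a rogue value like "-" to Double yields null, which fails the comparison and filters the row out:

    val df2 = spark.read.option("header", "true").csv("hdfs://namenode/path/to/prices.csv")
    val rs  = df2.filter($"Open".cast("Double") > 0 && $"Volume".cast("Integer") > 0)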

Re: Treating NaN fields in Spark

2016-09-29 Thread Peter Figliozzi
"isnan" ends up using a case class, subclass of UnaryExpression, called "IsNaN" which evaluates each row of the column like this: - *False* if the value is Null - Check the "Expression.Type" (apparently a Spark thing, not a Scala thing.. still learning here) - DoubleType: cast to

Re: Treating NaN fields in Spark

2016-09-29 Thread Michael Segel
Hi, just a few thoughts, so take them for what they're worth... Databases have static schemas and will reject a row's column on insert. In your case you have one data set where a column which is supposed to be a number arrives as a string, and you want to convert this to a double in your ...
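A minimal spark-shell sketch of that string-to-double conversion (toy values are made up), illustrating how Spark differs from a database insert: the cast does not reject the row, it quietly turns an unparseable string into null:

    val df = Seq("123.45", "-").toDF("Open")
    df.select($"Open".cast("Double").as("OpenAsDouble")).show()
    // "123.45" -> 123.45, "-" -> null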

Re: Treating NaN fields in Spark

2016-09-28 Thread Marco Mistroni
Hi Dr Mich, how about reading all csv columns as strings and then applying a UDF, sort of like this?

    import scala.util.control.Exception.allCatch

    def getDouble(doubleStr: String): Double = allCatch opt doubleStr.toDouble match {
      case Some(doubleNum) => doubleNum
      case _ => Double.NaN
    }
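To apply it, one could wrap getDouble as a UDF, e.g. (assuming the all-string df2 from earlier in the thread):

    import org.apache.spark.sql.functions.udf

    val getDoubleUdf = udf(getDouble _)
    val cleaned = df2.withColumn("Open", getDoubleUdf($"Open"))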

Re: Treating NaN fields in Spark

2016-09-28 Thread Peter Figliozzi
In Scala, x.isNaN returns true for Double.NaN, but false for any character. I guess the `isnan` function you are using works by ultimately looking at x.isNaN. On Wed, Sep 28, 2016 at 5:56 AM, Mich Talebzadeh wrote: > This is an issue in most databases. Specifically ...
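A quick REPL check of that behaviour:

    scala> Double.NaN.isNaN
    res0: Boolean = true

    scala> 1.0.isNaN
    res1: Boolean = false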

Treating NaN fields in Spark

2016-09-28 Thread Mich Talebzadeh
This is an issue in most databases, specifically if a field is NaN (*NaN*, standing for "not a number", is a numeric data type value representing an undefined or unrepresentable value, especially in floating-point calculations). There is a method called isnan() in Spark that is supposed to ...
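For example, a minimal spark-shell sketch of isnan() used to screen out NaN rows (toy data and the column name are made up):

    import org.apache.spark.sql.functions.isnan

    val prices = Seq(12.5, Double.NaN, 9.75).toDF("Close")
    prices.filter(!isnan($"Close")).show()   // keeps 12.5 and 9.75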