"isnan" ends up using a case class, subclass of UnaryExpression, called "IsNaN" which evaluates each row of the column like this:
- *False* if the value is Null
- Otherwise, check the "Expression.Type" (apparently a Spark thing, not a
  Scala thing... still learning here):
  - DoubleType: cast to Double and check .isNaN
  - FloatType: cast to Float and check .isNaN
- Casting is done by value.asInstanceOf[T]

What's interesting is that the "inputTypes" for this class are only
DoubleType and FloatType. Unfortunately, I haven't figured out how the code
would handle a String. Maybe someone could tell us how these Expressions
work?

In any case, we're not getting *True* back unless the value cast to a
Double actually returns Double.NaN. Strings cast to Double throw errors
(not Double.NaN), and the '-' character cast to Double returns 45, its
character code (!).

On Thu, Sep 29, 2016 at 7:45 AM, Michael Segel <msegel_had...@hotmail.com>
wrote:

> Hi,
>
> Just a few thoughts, so take it for what it's worth…
>
> Databases have static schemas and will reject a row's column on insert.
>
> In your case… you have one data set where you have a column which is
> supposed to be a number, but you have it as a string.
> You want to convert this to a double in your final data set.
>
> It looks like your problem is that your original data set used a '-'
> (dash) to represent missing data, rather than a NULL value.
>
> In fact, looking at the rows… you seem to have a stock that didn't trade
> for a given day. (All have Volume as 0.) Why do you need this? Wouldn't
> you want to represent this as null, or as no row for that date?
>
> The reason your '-' check failed with isnan() is that '-' actually could
> be represented as a number.
>
> If you replaced the '-' with a String that is wider than the width of a
> double… then isnan should flag the row.
>
> (I still need more coffee, so I could be wrong) ;-)
>
> HTH
>
> -Mike
>
> On Sep 28, 2016, at 5:56 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> This is an issue in most databases, specifically if a field is NaN
> (*NaN*, standing for "not a number", is a numeric data type value
> representing an undefined or unrepresentable value, especially in
> floating-point calculations).
>
> There is a method called isnan() in Spark that is supposed to handle this
> scenario. However, it does not return correct values! For example, I
> defined the column "Open" as String (it should be Float), and it has the
> following 7 rogue entries out of 1272 rows in a csv:
>
> df2.filter($"Open" === "-")
>    .select(changeToDate("TradeDate").as("TradeDate"),
>            'Open, 'High, 'Low, 'Close, 'Volume).show
>
> +----------+----+----+---+-----+------+
> | TradeDate|Open|High|Low|Close|Volume|
> +----------+----+----+---+-----+------+
> |2011-12-23|   -|   -|  -|40.56|     0|
> |2011-04-21|   -|   -|  -|45.85|     0|
> |2010-12-30|   -|   -|  -|38.10|     0|
> |2010-12-23|   -|   -|  -|38.36|     0|
> |2008-04-30|   -|   -|  -|32.39|     0|
> |2008-04-29|   -|   -|  -|33.05|     0|
> |2008-04-28|   -|   -|  -|32.60|     0|
> +----------+----+----+---+-----+------+
>
> However, the following does not work!
>
> df2.filter(isnan($"Open")).show
>
> +-----+------+---------+----+----+---+-----+------+
> |Stock|Ticker|TradeDate|Open|High|Low|Close|Volume|
> +-----+------+---------+----+----+---+-----+------+
> +-----+------+---------+----+----+---+-----+------+
>
> Any suggestions?
>
> Thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising
> from such loss, damage or destruction.
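P.S. For anyone following along, both points can be checked in plain Scala, no Spark required. The sketch below is my own illustration of IsNaN-style evaluation, not Spark's actual code; the DoubleType/FloatType objects here are hypothetical stand-ins for the real ones in org.apache.spark.sql.types:

```scala
// Hypothetical stand-ins for Spark's DataType objects, for illustration only.
sealed trait ColType
case object DoubleType extends ColType
case object FloatType extends ColType

// Sketch of IsNaN-style evaluation as described at the top of the thread:
// null is never NaN; otherwise branch on the declared type, cast with
// asInstanceOf, and call .isNaN.
def isNaNEval(value: Any, dataType: ColType): Boolean =
  if (value == null) false
  else dataType match {
    case DoubleType => value.asInstanceOf[Double].isNaN
    case FloatType  => value.asInstanceOf[Float].isNaN
  }

assert(isNaNEval(Double.NaN, DoubleType))  // a genuine NaN is flagged
assert(!isNaNEval(1.0, DoubleType))        // ordinary numbers are not
assert(!isNaNEval(null, DoubleType))       // null is not NaN

// The '-' *Char* converts to its character code, 45.0 -- so it really
// "could be represented as a number".
assert('-'.toDouble == 45.0)

// The "-" *String*, by contrast, simply fails to parse as a Double:
val parsed = try Some("-".toDouble)
             catch { case _: NumberFormatException => None }
assert(parsed.isEmpty)
```

Since a String column yields either a parse failure or a plain number, nothing ever produces an actual Double.NaN here, which would explain why the isnan() filter above comes back empty.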