"isnan" ends up using a case class, subclass of UnaryExpression, called "IsNaN" which evaluates each row of the column like this:
- *False* if the value is Null
- Otherwise, check the "Expression.Type" (apparently a Spark thing, not a
  Scala thing... still learning here):
  - DoubleType: cast to Double and check .isNaN
  - FloatType: cast to Float and check .isNaN
- Casting is done by value.asInstanceOf[T]

What's interesting is that the "inputTypes" for this class are only
DoubleType and FloatType. Unfortunately, I haven't figured out how the code
would handle a String. Maybe someone could tell us how these Expressions
work?

In any case, we're not getting *True* back unless the value cast to a
Double actually returns Double.NaN. Strings cast to Double throw errors
(not Double.NaN), and the '-' character cast to Double returns 45, its
character code (!).

On Thu, Sep 29, 2016 at 7:45 AM, Michael Segel <msegel_had...@hotmail.com>
wrote:

> Hi,
>
> Just a few thoughts, so take it for what it's worth…
>
> Databases have static schemas and will reject a row's column on insert.
>
> In your case… you have one data set where you have a column which is
> supposed to be a number, but you have it as a string.
> You want to convert this to a double in your final data set.
>
> It looks like your problem is that your original data set used a '-'
> (dash) to represent missing data, rather than a NULL value.
>
> In fact, looking at the rows… you seem to have a stock that didn't trade
> for a given day. (All have Volume as 0.) Why do you need this? Wouldn't
> you want to represent this as null, or as no row for that date?
>
> The reason your '-' check failed with isnan() is that '-' actually could
> be represented as a number.
>
> If you replaced the '-' with a String that is wider than the width of a
> double… then isnan should flag the row.
>
> (I still need more coffee, so I could be wrong) ;-)
>
> HTH
>
> -Mike
>
> On Sep 28, 2016, at 5:56 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> This is an issue in most databases, specifically if a field is NaN
> (*NaN*, standing for "not a number", is a numeric data type value
> representing an undefined or unrepresentable value, especially in
> floating-point calculations).
>
> There is a method called isnan() in Spark that is supposed to handle this
> scenario. However, it does not return correct values! For example, I
> defined the column "Open" as String (it should be Float), and it has the
> following 7 rogue entries out of 1272 rows in a csv:
>
> df2.filter($"Open" === "-")
>    .select(changeToDate("TradeDate").as("TradeDate"),
>            'Open, 'High, 'Low, 'Close, 'Volume).show
>
> +----------+----+----+---+-----+------+
> | TradeDate|Open|High|Low|Close|Volume|
> +----------+----+----+---+-----+------+
> |2011-12-23|   -|   -|  -|40.56|     0|
> |2011-04-21|   -|   -|  -|45.85|     0|
> |2010-12-30|   -|   -|  -|38.10|     0|
> |2010-12-23|   -|   -|  -|38.36|     0|
> |2008-04-30|   -|   -|  -|32.39|     0|
> |2008-04-29|   -|   -|  -|33.05|     0|
> |2008-04-28|   -|   -|  -|32.60|     0|
> +----------+----+----+---+-----+------+
>
> However, the following does not work!
>
> df2.filter(isnan($"Open")).show
>
> +-----+------+---------+----+----+---+-----+------+
> |Stock|Ticker|TradeDate|Open|High|Low|Close|Volume|
> +-----+------+---------+----+----+---+-----+------+
> +-----+------+---------+----+----+---+-----+------+
>
> Any suggestions?
>
> Thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising
> from such loss, damage or destruction.
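P.S. For anyone following along, both points can be checked in plain Scala, no Spark required. The sketch below is my own illustration of IsNaN-style evaluation, not Spark's actual code; the DoubleType/FloatType objects here are hypothetical stand-ins for the real ones in org.apache.spark.sql.types:

```scala
// Hypothetical stand-ins for Spark's DataType objects, for illustration only.
sealed trait ColType
case object DoubleType extends ColType
case object FloatType extends ColType

// Sketch of IsNaN-style evaluation as described at the top of the thread:
// null is never NaN; otherwise branch on the declared type, cast with
// asInstanceOf, and call .isNaN.
def isNaNEval(value: Any, dataType: ColType): Boolean =
  if (value == null) false
  else dataType match {
    case DoubleType => value.asInstanceOf[Double].isNaN
    case FloatType  => value.asInstanceOf[Float].isNaN
  }

assert(isNaNEval(Double.NaN, DoubleType))  // a genuine NaN is flagged
assert(!isNaNEval(1.0, DoubleType))        // ordinary numbers are not
assert(!isNaNEval(null, DoubleType))       // null is not NaN

// The '-' *Char* converts to its character code, 45.0 -- so it really
// "could be represented as a number".
assert('-'.toDouble == 45.0)

// The "-" *String*, by contrast, simply fails to parse as a Double:
val parsed = try Some("-".toDouble)
             catch { case _: NumberFormatException => None }
assert(parsed.isEmpty)
```

Since a String column yields either a parse failure or a plain number, nothing ever produces an actual Double.NaN here, which would explain why the isnan() filter above comes back empty.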