[
https://issues.apache.org/jira/browse/SPARK-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-11907:
---------------------------------
Labels: bulk-closed (was: )
> Allowing errors as values in DataFrames (like 'Either Left/Right')
> ------------------------------------------------------------------
>
> Key: SPARK-11907
> URL: https://issues.apache.org/jira/browse/SPARK-11907
> Project: Spark
> Issue Type: Wish
> Components: SQL
> Reporter: Tycho Grouwstra
> Priority: Major
> Labels: bulk-closed
>
> I like Spark, but one thing I find funny about it is that it is picky about
> circumstantial errors. For one, given the following:
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val rows = (1,"a") :: (2,"b") :: (3,"c") :: (0,"d") :: Nil
> val df = sqlContext.createDataFrame(sc.parallelize(rows)).toDF("num","let")
> val div = udf[Double, Integer](10 / _)
> df.withColumn("div", div(col("num"))).show()
> {code}
> ... the job fails with a `java.lang.ArithmeticException: / by zero`.
> The example is trivial, but my point is, if one thing goes wrong, the rest
> goes right, why throw out the baby with the bathwater when you could both
> show what went wrong as well as went right?
> Instead, I would propose allowing to use raised Exceptions as resulting
> values, not unlike how one might store 'bad' results using Either Left/Right
> constructions in Scala/Haskell (which I suppose would not currently work in
> DFs, lacking serializability), or cells containing errors in MS Excel.
> As a solution, I would propose a DataFrame subclass (?) using a variant of
> NullableColumnBuilder, e.g. ErrorableColumnBuilder (/ SafeColumnBuilder?).
> NullableColumnBuilder currently explains its workings as follows:
> {code}
> /**
> * A stackable trait used for building byte buffer for a column containing
> null values. Memory
> * layout of the final byte buffer is:
> * {{{
> * .------------------- Null count N (4 bytes)
> * | .--------------- Null positions (4 x N bytes, empty if null count
> is zero)
> * | | .--------- Non-null elements
> * V V V
> * +---+-----+---------+
> * | | ... | ... ... |
> * +---+-----+---------+
> * }}}
> */
> {code}
> This might be extended by adding a further section storing Throwables (or
> null) for the bad values in question (alt: store count/positions separately
> from null ones so null values would not need to be stored).
> Don't get me wrong, there is nothing with throwing exceptions (or catching
> them for that matter). Rather, I see a use cases for both "do it right or
> bust" vs. the explorative "show me what happens if I try this operation on
> these values" -- not unlike how languages as Ruby/Elixir might distinguish
> unsafe methods using a bang ('!') from their safe variants that should not
> throw global exceptions.
> I'm sort of new here but would be glad to get some opinions on this idea.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]