[ 
https://issues.apache.org/jira/browse/SPARK-12911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15108803#comment-15108803
 ] 

Stephen DiCocco commented on SPARK-12911:
-----------------------------------------

I've been working on this with Jesse.  What is happening is a datatype 
mismatch.  When you do not cache, the underlying types of both the array in 
the dataframe and the literal array are GenericArrayData.  However, when you 
cache the dataframe, the underlying type of the array in the dataframe becomes 
UnsafeArrayData.  The literal, however, is still a GenericArrayData, so the 
comparison fails.
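The shape of the problem can be sketched without Spark, using two plain Scala 
collection types as stand-ins (these are stand-ins only, not the actual 
Catalyst classes GenericArrayData and UnsafeArrayData):

{code:title=sketch.scala|borderStyle=solid}
// Two representations holding the same elements. Seq stands in for the
// "generic" representation and Array for the "unsafe" one; the names are
// illustrative, not the Catalyst classes themselves.
val genericLike: Seq[Long] = Seq(0L, 9L)
val unsafeLike: Array[Long] = Array(0L, 9L)

// Element-wise comparison of the contents succeeds...
assert(genericLike.sameElements(unsafeLike))

// ...but equals-based comparison across the two representations fails,
// which is the same failure mode the cached-vs-literal filter runs into.
assert(genericLike != unsafeLike)
{code}

The point is that equality is defined per representation, so two containers 
with identical contents can still compare unequal.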

Is there a method to define that constant array that would make it an 
UnsafeArrayData?

It seems counterintuitive that the developer should need that much insight 
into the underlying storage of the array in order to do comparisons properly.
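One possible workaround, sketched here untested against 1.6: push the 
comparison into a UDF so it happens on deserialized Scala values rather than 
on Catalyst's internal array representations. The {{dataframe}} and 
{{point}} names are assumed from the reproduction below.

{code:title=workaround.scala|borderStyle=solid}
import org.apache.spark.sql.functions.udf

// Array columns arrive in a UDF as Seq[Long], so an element-wise Seq
// comparison sidesteps the GenericArrayData/UnsafeArrayData mismatch.
val target = Seq(0L, 9L)
val matchesTarget = udf((point: Seq[Long]) => point == target)

val targetRow = dataframe.where(matchesTarget(dataframe("point"))).first()
{code}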

Any direction would be really helpful.

Thanks

> Cacheing a dataframe causes array comparisons to fail (in filter / where) 
> after 1.6
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-12911
>                 URL: https://issues.apache.org/jira/browse/SPARK-12911
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>         Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.6.0
>            Reporter: Jesse English
>
> When doing a *where* operation on a dataframe and testing for equality on an 
> array type, after 1.6 no valid comparisons are made if the dataframe has been 
> cached.  If it has not been cached, the results are as expected.
> This appears to be related to the underlying unsafe array data types.
> {code:title=test.scala|borderStyle=solid}
> test("test array comparison") {
>     val vectors: Vector[Row] =  Vector(
>       Row.fromTuple("id_1" -> Array(0L, 2L)),
>       Row.fromTuple("id_2" -> Array(0L, 5L)),
>       Row.fromTuple("id_3" -> Array(0L, 9L)),
>       Row.fromTuple("id_4" -> Array(1L, 0L)),
>       Row.fromTuple("id_5" -> Array(1L, 8L)),
>       Row.fromTuple("id_6" -> Array(2L, 4L)),
>       Row.fromTuple("id_7" -> Array(5L, 6L)),
>       Row.fromTuple("id_8" -> Array(6L, 2L)),
>       Row.fromTuple("id_9" -> Array(7L, 0L))
>     )
>     val data: RDD[Row] = sc.parallelize(vectors, 3)
>     val schema = StructType(
>       StructField("id", StringType, false) ::
>         StructField("point", DataTypes.createArrayType(LongType, false), false) ::
>         Nil
>     )
>     val sqlContext = new SQLContext(sc)
>     val dataframe = sqlContext.createDataFrame(data, schema)
>     val targetPoint:Array[Long] = Array(0L,9L)
>     //Cacheing is the trigger to cause the error (no cacheing causes no error)
>     dataframe.cache()
>     //This is the line where it fails
>     //java.util.NoSuchElementException: next on empty iterator
>     //However we know that there is a valid match
>     val targetRow = dataframe.where(dataframe("point") === array(targetPoint.map(value => lit(value)): _*)).first()
>     assert(targetRow != null)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
