[ https://issues.apache.org/jira/browse/SPARK-12911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115459#comment-15115459 ]
Stephen DiCocco commented on SPARK-12911:
-----------------------------------------
We have found one way to work around the issue: add the array you want to
search for as a literal column on the dataframe, then cache the frame. This
causes the underlying type of both columns to be UnsafeArrayData, so the
equality comparison succeeds.
{code}
test("test array comparison") {
  val vectors: Vector[Row] = Vector(
    Row.fromTuple("id_1" -> Array(0L, 2L)),
    Row.fromTuple("id_2" -> Array(0L, 5L)),
    Row.fromTuple("id_3" -> Array(0L, 9L)),
    Row.fromTuple("id_4" -> Array(1L, 0L)),
    Row.fromTuple("id_5" -> Array(1L, 8L)),
    Row.fromTuple("id_6" -> Array(2L, 4L)),
    Row.fromTuple("id_7" -> Array(5L, 6L)),
    Row.fromTuple("id_8" -> Array(6L, 2L)),
    Row.fromTuple("id_9" -> Array(7L, 0L))
  )
  val data: RDD[Row] = sc.parallelize(vectors, 3)
  val schema = StructType(
    StructField("id", StringType, false) ::
    StructField("point", DataTypes.createArrayType(LongType, false), false) ::
    Nil
  )
  val sqlContext = new SQLContext(sc)
  val dataframe = sqlContext.createDataFrame(data, schema)
  val targetPoint: Array[Long] = Array(0L, 9L)

  // Adding the target column to the frame allows the comparison to succeed,
  // but there is definite overhead to doing this!
  val withTarget = dataframe.withColumn("target", array(targetPoint.map(value => lit(value)): _*))
  withTarget.cache()

  // Without the workaround, the line below fails with
  // java.util.NoSuchElementException: next on empty iterator
  // even though we know there is a valid match.
  val targetRow = withTarget.where(withTarget("point") === withTarget("target")).first()
  assert(targetRow != null)
}
{code}
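A sketch of an alternative we have not benchmarked: do the equality check in a UDF, so it runs on deserialized Seq[Long] values instead of on the cached UnsafeArrayData representation. The helper name {{seqEquals}} and the wiring below are hypothetical, not something from Spark's API:

```scala
// Hypothetical alternative workaround (untested against 1.6): compare the
// arrays element-wise in plain Scala instead of via Column equality.
def seqEquals(a: Seq[Long], b: Seq[Long]): Boolean = a == b

// In the test above this would be wired up roughly as:
//   val matchesTarget = udf((p: Seq[Long]) => seqEquals(p, targetPoint.toSeq))
//   val targetRow = dataframe.where(matchesTarget(dataframe("point"))).first()
```

This sidesteps the unsafe-row comparison entirely, at the cost of a per-row deserialization into Scala collections.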
> Cacheing a dataframe causes array comparisons to fail (in filter / where) after 1.6
> -----------------------------------------------------------------------------------
>
> Key: SPARK-12911
> URL: https://issues.apache.org/jira/browse/SPARK-12911
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.6.0
> Reporter: Jesse English
>
> When doing a *where* operation on a dataframe and testing for equality on an
> array type, after 1.6 no valid comparisons are made if the dataframe has been
> cached. If it has not been cached, the results are as expected.
> This appears to be related to the underlying unsafe array data types.
> {code:title=test.scala|borderStyle=solid}
> test("test array comparison") {
>   val vectors: Vector[Row] = Vector(
>     Row.fromTuple("id_1" -> Array(0L, 2L)),
>     Row.fromTuple("id_2" -> Array(0L, 5L)),
>     Row.fromTuple("id_3" -> Array(0L, 9L)),
>     Row.fromTuple("id_4" -> Array(1L, 0L)),
>     Row.fromTuple("id_5" -> Array(1L, 8L)),
>     Row.fromTuple("id_6" -> Array(2L, 4L)),
>     Row.fromTuple("id_7" -> Array(5L, 6L)),
>     Row.fromTuple("id_8" -> Array(6L, 2L)),
>     Row.fromTuple("id_9" -> Array(7L, 0L))
>   )
>   val data: RDD[Row] = sc.parallelize(vectors, 3)
>   val schema = StructType(
>     StructField("id", StringType, false) ::
>     StructField("point", DataTypes.createArrayType(LongType, false), false) ::
>     Nil
>   )
>   val sqlContext = new SQLContext(sc)
>   val dataframe = sqlContext.createDataFrame(data, schema)
>   val targetPoint: Array[Long] = Array(0L, 9L)
>   // Caching is the trigger for the error (without caching there is no error)
>   dataframe.cache()
>   // This is the line where it fails with
>   // java.util.NoSuchElementException: next on empty iterator
>   // even though we know there is a valid match
>   val targetRow = dataframe.where(dataframe("point") ===
>     array(targetPoint.map(value => lit(value)): _*)).first()
>   assert(targetRow != null)
> }
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)