[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645862#comment-16645862 ]

Wenchen Fan commented on SPARK-25378:
-------------------------------------

TF 1.12 will be released soon, and I believe the 2.4 release won't be out before that, so I marked this as won't fix.

> ArrayData.toArray(StringType) assume UTF8String in 2.4
> ------------------------------------------------------
>
>                 Key: SPARK-25378
>                 URL: https://issues.apache.org/jira/browse/SPARK-25378
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Xiangrui Meng
>            Priority: Critical
>
> The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String
>   at org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634335#comment-16634335 ]

Xiangrui Meng commented on SPARK-25378:
---------------------------------------

I don't think I'm the right person to decide here because I know little about how UTF8String is used in Spark SQL. As a user, I do want to use spark-tensorflow-connector with the upcoming Spark 2.4 release. I already made the change in the TF connector to use ObjectType: https://github.com/tensorflow/ecosystem/pull/100. But they need to wait for the TF 1.12 release, which might come out in the second half of October. If we don't make the final 2.4 release by then, maybe we don't have to fix the 2.4 branch. The risk is that other data sources might have similar usage that will break, and we have no way of knowing.
[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628231#comment-16628231 ]

Wenchen Fan commented on SPARK-25378:
-------------------------------------

[~mengxr] what do you think? This is not a real compatibility issue, but is more like a special case for Spark's adoption.
[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628225#comment-16628225 ]

Liang-Chi Hsieh commented on SPARK-25378:
-----------------------------------------

Do we have a decision on this yet?
[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618588#comment-16618588 ]

Liang-Chi Hsieh commented on SPARK-25378:
-----------------------------------------

Hmm, have we decided to include a fix in 2.4?
[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16614200#comment-16614200 ]

Liang-Chi Hsieh commented on SPARK-25378:
-----------------------------------------

The fix looks like: https://github.com/apache/spark/compare/master...viirya:SPARK-25378?expand=1

If this looks ok, I can submit a PR with it. cc [~mengxr] [~cloud_fan] [~hyukjin.kwon]. Please let me know. Thanks.
[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16613157#comment-16613157 ]

Liang-Chi Hsieh commented on SPARK-25378:
-----------------------------------------

I think a quick fix is to use the general `get` method for just `StringType` in `InternalRow.getAccessor`. This preserves the backward-compatible behavior for `StringType` when calling `toArray`. We can then consider switching back to `getUTF8String` in 3.0. WDYT? [~mengxr] [~cloud_fan] [~hyukjin.kwon]
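
To illustrate the idea of falling back to the general `get` for `StringType` only, here is a minimal toy model in plain Scala. It is not Spark's source: `Utf8`, `GenericArray`, `Accessors`, `strict` and `lenient` are hypothetical names invented for this sketch, and the local `StringType` and `IntegerType` only imitate Spark's type names. It assumes that the typed getter casts each element while the general getter returns it untouched.

```scala
// Toy model only: these types imitate names from the thread and are NOT Spark's classes.
final case class Utf8(value: String) // stand-in for UTF8String

sealed trait DataType
case object StringType extends DataType
case object IntegerType extends DataType

final class GenericArray(values: Array[Any]) {
  def get(i: Int): Any = values(i)                          // general read, no cast
  def getUtf8(i: Int): Utf8 = values(i).asInstanceOf[Utf8]  // typed read, casts the element
  def getInt(i: Int): Int = values(i).asInstanceOf[Int]     // typed read, casts the element
}

object Accessors {
  type Accessor = (GenericArray, Int) => Any

  // Typed accessor for every type: a raw String element fails the cast.
  def strict(dt: DataType): Accessor = dt match {
    case StringType  => (a, i) => a.getUtf8(i)
    case IntegerType => (a, i) => a.getInt(i)
  }

  // The quick fix as modeled here: only StringType falls back to the general `get`.
  def lenient(dt: DataType): Accessor = dt match {
    case StringType => (a, i) => a.get(i)
    case other      => strict(other)
  }
}
```

With an array built from raw strings, `Accessors.lenient(StringType)` returns the elements unchanged, whereas `Accessors.strict(StringType)` throws a `ClassCastException`. All other types keep the typed path, which is why the fallback is limited to a single case.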
[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612965#comment-16612965 ]

Hyukjin Kwon commented on SPARK-25378:
--------------------------------------

{quote}
If it is not public, why didn't we hide it in the first place?
{quote}

Because we already state that the package itself is not meant to be public:
- https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala#L21-L22

These modifiers were removed in SPARK-16813 for this reason.
[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612947#comment-16612947 ]

Wenchen Fan commented on SPARK-25378:
-------------------------------------

[~viirya] Can you take a look and see how hard it is to fix?

After a quick look, I think this works in 2.3 if and only if the `GenericArrayData` is created with `Array[String]` (i.e. a malformed ArrayData) and we wrongly call the `toArray[String](StringType)` method.

A quick solution is to revert SPARK-23875 from 2.4, but then we sacrifice performance to retain a buggy but backward-compatible behavior. So we need to make a trade-off here.
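
For contrast with the malformed case described above, a well-formed call might look like the following sketch. It assumes a Spark 2.4 `spark-shell` (or the catalyst and unsafe jars on the classpath) and relies on internal APIs that may change between releases; the value names `data` and `strings` are made up for the example.

```scala
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

// Store elements as UTF8String, the internal representation that StringType implies,
// instead of raw java.lang.String values.
val data = ArrayData.toArrayData(Array("a", "b").map(s => UTF8String.fromString(s)))

// Read the elements back as UTF8String, then convert to String at the boundary.
val strings: Array[String] = data.toArray[UTF8String](StringType).map(_.toString)
```

Converting at the boundary keeps the array consistent with what the typed accessor expects, so the call does not depend on the element type going unchecked.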
[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612399#comment-16612399 ]

Xiangrui Meng commented on SPARK-25378:
---------------------------------------

Comments from [~vomjom] at https://github.com/tensorflow/ecosystem/pull/100:

{quote}
We currently only do releases along with TensorFlow releases, and the next one that'll include this is TF 1.12.
{quote}

This means Spark+TF users cannot migrate to Spark 2.4 until TF 1.12 is released. I think we need to decide based on the impact instead of just saying "this is not a public API". If it is not public, why didn't we hide it in the first place? And as [~cloud_fan] mentioned, it is hard to implement a data source without touching those "private" APIs.
[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611538#comment-16611538 ]

Hyukjin Kwon commented on SPARK-25378:
--------------------------------------

If there's a simple way to fix it, it might be okay, but it's still not a public API ...