Github user javadba commented on the pull request:
https://github.com/apache/spark/pull/1586#issuecomment-50289795
The updated code was caught by one of the cases in the Hive compatibility
suite. The Hive UDF's length calculation appears to differ from the newly
implemented one, presumably due to differences in character-encoding handling.
For the fix, I will make the length() function use the same character encoding
as Hive does, to keep it compatible. The strlen() method will be the "outlet"
permitting flexible handling of multi-byte character sets in general RDDs (no
strlen method is defined in Hive proper).
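To illustrate the discrepancy, here is a minimal sketch, assuming the failing
row holds a two-character multi-byte string (the sample value and object name
below are hypothetical, not from the test data):

```scala
object LengthDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical sample: two CJK characters, each three bytes in UTF-8.
    val s = "\u4e2d\u56fd"
    val byteLen = s.getBytes("UTF-8").length      // 6 -- what a byte-oriented strlen() reports
    val charLen = s.codePointCount(0, s.length)   // 2 -- a character count, as Hive presumably computes
    println(s"bytes=$byteLen, chars=$charLen")    // prints: bytes=6, chars=2
  }
}
```

This would account for the failure below, where Hive returns 2 and Catalyst
returns 6 for the same value.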
I am going to roll back just the Hive portion of the commit and will
report back by the end of the evening.
udf_length *** FAILED ***
[info] Results do not match for udf_length:
[info] SELECT length(dest1.name) FROM dest1
[info] == Logical Plan ==
[info] Project [Length(name#41188) AS c_0#41186]
[info] MetastoreRelation default, dest1, None
[info]
[info] == Optimized Logical Plan ==
[info] Project [Length(name#41188) AS c_0#41186]
[info] MetastoreRelation default, dest1, None
[info]
[info] == Physical Plan ==
[info] Project [Length(name#41188:0) AS c_0#41186]
[info] HiveTableScan [name#41188], (MetastoreRelation default, dest1, None), None
[info] c_0
[info] !== HIVE - 1 row(s) ==   == CATALYST - 1 row(s) ==
[info] !2                       6 (HiveComparisonTest.scala:366)