MichaelChirico commented on issue #23888: Add a clarifying note to str_to_map documentation
URL: https://github.com/apache/spark/pull/23888#issuecomment-467747788

Right, makes sense. The stuff I cited above (cwiki and both SO answers) are under Hive specifically, no mention of Spark.

As to whether the behavior is _intended_ in Hive, it's unclear. But certainly the [implementation in Hive](https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFStringToMap.java#L103-L106) does the same thing (uses `split` and therefore is regex):

```
String[] keyValuePairs = text.split(delimiter1);
for (String keyValuePair : keyValuePairs) {
  String[] keyValue = keyValuePair.split(delimiter2, 2);
```

Further, here [1](https://github.com/apache/hive/blob/672755d6940923483f8d5a08959e182580edc72c/ql/src/test/results/clientpositive/str_to_map.q.out), [2](https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/ql/src/test/queries/clientpositive/vector_udf1.q), [3](https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/ql/src/test/queries/clientpositive/char_udf1.q), [4](https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/ql/src/test/queries/clientpositive/varchar_udf1.q), [5](https://github.com/apache/hive/blob/83e53972c07df8b7d9a01ad14dda5cb550406e87/ql/src/test/results/clientpositive/llap/varchar_udf1.q.out), [6](https://github.com/apache/hive/blob/f18842e7379060bb2504b0dbae4f4280abba883b/ql/src/test/results/clientpositive/char_udf1.q.out), [7](https://github.com/apache/hive/blob/8b968c7e46929c3af86da46e316faeb8d17f03df/ql/src/test/results/clientpositive/llap/vector_udf1.q.out), [8](https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/ql/src/test/queries/clientpositive/annotate_stats_select.q), [9](https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/ql/src/test/queries/clientpositive/vector_annotate_stats_select.q), [10](https://github.com/apache/hive/blob/672755d6940923483f8d5a08959e182580edc72c/ql/src/test/results/clientpositive/annotate_stats_select.q.out), [11](https://github.com/apache/hive/blob/672755d6940923483f8d5a08959e182580edc72c/ql/src/test/results/clientpositive/llap/vector_annotate_stats_select.q.out) are some tests of `str_to_map` but none of the tests use regex-ambiguous `delimiter{1,2}` values...

My conclusion is that it's eminently ambiguous whether the _intended_ behavior in either Hive or SparkSQL is to treat the delimiters as regular expressions.

BUT the behavior has been around for [8 years](https://github.com/apache/hive/commit/4f8294e578db449294a1186f0ac4efb041445dcb) and at least going off of the SO answers, it seems to be accepted as "known" behavior so things will probably break if we change it.

I'll follow up this PR with one over at Hive to make sure things are in sync.
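
To make the regex point concrete, here is a minimal standalone sketch of the same `split`-based loop. It is not the Hive or Spark source: the class name `StrToMapSketch`, the method `strToMap`, and the handling of a pair with no second delimiter (value set to `null`) are assumptions made for illustration. It shows a regex-ambiguous delimiter (`.`) behaving very differently from its quoted, literal form.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical class, for illustration only; not part of Hive or Spark.
class StrToMapSketch {

    // Same shape as the quoted loop: both delimiters are passed to
    // String.split, which interprets its argument as a regular expression.
    static Map<String, String> strToMap(String text, String delimiter1, String delimiter2) {
        Map<String, String> result = new LinkedHashMap<>();
        String[] keyValuePairs = text.split(delimiter1);
        for (String keyValuePair : keyValuePairs) {
            String[] keyValue = keyValuePair.split(delimiter2, 2);
            // Assumption: a pair without the second delimiter maps to null.
            result.put(keyValue[0], keyValue.length > 1 ? keyValue[1] : null);
        }
        return result;
    }

    public static void main(String[] args) {
        // "." as a regex matches every character, so split returns an
        // empty array and no pairs are produced.
        System.out.println(strToMap("a=1.b=2", ".", "="));                // prints {}

        // Pattern.quote makes the delimiter a literal dot.
        System.out.println(strToMap("a=1.b=2", Pattern.quote("."), "=")); // prints {a=1, b=2}
    }
}
```

A caller who wants literal delimiters under the current behavior has to escape them, for example with `Pattern.quote` in Java or a backslash escape in the SQL string.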
