MichaelChirico commented on issue #23888: Add a clarifying note to str_to_map 
documentation
URL: https://github.com/apache/spark/pull/23888#issuecomment-467747788
 
 
   Right, makes sense. The sources I cited above (the cwiki page and both SO 
answers) are all about Hive specifically, with no mention of Spark.
   
   As to whether the behavior is _intended_ in Hive, it's unclear. But 
certainly the [implementation in 
Hive](https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFStringToMap.java#L103-L106)
 does the same thing (it calls `split`, so both delimiters are interpreted as regular expressions):
   
   ```
       String[] keyValuePairs = text.split(delimiter1);
   
       for (String keyValuePair : keyValuePairs) {
         String[] keyValue = keyValuePair.split(delimiter2, 2);
   ```
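
To make the consequence of those two `split` calls concrete, here is a minimal standalone sketch. It is not the Hive or Spark implementation: the class name `StrToMapSketch`, the method `strToMap`, and the choice to store `null` when a pair has no second delimiter are assumptions made for illustration. It only reuses the two `split` calls quoted above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class StrToMapSketch {
    // Hypothetical helper that reuses the two split calls from the excerpt.
    // Both delimiters go straight to String.split, which compiles them as regexes.
    static Map<String, String> strToMap(String text, String delimiter1, String delimiter2) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String keyValuePair : text.split(delimiter1)) {
            String[] keyValue = keyValuePair.split(delimiter2, 2);
            // Assumption: a pair without a value maps the key to null.
            result.put(keyValue[0], keyValue.length > 1 ? keyValue[1] : null);
        }
        return result;
    }

    public static void main(String[] args) {
        // Plain delimiters: regex and literal interpretations agree.
        System.out.println(strToMap("a:1,b:2", ",", ":"));
        // "|" is a regex metacharacter, so this does not split on the pipe character.
        System.out.println(strToMap("a=1|b=2", "|", "="));
        // Escaping the pipe restores the literal meaning.
        System.out.println(strToMap("a=1|b=2", "\\|", "="));
    }
}
```

With `,` and `:` the two interpretations are indistinguishable, which is why delimiters like these cannot reveal the regex behavior. An unescaped `|` is an empty alternation that matches the empty string, so the input is broken up between characters instead of at the pipe.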
   
   Further, here 
[1](https://github.com/apache/hive/blob/672755d6940923483f8d5a08959e182580edc72c/ql/src/test/results/clientpositive/str_to_map.q.out),
 
[2](https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/ql/src/test/queries/clientpositive/vector_udf1.q),
 
[3](https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/ql/src/test/queries/clientpositive/char_udf1.q),
 
[4](https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/ql/src/test/queries/clientpositive/varchar_udf1.q),
 
[5](https://github.com/apache/hive/blob/83e53972c07df8b7d9a01ad14dda5cb550406e87/ql/src/test/results/clientpositive/llap/varchar_udf1.q.out),
 
[6](https://github.com/apache/hive/blob/f18842e7379060bb2504b0dbae4f4280abba883b/ql/src/test/results/clientpositive/char_udf1.q.out),
 
[7](https://github.com/apache/hive/blob/8b968c7e46929c3af86da46e316faeb8d17f03df/ql/src/test/results/clientpositive/llap/vector_udf1.q.out),
 
[8](https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/ql/src/test/queries/clientpositive/annotate_stats_select.q),
 
[9](https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/ql/src/test/queries/clientpositive/vector_annotate_stats_select.q),
 
[10](https://github.com/apache/hive/blob/672755d6940923483f8d5a08959e182580edc72c/ql/src/test/results/clientpositive/annotate_stats_select.q.out),
 
[11](https://github.com/apache/hive/blob/672755d6940923483f8d5a08959e182580edc72c/ql/src/test/results/clientpositive/llap/vector_annotate_stats_select.q.out)
 are some tests of `str_to_map`, but none of them use a `delimiter1` or 
`delimiter2` value containing regex metacharacters, so they can't tell the two 
interpretations apart...
   
   My conclusion is that it's genuinely unclear whether the _intended_ 
behavior in either Hive or SparkSQL is to treat the delimiters as regular 
expressions.
   
   BUT the behavior has been around for [8 
years](https://github.com/apache/hive/commit/4f8294e578db449294a1186f0ac4efb041445dcb)
 and, at least going by the SO answers, it seems to be accepted as "known" 
behavior, so things will probably break if we change it.
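
For anyone who needs a delimiter to be taken literally while the regex behavior stays as it is, one option on the Java side is to quote the delimiter before splitting. This is a sketch only: `LiteralDelimiterSketch` and `splitLiteral` are hypothetical names, and nothing here is part of Hive or Spark. `Pattern.quote` is the standard `java.util.regex` method.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class LiteralDelimiterSketch {
    // Hypothetical helper: quote the delimiter so split treats it as plain text.
    static String[] splitLiteral(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        // The pipe is matched as a literal character here.
        System.out.println(Arrays.toString(splitLiteral("a=1|b=2", "|")));
        // Without quoting, "." matches every character and nothing useful is left.
        System.out.println(Arrays.toString("a.b.c".split(".")));
        System.out.println(Arrays.toString(splitLiteral("a.b.c", ".")));
    }
}
```

In a SQL query the equivalent would presumably be to escape the metacharacter in the delimiter argument, which is what the SO answers appear to rely on.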
   
   I'll follow up this PR with one over at Hive to make sure things are in sync.
