jorisvandenbossche commented on a change in pull request #8990:
URL: https://github.com/apache/arrow/pull/8990#discussion_r601492842



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -428,6 +433,26 @@ TYPED_TEST(TestStringKernels, 
StrptimeDoesNotProvideDefaultOptions) {
   ASSERT_RAISES(Invalid, CallFunction("strptime", {input}));
 }
 
+TYPED_TEST(TestStringKernels, Join) {
+  auto separator = this->scalar("--");
+  CheckScalarBinary(
+      "binary_join", list(this->type()),
+      R"([["a", "bb", "ccc"], [], null, ["dd"], ["eee", null], ["ff", ""]])", 
separator,
+      this->type(), R"(["a--bb--ccc", "", null, "dd", null, "ff--"])");

Review comment:
       Looking at how pandas / R does this right now:
   
   Pandas actually results in a missing value:
   
   ```
   In [7]: s = pd.Series([["a", None, "b"], ["c", "d", "e"]])
   
   In [8]: s.str.join("-")
   Out[8]: 
   0      NaN
   1    c-d-e
   dtype: object
   ```
   
   but I doubt that this is necessarily intentional (when it comes to missing 
values). The docstring basically says that if there any non-string element in a 
list, the result is NaN for that list. And I assume this is just because in 
that case the stdlib `"-".join(..)` which is used under the hood fails.
   
   For R, I am not fully sure what's the most "typical" way to do this, but 
with the basic `paste`, you actually get the string representation of NA:
   
   ```
   > paste("a", NA, "b", sep="-")
   [1] "a-NA-b"
   ```
   
   (but `paste` is more general in converting input to strings first). 
   Using a more specific string function from `stringr`, it actually results in 
NA:
   
   ```
   > str_c("a", "b", sep="-")
   [1] "a-b"
   > str_c("a", NA, "b", sep="-")
   [1] NA
   ```
   
   where the docstring explicitly mentions this (*"Like most other R functions, 
missing values are "infectious": whenever a missing value is combined with 
another string the result will always be missing"*). So that could be an 
argument for returning null. But since in Arrow the default is that nulls are 
*not* infectious, it could also be an argument to skip it ;)
   
   cc @ianmcook 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to