wesm commented on a change in pull request #7611:
URL: https://github.com/apache/arrow/pull/7611#discussion_r449032259



##########
File path: r/src/array_from_vector.cpp
##########
@@ -1150,6 +1150,25 @@ std::shared_ptr<arrow::DataType> InferArrowTypeFromVector<REALSXP>(SEXP x) {
   return float64();
 }
 
+template <>
+std::shared_ptr<arrow::DataType> InferArrowTypeFromVector<STRSXP>(SEXP x) {
+  // See how big the character vector is
+  R_xlen_t n = XLENGTH(x);
+  int64_t size = 0;
+  for (R_xlen_t i = 0; i < n; i++) {
+    SEXP string_i = STRING_ELT(x, i);
+    if (string_i != NA_STRING) {
+      size += XLENGTH(Rf_mkCharCE(Rf_translateCharUTF8(string_i), CE_UTF8));
+    }
+    if (size > 2147483646) {
+      // Exceeds 2GB capacity of utf8 type, so use large
+      return large_utf8();
+    }
+  }
+
+  return utf8();
+}

Review comment:
       This is very concerning from a performance perspective -- particularly having to call the UTF-8 conversion functions more than once per string.
   
   I can spend some time working on this issue -- I can do a few things:
   
   * Only use these (presumably) expensive functions like `Rf_translateCharUTF8` when the string data is not already UTF-8 (we have a fast ValidateUTF8 function that we can use). I'll need to check the performance of converting ASCII data with various approaches
   * Use a single conversion path for utf8/large_utf8 and only switch from the 32-bit to the 64-bit path (i.e. from `TypedBufferBuilder<int32_t>` to `TypedBufferBuilder<int64_t>`) when hitting the memory limit




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

