alamb commented on code in PR #14211:
URL: https://github.com/apache/datafusion/pull/14211#discussion_r1927810712


##########
datafusion/functions/Cargo.toml:
##########
@@ -82,9 +82,11 @@ hex = { version = "0.4", optional = true }
 itertools = { workspace = true }
 log = { workspace = true }
 md-5 = { version = "^0.10.0", optional = true }
+memchr = { version = "2.7.4", optional = true }
 rand = { workspace = true }
 regex = { workspace = true, optional = true }
 sha2 = { version = "^0.10.1", optional = true }
+stringzilla = { version = "3", optional = true }

Review Comment:
   I also think @samuelcolvin 's experience was that the different 
implemnetations were faster/slower on some architectures / cahce sizes. So 
unless we have significant evidence this is faster for most/all platforms I 
would be hesitant to include it



##########
datafusion/functions/benches/contains.rs:
##########
@@ -0,0 +1,162 @@
+// Licensed to the Apache Software Foundation (ASF) under one

Review Comment:
   Would it be possible to break this benchmark into its own PR?
   
   This is mostly a selfish ask as I have comparison scripts for performance 
that compare performance against main, and they only work if the benchmark is 
on main as well



##########
.github/workflows/rust.yml:
##########
@@ -286,11 +286,17 @@ jobs:
         uses: ./.github/actions/setup-builder
         with:
           rust-version: stable
+      - name: Install emscripten dependency

Review Comment:
   what is this needed for?



##########
datafusion/functions/Cargo.toml:
##########
@@ -82,9 +82,11 @@ hex = { version = "0.4", optional = true }
 itertools = { workspace = true }
 log = { workspace = true }
 md-5 = { version = "^0.10.0", optional = true }
+memchr = { version = "2.7.4", optional = true }
 rand = { workspace = true }
 regex = { workspace = true, optional = true }
 sha2 = { version = "^0.10.1", optional = true }
+stringzilla = { version = "3", optional = true }

Review Comment:
   
   My biggest concern is these new library dependencies as our dependency. I 
haven't looked at it in detail but I think we need to evaluate how likely it is 
for it to be maintained, etc
   
   https://github.com/ashvardanian/stringzilla looks pretty solild though after 
a quick look
   
   Also, I think we should be planning to upstream as many of the low level 
operations (like string contains) as possible to arrow-rs (so we can have a 
larger user base to help improve them)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to