alamb commented on code in PR #14211:
URL: https://github.com/apache/datafusion/pull/14211#discussion_r1927810712
##########
datafusion/functions/Cargo.toml:
##########
@@ -82,9 +82,11 @@ hex = { version = "0.4", optional = true }
itertools = { workspace = true }
log = { workspace = true }
md-5 = { version = "^0.10.0", optional = true }
+memchr = { version = "2.7.4", optional = true }
rand = { workspace = true }
regex = { workspace = true, optional = true }
sha2 = { version = "^0.10.1", optional = true }
+stringzilla = { version = "3", optional = true }
Review Comment:
I also think @samuelcolvin 's experience was that the different
implemnetations were faster/slower on some architectures / cahce sizes. So
unless we have significant evidence this is faster for most/all platforms I
would be hesitant to include it
##########
datafusion/functions/benches/contains.rs:
##########
@@ -0,0 +1,162 @@
+// Licensed to the Apache Software Foundation (ASF) under one
Review Comment:
Would it be possible to break this benchmark into its own PR?
This is mostly a selfish ask as I have comparison scripts for performance
that compare performance against main, and they only work if the benchmark is
on main as well
##########
.github/workflows/rust.yml:
##########
@@ -286,11 +286,17 @@ jobs:
uses: ./.github/actions/setup-builder
with:
rust-version: stable
+ - name: Install emscripten dependency
Review Comment:
what is this needed for?
##########
datafusion/functions/Cargo.toml:
##########
@@ -82,9 +82,11 @@ hex = { version = "0.4", optional = true }
itertools = { workspace = true }
log = { workspace = true }
md-5 = { version = "^0.10.0", optional = true }
+memchr = { version = "2.7.4", optional = true }
rand = { workspace = true }
regex = { workspace = true, optional = true }
sha2 = { version = "^0.10.1", optional = true }
+stringzilla = { version = "3", optional = true }
Review Comment:
My biggest concern is these new library dependencies as our dependency. I
haven't looked at it in detail but I think we need to evaluate how likely it is
for it to be maintained, etc
https://github.com/ashvardanian/stringzilla looks pretty solild though after
a quick look
Also, I think we should be planning to upstream as many of the low level
operations (like string contains) as possible to arrow-rs (so we can have a
larger user base to help improve them)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]