alamb commented on code in PR #14211: URL: https://github.com/apache/datafusion/pull/14211#discussion_r1927810712
########## datafusion/functions/Cargo.toml: ########## @@ -82,9 +82,11 @@ hex = { version = "0.4", optional = true } itertools = { workspace = true } log = { workspace = true } md-5 = { version = "^0.10.0", optional = true } +memchr = { version = "2.7.4", optional = true } rand = { workspace = true } regex = { workspace = true, optional = true } sha2 = { version = "^0.10.1", optional = true } +stringzilla = { version = "3", optional = true } Review Comment: I also think @samuelcolvin 's experience was that the different implemnetations were faster/slower on some architectures / cahce sizes. So unless we have significant evidence this is faster for most/all platforms I would be hesitant to include it ########## datafusion/functions/benches/contains.rs: ########## @@ -0,0 +1,162 @@ +// Licensed to the Apache Software Foundation (ASF) under one Review Comment: Would it be possible to break this benchmark into its own PR? This is mostly a selfish ask as I have comparison scripts for performance that compare performance against main, and they only work if the benchmark is on main as well ########## .github/workflows/rust.yml: ########## @@ -286,11 +286,17 @@ jobs: uses: ./.github/actions/setup-builder with: rust-version: stable + - name: Install emscripten dependency Review Comment: what is this needed for? ########## datafusion/functions/Cargo.toml: ########## @@ -82,9 +82,11 @@ hex = { version = "0.4", optional = true } itertools = { workspace = true } log = { workspace = true } md-5 = { version = "^0.10.0", optional = true } +memchr = { version = "2.7.4", optional = true } rand = { workspace = true } regex = { workspace = true, optional = true } sha2 = { version = "^0.10.1", optional = true } +stringzilla = { version = "3", optional = true } Review Comment: My biggest concern is these new library dependencies as our dependency. I haven't looked at it in detail but I think we need to evaluate how likely it is for it to be maintained, etc https://github.com/ashvardanian/stringzilla looks pretty solild though after a quick look Also, I think we should be planning to upstream as many of the low level operations (like string contains) as possible to arrow-rs (so we can have a larger user base to help improve them) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org