seddonm1 commented on pull request #9428: URL: https://github.com/apache/arrow/pull/9428#issuecomment-793074013
Hi @sweb,

Yesterday I opened my PR for `regexp_replace` (and others); see https://github.com/apache/arrow/pull/9654/files#diff-c122a83600dc86aa69067fdfdca1e0349616dfae73eaad8a8c90d5e69dbf7a3c. You can see there how I am passing in the values. I think there is an opportunity to use more `lazy_static` to precompile any standard regex in that file and reduce runtime cost, but we need to see all the functions before doing too much optimisation.

The way I have done the `regexp_replace` code (and I am not saying it is the best way) reflects the fact that each row can potentially be different, since any argument to a function in Postgres can be supplied by referencing a column. I have tried to balance that cost by memoizing the Regex objects. I wrote a lot of this code before @jorgecarleitao made some large changes in `functions.rs` relating to Scalar vs Columnar, so there may be a second pass to optimise this once we get basic functionality working.

I think `regexp_match` is the way to go (as per Postgres), which returns a list of string values. We will then need to look at the sqlparser to add the ability to 'extract' values from the list by index: `[0]` (I was thinking of doing this soon). I need to do some experimenting in Postgres to fully understand the behaviour (for example, what happens if you reference a non-existent index).
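To make the memoization idea concrete, here is a minimal sketch of caching one compiled object per distinct pattern while walking a column of per-row patterns. It is not the code from the PR. `Compiled`, `PatternCache` and `count_compilations` are hypothetical names, and `Compiled` is a standard-library-only stand-in: in real code the cached value would be a `regex::Regex` built with `Regex::new(pattern)`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a compiled pattern. In real code this would be
/// `regex::Regex`, and building it is the expensive step we want to avoid repeating.
#[derive(Debug, Clone, PartialEq)]
struct Compiled {
    pattern: String,
}

/// Caches one compiled object per distinct pattern string.
struct PatternCache {
    compiled: HashMap<String, Compiled>,
    /// Number of times a pattern actually had to be compiled.
    compile_calls: usize,
}

impl PatternCache {
    fn new() -> Self {
        PatternCache {
            compiled: HashMap::new(),
            compile_calls: 0,
        }
    }

    /// Returns the cached object for `pattern`, compiling it on first use only.
    fn get(&mut self, pattern: &str) -> &Compiled {
        if !self.compiled.contains_key(pattern) {
            self.compile_calls += 1;
            self.compiled.insert(
                pattern.to_string(),
                Compiled {
                    pattern: pattern.to_string(),
                },
            );
        }
        &self.compiled[pattern]
    }
}

/// Walks a column of per-row patterns and reports how many compilations were needed.
fn count_compilations(patterns: &[&str]) -> usize {
    let mut cache = PatternCache::new();
    for p in patterns {
        // A real kernel would apply the compiled pattern to the row's value here.
        let _ = cache.get(p);
    }
    cache.compile_calls
}
```

With this shape, a column in which every row carries the same pattern costs one compilation, while a column of all-distinct patterns still compiles once per row, which is the trade-off being balanced.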