findepi commented on issue #14777: URL: https://github.com/apache/datafusion/issues/14777#issuecomment-2669313528
> can we provide a home for non-core functions where the community could maintain them outside of DataFusion core?

You mean something like https://github.com/datafusion-contrib/datafusion-functions-extra? It has some downsides too (being non-Apache limits contributions from corporations, and it has an unpredictable release cycle). That's not where I personally feel encouraged to contribute. But I'm not advocating for an "open for all" approach either.

> New functions will only be accepted in DataFusion if they fill in a gap compared to Postgresql or fill a gap identified by the community compared to an alternative systems such as DuckDB. In the later case the functions should be contributed as a group within an epic that fills out the specified gap (for example, the union* functions in DuckDB), not single functions coming in piecemeal.

I agree it makes sense to add functions that close a functionality gap relative to established popular systems. What exactly is an "established popular system"? For every potential contributor it will be _their system of choice_, so we need to apply some judgement, perhaps based on "aggregated request rate" or "common themes".

Spark compatibility is so clearly a common theme that it makes sense to maintain Spark functions as part of this repo. PostgreSQL, being our reference implementation: same. DuckDB, being our look-up-to role model for arrays: same. I guess there would be a few more on this list where we can expect some low-profile but sustained interest.

> A new apache repository is setup (datafusion-additional-functions ?) where we provide the framework for adding, testing and packaging new functions but with the explicitly stated understanding that the maintenance of any functions contained have lower maintenance priority in the DataFusion team and releases may or may not coincide with DataFusion releases.

Apache projects need to have PMC members who do the releases and vote on them.
It's their duty to do releases, with all the burden this entails (https://github.com/apache/datafusion/issues/14428).

Is the release burden proportional to the amount of code being shipped? Is it proportional to the number of releases being made? Can the burden be reduced to pure automation? It's the 21st century...

I would prefer separate crates within this repository for the most popular function collections. No more than 6.

Alternatively, we could have separate repositories for each collection (including Spark's), so that interested community members can step up, review, and eventually become subproject maintainers.

---

We won't really know what the actual cost is until we try things out. So whatever model we feel is best, we should try it out with an open mind, accepting that we can change it later.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org