Re: [I] Discussion: what new functions should and should not be accepted into DataFusion [datafusion]

via GitHub Wed, 19 Feb 2025 09:51:10 -0800


findepi commented on issue #14777:
URL: https://github.com/apache/datafusion/issues/14777#issuecomment-2669313528


   > can we provide a home for non-core functions where the community could 
maintain them outside of DataFusion core?
   
   You mean something like 
https://github.com/datafusion-contrib/datafusion-functions-extra?
   It has some downsides too (being non-Apache limits contribution from 
corporations; it has unpredictable release cycles).
   That's not where I personally feel encouraged to contribute.
   
   But I'm not advocating for an "open for all" approach either.
   
   > New functions will only be accepted in DataFusion if they fill in a gap 
compared to Postgresql or fill a gap identified by the community compared to an 
alternative systems such as DuckDB. In the later case the functions should be 
contributed as a group within an epic that fills out the specified gap (for 
example, the union* functions in DuckDB), not single functions coming in 
piecemeal.
   
   I agree it makes sense to add functions that close a functionality gap with 
established popular systems.
   What exactly is an "established popular system"? For every potential 
contributor it will be _their system of choice_, so we need to apply some 
judgement. Perhaps based on "aggregated request rate" or "common themes".
   Spark-compatibility is so clearly a common theme that it makes sense to 
maintain Spark functions as part of this repo.
   PostgreSQL being our reference implementation - same.
   DuckDB being our look-up-to role model for arrays - same.
   I guess there would be a few more on this list where we can expect some 
low-profile but sustained interest.
   
   > A new apache repository is setup (datafusion-additional-functions ?) where 
we provide the framework for adding, testing and packaging new functions but 
with the explicitly stated understanding that the maintenance of any functions 
contained have lower maintenance priority in the DataFusion team and releases 
may or may not coincide with DataFusion releases.
   
   Apache projects need to have PMC members who do the releases and vote on 
them. It's their duty to do releases with all the burden this entails 
(https://github.com/apache/datafusion/issues/14428).
   
   Is the release burden proportional to the amount of code being shipped? Is 
it proportional to the number of releases being made? Can the burden be 
minimized into pure automation? It's the 21st century...
   
   I would prefer separate crates within this repository for the most popular 
function collections. No more than 6.
   Alternatively, we could have separate repositories for each collection 
(including Spark's), so that interested community members can step up and 
review and eventually become subproject maintainers.
   
   ---
   
   We won't really know what the actual cost is until we try things out. So 
whatever we feel is the best model, we should try it out with an open mind, 
accepting that we can change it later.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

