Hi folks,

I'd like to start a discussion on whether we should add a page to the Iceberg documentation describing expectations around AI-generated contributions.
This topic has recently been discussed on the Arrow dev mailing list[1]. In addition, the iceberg-cpp project has already taken a step in this direction by introducing AI-related contribution guidelines[2]. After a brief discussion on the iceberg-cpp PR with Fokko, Gang, and Kevin, we felt it would be worthwhile to raise this topic more broadly within the Iceberg community.

The ASF already provides high-level guidance on the use of generative AI tools, primarily focused on licensing and IP considerations[3]. As AI-assisted development and so-called "vibe coding" become more common, thoughtful use of these tools can be beneficial. However, if the contributing author appears not to have engaged deeply with the code or cannot respond to review feedback, this can significantly increase maintainer burden and make the review process less collaborative.

Having documented guidelines would give maintainers a clear reference point when evaluating such contributions (including when deciding to close a PR), and would also make it easier to assess whether a contributor has made a reasonable effort to meet project expectations.

I've pulled together some guidelines from the iceberg-cpp PR and the discussion on the Arrow dev ML, hoping to kick off a broader conversation about what should go into Iceberg's AI-generated contribution guidelines.

-----

We are not opposed to the use of AI tools in generating PRs, but we recommend that contributors adhere to the following principles:

- **Understand the core ideas** behind the implementation **end-to-end**, and be able to justify the design and code during review.
- **Call out unknowns and assumptions**. It's okay to not fully understand some bits of AI-generated code. You should comment on these cases and point them out to reviewers so that they can use their knowledge of the codebase to clear up any concerns. For example, you might comment "calling this function here seems to work but I'm not familiar with how it works internally, I wonder if there's a race condition if it is called concurrently".
- Only submit a PR if you are able to debug, explain, and take ownership of the changes.
- Ensure the PR title and description match the style, level of detail, and tone of other Iceberg PRs.
- Follow coding conventions used in the rest of the codebase.
- Be upfront about AI usage, including a brief summary of which parts were AI-generated.
- Reference any sources that guided your changes (e.g. "took a similar approach to #XXXX").

-----

Looking forward to hearing your thoughts.

[1] https://lists.apache.org/thread/fyn1r3hjd3cs48n2svxg7lj0zps52bvr
[2] https://github.com/apache/iceberg-cpp/pull/531
[3] https://www.apache.org/legal/generative-tooling.html

--
Regards
Junwang Zhao
