> It is perfectly possible that the AI will largely or completely reproduce
> some existing GPL code for A, from its training data. There is no way that I
> could know that the AI has done that without some substantial research.
Even if it did, what if the common code were arrived at independently,
e.g. it wasn't used in training?
Searching by approximate text match seems to cover similarity, though
a legal standard for similarity may be required for this purpose. Aside
from ML, the method I'm familiar with uses cosine similarity on an
N-dimensional vector representing counts of, say, all 5-char sequences
in the text, where N becomes ~26^5. Any licensed code would be
fingerprinted and checked for license status before being added to an
official database.
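
A minimal sketch of the 5-char-sequence comparison described above, in
Python. The function names and the use of collections.Counter as a
sparse vector are assumptions made for illustration, not an existing
tool. Storing only the sequences that actually occur avoids building
the full ~26^5-dimensional vector.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 5) -> Counter:
    """Count every overlapping n-character sequence in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    if not a or not b:
        # Text shorter than n characters yields no n-grams.
        return 0.0
    # Only n-grams present in both texts contribute to the dot product.
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


def similarity(text_a: str, text_b: str, n: int = 5) -> float:
    """Hypothetical helper: similarity of two texts by n-gram counts."""
    return cosine_similarity(char_ngrams(text_a, n), char_ngrams(text_b, n))
```

A score near 1 means the two texts share most of their 5-char
sequences; where to set the cutoff for "too similar" is the open
question, and this sketch does not answer it.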
Bill
--
Phobrain.com
On 2024-07-04 03:50, Matthew Brett wrote:
> Sorry - reposting from my subscribed address:
>
> Hi,
>
> Sorry to top-post! But - I wanted to bring the discussion back to
> licensing. I have great sympathy for the ecological and code-quality
> concerns, but licensing is a separate question, and, it seems to me,
> an urgent question.
>
> Imagine I asked some AI to give me code to replicate a particular algorithm A.
>
> It is perfectly possible that the AI will largely or completely
> reproduce some existing GPL code for A, from its training data. There
> is no way that I could know that the AI has done that without some
> substantial research. Surely, this is a license violation of the GPL
> code? Let's say we accept that code. Others pick up the code and
> modify it for other algorithms. The code-base gets infected with GPL
> code, in a way that will make it very difficult to disentangle.
>
> Have we consulted a copyright lawyer on this? Specifically, have we
> consulted someone who advocates the GPL?
>
> Cheers,
>
> Matthew
>
> On Thu, Jul 4, 2024 at 11:27 AM Marten van Kerkwijk
> <m...@astro.utoronto.ca> wrote:
> Hi All,
>
> I agree with Dan that the actual contributions to the documentation are
> of little value: it is not easy to write good documentation, with
> examples that show not just the mechanics but the purpose of the
> function, i.e., go well beyond just showing some random inputs and
> outputs. And poorly constructed examples are detrimental in that they
> just hide the fact that the documentation is bad.
>
> I also second his worries about ecological and social costs.
>
> But let me add a third issue: the costs to maintainers. I had a quick
> glance at some of those PRs when they were first posted, but basically
> decided they were not worth my time to review. For a human contributor,
> I might well have decided differently, since helping someone to improve
> their contribution often leads to higher quality further contributions.
> But here there seems to be no such hope.
>
> All the best,
>
> Marten
>
> Daniele Nicolodi <dani...@grinta.net> writes:
>
> On 03/07/24 23:40, Matthew Brett wrote:
>
> Hi,
>
> We recently got a set of well-labeled PRs containing (reviewed)
> AI-generated code:
>
> https://github.com/numpy/numpy/pull/26827
> https://github.com/numpy/numpy/pull/26828
> https://github.com/numpy/numpy/pull/26829
> https://github.com/numpy/numpy/pull/26830
> https://github.com/numpy/numpy/pull/26831
>
> Do we have a policy on AI-generated code? It seems to me that
> AI-code in general must be a license risk, as the AI may well generate
> code that was derived from, for example, code with a GPL-license.
> There is definitely the issue of copyright to keep in mind, but I see
> two other issues: the quality of the contributions and one moral issue.
>
> IMHO the PRs linked above are not high quality contributions: for
> example, the added examples are often redundant with each other. In my
> experience these are representative of automatically generated content:
> as there is little to no effort involved in writing it, the content is
> often repetitive and with very low information density. In the case of
> documentation, I find this very detrimental to the overall quality.
>
> Contributions generated with AI have huge ecological and social costs.
> Encouraging AI generated contributions, especially where there is
> absolutely no need to involve AI to get to the solution, as in the
> examples above, makes the project co-responsible for these costs.
>
> Cheers,
> Dan
>
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: m...@astro.utoronto.ca