Hi there,

I'd like to draw your attention to a proposal being discussed among pandas
developers regarding copy-on-write semantics.

A very short summary of the proposal, according to the document
<https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit#>,
is:

*- The result of any indexing operation (subsetting a DataFrame or Series
in any way, i.e. including accessing a DataFrame column as a Series) or any
method returning a new DataFrame or Series, always behaves as if it were a
copy in terms of user API.*
*- We implement Copy-on-Write (as implementation detail). This way, we can
actually use views as much as possible under the hood, while ensuring the
user API behaves as a copy.*
*- As a consequence, if you want to modify an object (DataFrame or Series),
the only way to do this is to modify that object itself directly.*



*This addresses multiple aspects: 1) a clear and consistent user API (a
clear rule: any subset or returned series/dataframe always behaves as a
copy of the original, and thus never modifies the original) and 2)
improving performance by avoiding excessive copies (eg a chained method
workflow would no longer return an actual data copy at each step). Because
every single indexing step behaves as a copy, this also means that with
this proposal, “chained assignment” (with multiple setitem steps) will
never work.*
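
To make the semantics concrete, here is a toy sketch in plain Python. It is
not pandas code and does not reflect how pandas would implement this; the
CowArray class and its methods are hypothetical names I made up for
illustration. It only shows the idea: a subset shares memory with its parent
until one of them is written to, and a write through a temporary (the
analogue of chained assignment) never reaches the original.

```python
class CowArray:
    """Toy copy-on-write container (hypothetical, for illustration only)."""

    def __init__(self, values):
        self._buf = list(values)
        # One-element list shared by every object that references _buf,
        # so all of them see the same reference count.
        self._refs = [1]

    def view(self):
        """Return a new object sharing the buffer; no data is copied yet."""
        other = CowArray.__new__(CowArray)
        other._buf = self._buf
        other._refs = self._refs
        self._refs[0] += 1
        return other

    def shares_memory_with(self, other):
        return self._buf is other._buf

    def __setitem__(self, index, value):
        if self._refs[0] > 1:
            # The buffer is shared: detach by taking a private copy first.
            self._refs[0] -= 1
            self._buf = list(self._buf)
            self._refs = [1]
        self._buf[index] = value

    def to_list(self):
        return list(self._buf)


original = CowArray([1, 2, 3])
subset = original.view()
shared_before_write = subset.shares_memory_with(original)

# Writing to the subset triggers the copy; the original is untouched.
subset[0] = 100
shared_after_write = subset.shares_memory_with(original)

# Analogue of chained assignment: the write lands on a temporary object.
original.view()[1] = 99

# Modifying the object itself directly is the only way to change it.
original[2] = 30

print(original.to_list())
print(subset.to_list())
```

With this sketch, the data is shared until the first write, the write to
subset leaves original alone, and the chained-style write is lost, which
matches the rule in the summary that you have to modify the object itself.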

You can also read the related discussion on the pandas mailing list here
<https://mail.python.org/pipermail/pandas-dev/2021-July/001358.html>. It
would be good for us to think through the implications of this proposal for
our work on supporting pandas DataFrames.

Cheers,
Adrin
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
