That sounds like a good solution. Having the zero-copy behavior depend on whether there happens to be only one column of a certain type might lead to surprising results. To avoid yet another keyword, only doing it when split_blocks=True sounds good to me (in practice, that is also when it will happen most often, except for very narrow dataframes with only a few columns).
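For illustration, a minimal sketch of how that would look from the user's side (this assumes the split_blocks option of Table.to_pandas; the writability checks are only there to show the difference, and whether a column actually ends up read-only depends on whether zero copy was possible for it, e.g. no nulls):

    import pyarrow as pa

    table = pa.table({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

    # Default conversion: values are copied into consolidated pandas
    # blocks, so the resulting columns are backed by writable memory.
    df = table.to_pandas()
    print(df["a"].values.flags.writeable)        # True

    # With split_blocks=True every column gets its own block, which is
    # what makes zero copy possible in the first place; a zero-copy
    # column stays backed by Arrow memory and is therefore read-only.
    df_split = table.to_pandas(split_blocks=True)
    print(df_split["a"].values.flags.writeable)  # False if zero copy was used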
Joris

On Thu, 16 Jan 2020 at 22:44, Wes McKinney <wesmck...@gmail.com> wrote:
> hi Joris,
>
> Thanks for investigating this. It seems there were some unintended
> consequences of the zero-copy optimizations from ARROW-3789. Another
> way forward might be to "opt in" to this behavior, or to only do the
> zero copy optimizations when split_blocks=True. What do you think?
>
> - Wes
>
> On Thu, Jan 16, 2020 at 3:42 AM Joris Van den Bossche
> <jorisvandenboss...@gmail.com> wrote:
> >
> > So the spark integration build started to fail, and with the following test
> > error:
> >
> > ======================================================================
> > ERROR: test_toPandas_batch_order
> > (pyspark.sql.tests.test_arrow.EncryptionArrowTests)
> > ----------------------------------------------------------------------
> > Traceback (most recent call last):
> >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 422, in test_toPandas_batch_order
> >     run_test(*case)
> >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 409, in run_test
> >     pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
> >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 152, in _toPandas_arrow_toggle
> >     pdf_arrow = df.toPandas()
> >   File "/spark/python/pyspark/sql/pandas/conversion.py", line 115, in toPandas
> >     return _check_dataframe_localize_timestamps(pdf, timezone)
> >   File "/spark/python/pyspark/sql/pandas/types.py", line 180, in _check_dataframe_localize_timestamps
> >     pdf[column] = _check_series_localize_timestamps(series, timezone)
> >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py", line 3487, in __setitem__
> >     self._set_item(key, value)
> >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py", line 3565, in _set_item
> >     NDFrame._set_item(self, key, value)
> >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/generic.py", line 3381, in _set_item
> >     self._data.set(key, value)
> >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1090, in set
> >     blk.set(blk_locs, value_getitem(val_locs))
> >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 380, in set
> >     self.values[locs] = values
> > ValueError: assignment destination is read-only
> >
> >
> > It's from a test that is doing conversions from spark to arrow to pandas
> > (so calling pyarrow.Table.to_pandas here
> > <https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/conversion.py#L111-L115>),
> > and on the resulting DataFrame, it is iterating through all columns,
> > potentially fixing timezones, and writing each column back into the
> > DataFrame (here
> > <https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/types.py#L179-L181>).
> >
> > Since it is giving an error about read-only, it might be related to
> > zero-copy behaviour of to_pandas, and thus might be related to the refactor
> > of the arrow->pandas conversion that landed yesterday
> > (https://github.com/apache/arrow/pull/6067, it says it changed to do
> > zero-copy for 1-column blocks if possible).
> > I am not sure if something should be fixed in pyarrow for this, but the
> > obvious thing that pyspark can do is specify they don't want zero-copy.
> >
> > Joris
> >
> > On Wed, 15 Jan 2020 at 14:32, Crossbow <cross...@ursalabs.org> wrote:
> > >
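To make the failure mode in the quoted traceback concrete, here is a small self-contained sketch (not Spark's actual code) of what goes wrong when columns of a DataFrame backed by read-only, zero-copy arrays are written back in place, and the copy-based workaround a consumer could use; whether the in-place assignment path is taken depends on the pandas version:

    import numpy as np
    import pandas as pd

    # Stand-in for a zero-copy to_pandas result: the DataFrame's block is
    # a view of a read-only array.
    values = np.arange(6.0).reshape(3, 2)
    values.setflags(write=False)
    pdf = pd.DataFrame(values, columns=["a", "b"], copy=False)

    try:
        for column in pdf.columns:
            # pyspark writes each (possibly timezone-fixed) column back;
            # on the pandas version in the traceback this assigns into
            # the existing read-only block.
            pdf[column] = pdf[column]
    except ValueError as exc:
        print(exc)   # "assignment destination is read-only"

    # One possible workaround on the consumer side: deep-copy first so
    # the blocks are writable (or avoid the zero-copy conversion at all).
    pdf = pdf.copy()
    for column in pdf.columns:
        pdf[column] = pdf[column]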