Thanks for investigating this and for the quick fix, Joris and Wes! I just
have a couple of questions about the behavior observed here. The pyspark
code either assigns the same series back to the pandas.DataFrame, or
assigns a modified copy if the column is a timestamp. In the case where
there are no timestamps, could this make extra copies, or prevent taking
advantage of the new zero-copy features in pyarrow? For the case of
timestamp columns that do need to be modified, is there a more efficient
way to create a new dataframe with copies of only the modified series?
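
To make that second question concrete, here is a rough, untested sketch of
what I have in mind (it reuses the _check_series_localize_timestamps helper
from pyspark/sql/pandas/types.py; the function name is just for
illustration):

    import pandas as pd

    def localize_timestamps_rebuild(pdf, timezone):
        # Build a new DataFrame from the per-column results instead of
        # assigning back into pdf, so the (possibly read-only, zero-copy)
        # original blocks are never written to. Unmodified series are
        # reused as-is; only timestamp columns come back as new objects.
        columns = {
            name: _check_series_localize_timestamps(series, timezone)
            for name, series in pdf.items()
        }
        return pd.DataFrame(columns, index=pdf.index)

(pandas may still copy during block consolidation when building the new
frame, so this avoids writing into read-only blocks but not necessarily all
copies.)

Thanks!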

Bryan

On Thu, Jan 16, 2020 at 11:48 PM Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:

> That sounds like a good solution. Having the zero-copy behavior depend
> on whether you have only one column of a certain type might lead to
> surprising results. To avoid yet another keyword, only doing it when
> split_blocks=True sounds good to me (in practice, that's also when it
> will mostly happen, except for very narrow dataframes with only a few
> columns).
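>
> To make that concrete, a small sketch of the behaviour as proposed
> (whether the split case is actually zero-copy still depends on the data,
> e.g. nulls force a conversion):
>
>     import pyarrow as pa
>
>     table = pa.table({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
>
>     # Default: consolidated blocks, data is copied, always writable.
>     pdf = table.to_pandas()
>
>     # Opt-in: one block per column, zero-copy where possible, so
>     # columns may be backed by read-only Arrow memory.
>     pdf_split = table.to_pandas(split_blocks=True)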
>
> Joris
>
> On Thu, 16 Jan 2020 at 22:44, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi Joris,
> >
> > Thanks for investigating this. It seems there were some unintended
> > consequences of the zero-copy optimizations from ARROW-3789. Another
> > way forward might be to "opt in" to this behavior, or to only do the
> > zero copy optimizations when split_blocks=True. What do you think?
> >
> > - Wes
> >
> > On Thu, Jan 16, 2020 at 3:42 AM Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
> > >
> > > So the Spark integration build started to fail, with the following
> > > test error:
> > >
> > > ======================================================================
> > > ERROR: test_toPandas_batch_order (pyspark.sql.tests.test_arrow.EncryptionArrowTests)
> > > ----------------------------------------------------------------------
> > > Traceback (most recent call last):
> > >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 422, in test_toPandas_batch_order
> > >     run_test(*case)
> > >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 409, in run_test
> > >     pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
> > >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 152, in _toPandas_arrow_toggle
> > >     pdf_arrow = df.toPandas()
> > >   File "/spark/python/pyspark/sql/pandas/conversion.py", line 115, in toPandas
> > >     return _check_dataframe_localize_timestamps(pdf, timezone)
> > >   File "/spark/python/pyspark/sql/pandas/types.py", line 180, in _check_dataframe_localize_timestamps
> > >     pdf[column] = _check_series_localize_timestamps(series, timezone)
> > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py", line 3487, in __setitem__
> > >     self._set_item(key, value)
> > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py", line 3565, in _set_item
> > >     NDFrame._set_item(self, key, value)
> > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/generic.py", line 3381, in _set_item
> > >     self._data.set(key, value)
> > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1090, in set
> > >     blk.set(blk_locs, value_getitem(val_locs))
> > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 380, in set
> > >     self.values[locs] = values
> > > ValueError: assignment destination is read-only
> > >
> > >
> > > It's from a test that is doing conversions from Spark to Arrow to
> > > pandas (so calling pyarrow.Table.to_pandas here:
> > > https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/conversion.py#L111-L115),
> > > and on the resulting DataFrame, it is iterating through all columns,
> > > potentially fixing timezones, and writing each column back into the
> > > DataFrame (here:
> > > https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/types.py#L179-L181).
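> > >
> > > In condensed form, the failing pattern is (simplified from the
> > > linked pyspark code):
> > >
> > >     pdf = table.to_pandas()
> > >     for column, series in pdf.items():
> > >         # returns a tz-fixed copy for timestamp columns,
> > >         # otherwise the same series unchanged
> > >         pdf[column] = _check_series_localize_timestamps(series, timezone)
> > >
> > > and it is that assignment back into pdf that hits the read-only
> > > block.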
> > >
> > > Since it is giving an error about read-only data, it might be related
> > > to the zero-copy behaviour of to_pandas, and thus to the refactor of
> > > the arrow->pandas conversion that landed yesterday
> > > (https://github.com/apache/arrow/pull/6067; it changed to do
> > > zero-copy for 1-column blocks if possible).
> > > I am not sure if something should be fixed in pyarrow for this, but
> > > the obvious thing pyspark can do is specify that it doesn't want
> > > zero-copy.
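> > >
> > > E.g., in the absence of a dedicated keyword for that, a defensive
> > > deep copy right after the conversion would make the data writable
> > > again, at the cost of one extra copy:
> > >
> > >     pdf = table.to_pandas()  # table is the pyarrow.Table
> > >     pdf = pdf.copy()  # DataFrame.copy() defaults to deep=True,
> > >                       # which yields writable NumPy arrays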
> > >
> > > Joris
> > >
