aimtsou opened a new pull request, #40220:
URL: https://github.com/apache/spark/pull/40220

   ### Problem description
   NumPy has started removing the aliases of some of its data types. As a result, users on recent NumPy versions will hit either warnings or errors, depending on the type they use. This affects every user on numpy > 1.20.0.
   One of the types was already fixed back in September in [this pull request](https://github.com/apache/spark/pull/37817).
   
   - [numpy 1.24.0](https://github.com/numpy/numpy/pull/22607): The scalar type aliases ending in a 0 bit size (np.object0, np.str0, np.bytes0, np.void0, np.int0, np.uint0) as well as np.bool8 are now deprecated and will eventually be removed.
   - [numpy 1.20.0](https://github.com/numpy/numpy/pull/14882): Using the aliases of builtin types like np.int is deprecated.
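As a quick reference for the two waves of deprecations (names taken from the linked NumPy release notes; the mapping dicts below are purely illustrative):

```python
import numpy as np

# Builtin aliases deprecated in NumPy 1.20 (and later removed):
# the replacement is simply the Python builtin itself.
BUILTIN_ALIASES = {"np.object": "object", "np.bool": "bool", "np.int": "int"}

# 0-suffixed scalar aliases (and np.bool8) deprecated in NumPy 1.24:
# the replacement is the underscore-suffixed scalar type.
SCALAR_ALIASES = {
    "np.object0": np.object_,
    "np.str0": np.str_,
    "np.bool8": np.bool_,
}

# The underscore-suffixed names are stable across supported versions:
print(np.dtype(np.object_))  # object
```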
   
   ### What changes were proposed in this pull request?
   Since numpy 1.20.0 we receive a [deprecation warning](https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations) on np.object, and since numpy 1.22.0 an AttributeError:
   
   ```
   attr = 'object'
   
       def __getattr__(attr):
           # Warn for expired attributes, and return a dummy function
           # that always raises an exception.
           import warnings
           try:
               msg = __expired_functions__[attr]
           except KeyError:
               pass
           else:
               warnings.warn(msg, DeprecationWarning, stacklevel=2)
       
               def _expired(*args, **kwds):
                   raise RuntimeError(msg)
       
               return _expired
       
           # Emit warnings for deprecated attributes
           try:
               val, msg = __deprecated_attrs__[attr]
           except KeyError:
               pass
           else:
               warnings.warn(msg, DeprecationWarning, stacklevel=2)
               return val
       
           if attr in __future_scalars__:
               # And future warnings for those that will change, but also give
               # the AttributeError
               warnings.warn(
                   f"In the future `np.{attr}` will be defined as the "
                   "corresponding NumPy scalar.", FutureWarning, stacklevel=2)
       
           if attr in __former_attrs__:
   >           raise AttributeError(__former_attrs__[attr])
   E           AttributeError: module 'numpy' has no attribute 'object'.
   E           `np.object` was a deprecated alias for the builtin `object`. To 
avoid this error in existing code, use `object` by itself. Doing this will not 
modify any behavior and is safe. 
   E           The aliases was originally deprecated in NumPy 1.20; for more 
details and guidance see the original release note at:
   E               
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
   ```
   
   From numpy 1.24.0 we also receive a deprecation warning on np.object0 (and every other 0-suffixed scalar alias, as well as np.bool8):
   ```
   >>> np.object0(123)
   <stdin>:1: DeprecationWarning: `np.object0` is a deprecated alias for ``np.object0` is a deprecated alias for `np.object_`. `object` can be used instead.  (Deprecated NumPy 1.24)`.  (Deprecated NumPy 1.24)
   ```
   
   ### Why are the changes needed?
   The changes are needed so that PySpark remains compatible with the latest NumPy and avoids:
   
   - attribute errors on data types being deprecated from version 1.20.0: 
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
   - warnings on deprecated data types from version 1.24.0: 
https://numpy.org/devdocs/release/1.24.0-notes.html#deprecations
   
   
   ### Does this PR introduce _any_ user-facing change?
   The change will suppress the warning coming from numpy 1.24.0 and the error 
coming from numpy 1.22.0
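The fix itself boils down to swapping the removed alias for the Python builtin, which NumPy treats as the same dtype. A minimal sketch of the pattern (not the exact Spark diff):

```python
import numpy as np

# Before (raises AttributeError on NumPy >= 1.24):
#   corrected_dtype = np.object
# After: the builtin `object` is a drop-in replacement.
corrected_dtype = np.dtype(object)

print(corrected_dtype)                # object
print(corrected_dtype == np.object_)  # True
```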
   
   ### How was this patch tested?
   I hit this problem in my company's project, where our unit tests call toPandas(), which converts columns to np.object. Attaching the run result of our test:
   
   ```
   
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _
   /usr/local/lib/python3.9/dist-packages/<my-pkg>/unit/spark_test.py:64: in 
run_testcase
       self.handler.compare_df(result, expected, config=self.compare_config)
   /usr/local/lib/python3.9/dist-packages/<my-pkg>/spark_test_handler.py:38: in 
compare_df
       actual_pd = actual.toPandas().sort_values(by=sort_columns, 
ignore_index=True)
   /usr/local/lib/python3.9/dist-packages/pyspark/sql/pandas/conversion.py:232: 
in toPandas
       corrected_dtypes[index] = np.object  # type: ignore[attr-defined]
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _
   
   attr = 'object'
   
       def __getattr__(attr):
           # Warn for expired attributes, and return a dummy function
           # that always raises an exception.
           import warnings
           try:
               msg = __expired_functions__[attr]
           except KeyError:
               pass
           else:
               warnings.warn(msg, DeprecationWarning, stacklevel=2)
       
               def _expired(*args, **kwds):
                   raise RuntimeError(msg)
       
               return _expired
       
           # Emit warnings for deprecated attributes
           try:
               val, msg = __deprecated_attrs__[attr]
           except KeyError:
               pass
           else:
               warnings.warn(msg, DeprecationWarning, stacklevel=2)
               return val
       
           if attr in __future_scalars__:
               # And future warnings for those that will change, but also give
               # the AttributeError
               warnings.warn(
                   f"In the future `np.{attr}` will be defined as the "
                   "corresponding NumPy scalar.", FutureWarning, stacklevel=2)
       
           if attr in __former_attrs__:
   >           raise AttributeError(__former_attrs__[attr])
   E           AttributeError: module 'numpy' has no attribute 'object'.
   E           `np.object` was a deprecated alias for the builtin `object`. To 
avoid this error in existing code, use `object` by itself. Doing this will not 
modify any behavior and is safe. 
   E           The aliases was originally deprecated in NumPy 1.20; for more 
details and guidance see the original release note at:
   E               
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
   
   /usr/local/lib/python3.9/dist-packages/numpy/__init__.py:305: AttributeError
   ```
   
   Although I cannot share that code, the following Python session should reproduce the problem:
   ```
   >>> import numpy as np
   >>> np.object0(123)
   <stdin>:1: DeprecationWarning: `np.object0` is a deprecated alias for 
``np.object0` is a deprecated alias for `np.object_`. `object` can be used 
instead.  (Deprecated NumPy 1.24)`.  (Deprecated NumPy 1.24)
   123
   >>> np.object(123)
   <stdin>:1: FutureWarning: In the future `np.object` will be defined as the 
corresponding NumPy scalar.
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/usr/local/lib/python3.9/dist-packages/numpy/__init__.py", line 305, 
in __getattr__
       raise AttributeError(__former_attrs__[attr])
   AttributeError: module 'numpy' has no attribute 'object'.
   `np.object` was a deprecated alias for the builtin `object`. To avoid this 
error in existing code, use `object` by itself. Doing this will not modify any 
behavior and is safe. 
   The aliases was originally deprecated in NumPy 1.20; for more details and 
guidance see the original release note at:
       https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
   ```
   
   I do not have a use case in my tests for np.object0, but I fixed it as the NumPy deprecation message suggests.
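Concretely, the 0-suffixed aliases map onto the underscore-suffixed scalar types, as the deprecation message suggests (illustrative snippet):

```python
import numpy as np

# np.object0(123) becomes:
x = np.object_(123)  # returns the plain Python int 123
print(x)
```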
   
   ### Supported Versions:
   I propose including this fix in PySpark 3.3 and onwards.
   
   ### JIRA
   I know a JIRA ticket should be created; I have sent an email and am waiting for an answer so I can document the case there as well.
   
   ### Extra questions:
   By grepping for np.bool and np.object I see that the tests still use them. Shall we change them as well? Data types with a trailing underscore (e.g. np.object_, np.bool_) are, I think, not affected.
   
   ```
   git grep np.object
   python/pyspark/ml/functions.py:        return data.dtype == np.object_ and isinstance(data.iloc[0], (np.ndarray, list))
   python/pyspark/ml/functions.py:        return any(data.dtypes == np.object_) and any(
   python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[1], np.object)
   python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[4], np.object)  # datetime.date
   python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[1], np.object)
   python/pyspark/sql/tests/test_dataframe.py:                self.assertEqual(types[6], np.object)
   python/pyspark/sql/tests/test_dataframe.py:                self.assertEqual(types[7], np.object)

   git grep np.bool
   python/docs/source/user_guide/pandas_on_spark/types.rst:np.bool       BooleanType
   python/pyspark/pandas/indexing.py:            isinstance(key, np.bool_) for key in cols_sel
   python/pyspark/pandas/tests/test_typedef.py:            np.bool: (np.bool, BooleanType()),
   python/pyspark/pandas/tests/test_typedef.py:            bool: (np.bool, BooleanType()),
   python/pyspark/pandas/typedef/typehints.py:    elif tpe in (bool, np.bool_, "bool", "?"):
   python/pyspark/sql/connect/expressions.py:                assert isinstance(value, (bool, np.bool_))
   python/pyspark/sql/connect/expressions.py:                elif isinstance(value, np.bool_):
   python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[2], np.bool)
   python/pyspark/sql/tests/test_functions.py:            (np.bool_, [("true", "boolean")]),
   ```
   

