aimtsou commented on PR #40220:
URL: https://github.com/apache/spark/pull/40220#issuecomment-1450063606

   @HyukjinKwon: I grepped for all the deprecated types and list my findings below. Please let me know if you see anything that should not be changed.
   
   For the deprecations introduced by NumPy 1.24.0, grepping the master branch as cloned yesterday:
   
   ```
   spark % git grep np.object0
   python/pyspark/sql/pandas/conversion.py:                                np.object0 if pandas_type is None else pandas_type
   spark % git grep np.str0
   spark % git grep np.bytes0
   spark % git grep np.void0
   spark % git grep np.int0
   spark % git grep np.uint0
   spark % git grep np.bool8
   ```
   
   As we can see, there is only one occurrence of np.object0, so we are fairly safe with respect to these NumPy changes.
   
   For the deprecations introduced by NumPy 1.20.0 (with the aliases removed in 1.24.0), grepping the master branch as cloned yesterday:
   
   ```
   spark % git grep np.float | grep -v np.float_ | grep -v np.float64 | grep -v np.float32 | grep -v np.float8 | grep -v np.float16
   mllib/src/test/scala/org/apache/spark/ml/feature/RobustScalerSuite.scala:      X = np.array([[0, 0], [1, -1], [2, -2], [3, -3], [4, -4]], dtype=np.float)
   python/docs/source/user_guide/pandas_on_spark/types.rst:np.float      DoubleType
   python/pyspark/pandas/tests/indexes/test_base.py:        self.assert_eq(psidx.astype(np.float), pidx.astype(np.float))
   python/pyspark/pandas/tests/test_series.py:        self.assert_eq(psser.astype(np.float), pser.astype(np.float))
   python/pyspark/pandas/typedef/typehints.py:    >>> def func() -> ps.DataFrame[np.float, str]:
   python/pyspark/pandas/typedef/typehints.py:    >>> def func() -> ps.DataFrame[np.float]:
   python/pyspark/pandas/typedef/typehints.py:    >>> def func() -> 'ps.DataFrame[np.float, str]':
   python/pyspark/pandas/typedef/typehints.py:    >>> def func() -> 'ps.DataFrame[np.float]':
   python/pyspark/pandas/typedef/typehints.py:    >>> def func() -> ps.DataFrame['a': np.float, 'b': int]:
   python/pyspark/pandas/typedef/typehints.py:    >>> def func() -> "ps.DataFrame['a': np.float, 'b': int]":
   spark % git grep np.str | grep -v np.str_ | grep -v np.string_
   python/docs/source/user_guide/pandas_on_spark/types.rst:np.string\_   BinaryType
   python/docs/source/user_guide/pandas_on_spark/types.rst:np.str        StringType
   python/pyspark/pandas/tests/test_typedef.py:            np.str: (np.unicode_, StringType()),
   spark % git grep np.object | grep -v np.object_
   python/pyspark/sql/pandas/conversion.py:                                np.object0 if pandas_type is None else pandas_type
   python/pyspark/sql/pandas/conversion.py:                corrected_dtypes[index] = np.object  # type: ignore[attr-defined]
   python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[1], np.object)
   python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[4], np.object)  # datetime.date
   python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[1], np.object)
   python/pyspark/sql/tests/test_dataframe.py:                self.assertEqual(types[6], np.object)
   python/pyspark/sql/tests/test_dataframe.py:                self.assertEqual(types[7], np.object)
   spark % git grep np.complex | grep -v np.complex_
   spark % git grep np.long
   spark % git grep np.unicode | grep -v np.unicode_
   python/docs/source/user_guide/pandas_on_spark/types.rst:np.unicode\_  StringType
   spark % git grep np.bool | grep -v np.bool_
   python/docs/source/user_guide/pandas_on_spark/types.rst:np.bool       BooleanType
   python/pyspark/pandas/tests/test_typedef.py:            np.bool: (np.bool, BooleanType()),
   python/pyspark/pandas/tests/test_typedef.py:            bool: (np.bool, BooleanType()),
   python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[2], np.bool)
   spark % git grep np.int | grep -v np.int_ | grep -v np.int64 | grep -v np.int32 | grep -v np.int8 | grep -v np.int16
   connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala:      // sadly we can't pinpoint specific data and invalidate cause we don't have unique id
   core/src/main/resources/org/apache/spark/ui/static/vis-timeline-graph2d.min.js:
   python/docs/source/user_guide/pandas_on_spark/types.rst:np.int        LongType
   python/pyspark/mllib/regression.py:        return np.interp(x, self.boundaries, self.predictions)  # type: ignore[arg-type]
   python/pyspark/pandas/groupby.py:        >>> def plus_max(x) -> ps.Series[np.int]:
   python/pyspark/pandas/groupby.py:        >>> def plus_length(x) -> np.int:
   python/pyspark/pandas/groupby.py:        >>> def calculation(x, y, z) -> np.int:
   python/pyspark/pandas/groupby.py:        >>> def plus_max(x) -> ps.Series[np.int]:
   python/pyspark/pandas/groupby.py:        >>> def calculation(x, y, z) -> ps.Series[np.int]:
   python/pyspark/pandas/tests/indexes/test_base.py:        self.assert_eq(psidx.astype(np.int), pidx.astype(np.int))
   python/pyspark/pandas/tests/test_series.py:        self.assert_eq(psser.astype(np.int), pser.astype(np.int))
   ```
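
   A note on the Scala and JS hits in the np.int output above: they look like regex false positives. git grep treats its pattern as a regular expression, so the unescaped dot matches any character, and a word like "pinpoint" ("np" + any character + "int") matches np.int. A quick demonstration with plain grep:

   ```shell
   text="sadly we can't pinpoint specific data"
   # Unescaped dot: "pinpoint" contains "npoint", which matches np.int.
   echo "$text" | grep -q 'np.int'  && echo "unescaped dot: match (false positive)"
   # Escaped dot: only the literal string "np.int" matches.
   echo "$text" | grep -q 'np\.int' || echo "escaped dot: no match"
   # For the real search, escaping the dot and adding -w (word boundaries)
   # would also replace the chain of "grep -v" filters (np.int64, np.int_, ...):
   #   git grep -w 'np\.int'
   ```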
   
   As you can see, the two most difficult types are np.int and np.float, where the matches even include Scala files and one JS file. I will go through the lines thoroughly and let you know.
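
   For the follow-up, each of these removed aliases was simply the Python builtin, so the builtin (or an explicit sized dtype, where the exact width matters) should be a drop-in substitution. A sketch of my reading of the intended replacements, not yet the final patch:

   ```python
   import numpy as np

   # Intended substitutions for the aliases removed in NumPy 1.24
   # (deprecated since 1.20); each alias was just the Python builtin.
   replacements = {
       "np.float": "float (or np.float64 to pin the width)",
       "np.int": "int (or np.int64)",
       "np.bool": "bool (or np.bool_)",
       "np.object": "object (or np.object_)",
       "np.str": "str (or np.str_)",
   }

   # The builtins resolve to the same dtypes the aliases did:
   assert np.dtype(float) == np.float64
   assert np.dtype(object) == np.dtype("O")
   assert np.dtype(bool) == np.bool_
   ```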

