aimtsou commented on PR #40220:
URL: https://github.com/apache/spark/pull/40220#issuecomment-1450063606
@HyukjinKwon: I grepped for all the deprecated types and list my findings
below; please let me know if you see anything that should not be changed.
For the deprecations introduced in NumPy 1.24.0, grepping the master branch
as cloned yesterday:
```
spark % git grep np.object0
python/pyspark/sql/pandas/conversion.py:                np.object0 if pandas_type is None else pandas_type
spark % git grep np.str0
spark % git grep np.bytes0
spark % git grep np.void0
spark % git grep np.int0
spark % git grep np.uint0
spark % git grep np.bool8
```
As you can see, there is only one hit, the np.object0 in conversion.py, so we
are fairly safe with respect to these NumPy changes.
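That single hit looks mechanical to fix: per the NumPy 1.24 release notes, `np.object0` was simply an alias for `np.object_`, so a sketch of the replacement (the `pandas_type` variable is assumed from the grep context above) is a one-token change:

```python
import numpy as np

# np.object0 (deprecated in NumPy 1.24) was an alias for np.object_,
# so the conversion.py fallback keeps the same behavior after the swap.
pandas_type = None  # assumed: no pandas type was resolved for this column
dtype = np.object_ if pandas_type is None else pandas_type

# The object dtype still holds arbitrary Python values:
arr = np.array(["a", 1, None], dtype=dtype)
print(arr.dtype)  # object
```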
For the deprecations introduced in NumPy 1.20.0, whose removal landed in
NumPy 1.24.0, grepping the master branch as cloned yesterday:
```
spark % git grep np.float | grep -v np.float_ | grep -v np.float64 | grep -v np.float32 | grep -v np.float8 | grep -v np.float16
mllib/src/test/scala/org/apache/spark/ml/feature/RobustScalerSuite.scala:      X = np.array([[0, 0], [1, -1], [2, -2], [3, -3], [4, -4]], dtype=np.float)
python/docs/source/user_guide/pandas_on_spark/types.rst:np.float       DoubleType
python/pyspark/pandas/tests/indexes/test_base.py:        self.assert_eq(psidx.astype(np.float), pidx.astype(np.float))
python/pyspark/pandas/tests/test_series.py:        self.assert_eq(psser.astype(np.float), pser.astype(np.float))
python/pyspark/pandas/typedef/typehints.py:    >>> def func() -> ps.DataFrame[np.float, str]:
python/pyspark/pandas/typedef/typehints.py:    >>> def func() -> ps.DataFrame[np.float]:
python/pyspark/pandas/typedef/typehints.py:    >>> def func() -> 'ps.DataFrame[np.float, str]':
python/pyspark/pandas/typedef/typehints.py:    >>> def func() -> 'ps.DataFrame[np.float]':
python/pyspark/pandas/typedef/typehints.py:    >>> def func() -> ps.DataFrame['a': np.float, 'b': int]:
python/pyspark/pandas/typedef/typehints.py:    >>> def func() -> "ps.DataFrame['a': np.float, 'b': int]":
spark % git grep np.str | grep -v np.str_ | grep -v np.string_
python/docs/source/user_guide/pandas_on_spark/types.rst:np.string\_    BinaryType
python/docs/source/user_guide/pandas_on_spark/types.rst:np.str         StringType
python/pyspark/pandas/tests/test_typedef.py:            np.str: (np.unicode_, StringType()),
spark % git grep np.object | grep -v np.object_
python/pyspark/sql/pandas/conversion.py:                np.object0 if pandas_type is None else pandas_type
python/pyspark/sql/pandas/conversion.py:                corrected_dtypes[index] = np.object # type: ignore[attr-defined]
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[1], np.object)
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[4], np.object) # datetime.date
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[1], np.object)
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[6], np.object)
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[7], np.object)
spark % git grep np.complex | grep -v np.complex_
spark % git grep np.long
spark % git grep np.unicode | grep -v np.unicode_
python/docs/source/user_guide/pandas_on_spark/types.rst:np.unicode\_   StringType
spark % git grep np.bool | grep -v np.bool_
python/docs/source/user_guide/pandas_on_spark/types.rst:np.bool        BooleanType
python/pyspark/pandas/tests/test_typedef.py:            np.bool: (np.bool, BooleanType()),
python/pyspark/pandas/tests/test_typedef.py:            bool: (np.bool, BooleanType()),
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[2], np.bool)
spark % git grep np.int | grep -v np.int_ | grep -v np.int64 | grep -v np.int32 | grep -v np.int8 | grep -v np.int16
connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala:    // sadly we can't pinpoint specific data and invalidate cause we don't have unique id
core/src/main/resources/org/apache/spark/ui/static/vis-timeline-graph2d.min.js:
python/docs/source/user_guide/pandas_on_spark/types.rst:np.int         LongType
python/pyspark/mllib/regression.py:        return np.interp(x, self.boundaries, self.predictions) # type: ignore[arg-type]
python/pyspark/pandas/groupby.py:        >>> def plus_max(x) -> ps.Series[np.int]:
python/pyspark/pandas/groupby.py:        >>> def plus_length(x) -> np.int:
python/pyspark/pandas/groupby.py:        >>> def calculation(x, y, z) -> np.int:
python/pyspark/pandas/groupby.py:        >>> def plus_max(x) -> ps.Series[np.int]:
python/pyspark/pandas/groupby.py:        >>> def calculation(x, y, z) -> ps.Series[np.int]:
python/pyspark/pandas/tests/indexes/test_base.py:        self.assert_eq(psidx.astype(np.int), pidx.astype(np.int))
python/pyspark/pandas/tests/test_series.py:        self.assert_eq(psser.astype(np.int), pser.astype(np.int))
```
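Most of these hits should be mechanical as well: per the NumPy 1.20 deprecation notes, `np.float`, `np.int`, `np.bool`, `np.str`, and `np.object` were plain aliases for the Python builtins, so they can become `float`, `int`, `bool`, `str`, and `object` (or an explicit sized type such as `np.float64` where the width should be spelled out). A minimal sketch of what the test-suite changes amount to:

```python
import numpy as np

# astype(np.float) can be rewritten as astype(float): NumPy maps the
# builtin to its default floating-point type, float64, so the behavior
# of the pandas-on-Spark tests is unchanged.
idx = np.array([1, 2, 3])
print(idx.astype(float).dtype)  # float64

# Where a NumPy scalar type (not a builtin) is wanted, e.g. in the
# types.rst mapping tables, the explicit sized alias is the replacement:
print(np.dtype(float).type is np.float64)  # True
```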
As you can see, the two trickiest aliases are np.int and np.float, where the
hits even include Scala files and one JS file. Note, though, that the dot in
the pattern is an unescaped regex metacharacter, so some of those hits are
false positives: `pinpoint` in KafkaDataConsumer.scala matches `np.int` only
via the wildcard, and `np.interp` in regression.py matches as a prefix. I
will go through the lines thoroughly and let you know.
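As a side note, `git grep` treats its pattern as a regex, so a hypothetical tighter query, using an escaped dot plus `git grep`'s `-w` word-boundary flag, would drop the wildcard and prefix hits and replace the chain of `grep -v` filters:

```shell
# -w requires word boundaries around the match, and the escaped dot only
# matches a literal '.', so np.interp, np.int64, pinpoint, etc. are excluded.
git grep -w 'np\.int' -- '*.py'

# The same shape works for the other removed aliases:
git grep -w 'np\.float' -- '*.py'
git grep -w 'np\.bool' -- '*.py'
```

Because `_` and digits count as word characters, `-w` also subsumes the `grep -v np.int_ | grep -v np.int64 | ...` exclusions in one flag.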
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]