jorisvandenbossche commented on code in PR #41904:
URL: https://github.com/apache/arrow/pull/41904#discussion_r1696672592


##########
docker-compose.yml:
##########
@@ -1254,6 +1255,37 @@ services:
     volumes: *conda-volumes
     command: *python-conda-command
 
+  conda-python-no-numpy:
+    # Usage:
+    #   docker-compose build conda
+    #   docker-compose build conda-cpp
+    #   docker-compose build conda-python
+    #   docker-compose build conda-python-no-numpy
+    #   docker-compose run --rm conda-python-no-numpy
+    image: ${REPO}:${ARCH}-conda-python-${PYTHON}-no-numpy
+    build:
+      context: .
+      dockerfile: ci/docker/conda-python-pandas.dockerfile
+      cache_from:
+        - ${REPO}:${ARCH}-conda-python-${PYTHON}-pandas-${PANDAS}
+      args:
+        repo: ${REPO}
+        arch: ${ARCH}
+        python: ${PYTHON}
+    shm_size: *shm-size
+    environment:
+      <<: [*common, *ccache, *sccache]
+      PARQUET_REQUIRE_ENCRYPTION:  # inherit
+      HYPOTHESIS_PROFILE:  # inherit
+      PYARROW_TEST_HYPOTHESIS:  # inherit
+    volumes: *conda-volumes
+    command:
+      ["
+        /arrow/ci/scripts/cpp_build.sh /arrow /build &&
+        /arrow/ci/scripts/python_build.sh /arrow /build &&
+        pip uninstall numpy -y &&

Review Comment:
   Use `mamba uninstall -y numpy` instead?



##########
docker-compose.yml:
##########
@@ -1254,6 +1255,37 @@ services:
     volumes: *conda-volumes
     command: *python-conda-command
 
+  conda-python-no-numpy:
+    # Usage:
+    #   docker-compose build conda
+    #   docker-compose build conda-cpp
+    #   docker-compose build conda-python
+    #   docker-compose build conda-python-no-numpy
+    #   docker-compose run --rm conda-python-no-numpy
+    image: ${REPO}:${ARCH}-conda-python-${PYTHON}-no-numpy
+    build:
+      context: .
+      dockerfile: ci/docker/conda-python-pandas.dockerfile

Review Comment:
   Could this start from `conda-python` instead of `conda-python-pandas`? (there's no real need to install pandas here, since pandas requires numpy)



##########
python/pyarrow/array.pxi:
##########
@@ -1608,6 +1610,9 @@ cdef class Array(_PandasConvertible):
         """
         self._assert_cpu()
 
+        if np is None:
+            raise ImportError(
+                "Cannot return a numpy.ndarray if Numpy is not present")

Review Comment:
   ```suggestion
                   "Cannot return a numpy.ndarray if NumPy is not present")
   ```
   
   (I think it is either all lowercase, referring to the package import name, or the above capitalization, referring to the project)
   
   And the same for the other cases where this error is raised.



##########
python/pyarrow/tests/parquet/test_basic.py:
##########
@@ -22,7 +22,10 @@
 from shutil import copytree
 from decimal import Decimal
 
-import numpy as np
+try:
+    import numpy as np
+except ImportError:
+    np = None

Review Comment:
   You could also move this try/except into the try/except for pandas below, since numpy is only used in tests that are already skipped by a pandas mark



##########
python/pyarrow/tests/parquet/test_data_types.py:
##########
@@ -331,6 +335,7 @@ def test_column_of_lists(tempdir):
     tm.assert_frame_equal(df, df_read)
 
 
[email protected]

Review Comment:
   To what extent do we want to (in the future, not for this PR) update tests such as this one that use numpy only for some random numbers? (that could quite easily be replaced with the stdlib `random` module)



##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -579,6 +582,7 @@ def test_write_metadata(tempdir):
         )
 
 
[email protected]
 def test_table_large_metadata():

Review Comment:
   Use `range(10)` instead of `np.arange(10)` just below?



##########
python/pyarrow/tests/strategies.py:
##########
@@ -276,7 +279,7 @@ def arrays(draw, type, size=None, nullable=True):
         values = draw(npst.arrays(ty.to_pandas_dtype(), shape=(size,)))
         # Workaround ARROW-4952: no easy way to assert array equality
         # in a NaN-tolerant way.
-        values[np.isnan(values)] = -42.0
+        values[math.isnan(values)] = -42.0

Review Comment:
   Does this work? I would expect that `values` is a numpy array, and then 
`math.isnan` might not work?
   
   (need to trigger the hypothesis crossbow build)
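   For what it's worth, a quick check (assuming NumPy is installed) seems to confirm the concern: `math.isnan` only accepts scalars, so the existing `np.isnan` mask would still be needed here:

```python
import math
import numpy as np

values = np.array([1.0, float("nan"), 3.0])

# math.isnan expects a scalar; a multi-element ndarray cannot be
# coerced to a float, so this raises TypeError
raised = False
try:
    math.isnan(values)
except TypeError:
    raised = True

# the element-wise mask via np.isnan does work
values[np.isnan(values)] = -42.0
```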



##########
python/pyarrow/tests/test_array.py:
##########


Review Comment:
   (I think it still makes sense to move some of the tests (that are explicitly 
about conversion to/from numpy) to a separate file, but we can note that as a 
follow-up issue?)



##########
python/pyarrow/tests/test_compute.py:
##########
@@ -1651,6 +1665,7 @@ def test_round_binary():
         5.0, scale, round_mode="half_towards_infinity") == expect_inf
 
 
[email protected]
 def test_is_null():

Review Comment:
   Could change the `np.nan` to `float("nan")` to remove the numpy mark



##########
python/pyarrow/tests/test_io.py:
##########
@@ -1106,11 +1119,13 @@ def _check_native_file_reader(FACTORY, sample_data,
     assert f.tell() == ex_length
 
 
[email protected]
 def test_memory_map_reader(sample_disk_data):

Review Comment:
   Another to-do item for a follow-up: rewrite `sample_disk_data` to not use numpy



##########
python/pyarrow/tests/test_ipc.py:
##########
@@ -161,14 +164,17 @@ def test_empty_file():
         pa.ipc.open_file(pa.BufferReader(buf))
 
 
[email protected]
 def test_file_simple_roundtrip(file_fixture):

Review Comment:
   Another to-do: update `IpcFixture.write_batches` to not use numpy (that would avoid a lot of marks in this file, I think)



##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -4594,6 +4604,7 @@ def test_write_table_multiple_fragments(tempdir):
     )
 
 
[email protected]
 def test_write_iterable(tempdir):
     table = pa.table([
         pa.array(range(20)), pa.array(np.random.randn(20)),

Review Comment:
   Some of those `np.random.randn(20)` calls could be replaced with a utility using the stdlib? (can also be a follow-up)
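   A minimal sketch of such a stdlib-based utility (the name `randn_list` is hypothetical, not an existing helper):

```python
import random

def randn_list(n, seed=None):
    # hypothetical stand-in for np.random.randn(n): n draws from the
    # standard normal distribution, using only the stdlib
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]
```

   `pa.array(randn_list(20))` would then replace `pa.array(np.random.randn(20))` in tests like this one.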



##########
python/pyarrow/tests/test_cpp_internals.py:
##########
@@ -26,6 +31,10 @@ def inject_cpp_tests(ns):
     Inject C++ tests as Python functions into namespace `ns` (a dict).
     """
     for case in get_cpp_tests():
+        if np is None and ('numpy' in case.name or 'pandas' in case.name):
+            # Skip C++ tests that require pandas or numpy if numpy not present
+            continue

Review Comment:
   Could we add a mark to the `def wrapper(..)` function in those cases instead? (not sure whether that will work with the way this is set up, injecting the functions into the namespace)
   That would at least ensure you still see them in the list of tests when running pytest, just marked as skipped (which might be easier to understand when developing / running those tests)
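   A rough sketch of what that could look like (illustrative only; assuming the injected functions are plain callables that pytest collects):

```python
import pytest

np = None  # stand-in for the module-level "numpy is missing" state

def inject_cpp_tests(ns, case_names):
    # illustrative version of the injection loop: instead of silently
    # dropping numpy/pandas cases, attach a skip mark so they still
    # show up (as skipped) in the pytest output
    for name in case_names:
        def wrapper(name=name):
            pass  # the real wrapper would run the C++ test case
        if np is None and ("numpy" in name or "pandas" in name):
            wrapper = pytest.mark.skip(
                reason="numpy not installed")(wrapper)
        ns["test_" + name] = wrapper
```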



##########
python/pyarrow/tests/test_compute.py:
##########
@@ -1561,6 +1572,7 @@ def test_arithmetic_multiply():
     assert result.equals(expected)
 
 
[email protected]
 @pytest.mark.parametrize("ty", ["round", "round_to_multiple"])
 def test_round_to_integer(ty):

Review Comment:
   The only usage of numpy I see is `np.testing.assert_array_equal`, but since 
that is comparing pyarrow arrays, shouldn't that just be a normal pyarrow 
equals call?



##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1153,6 +1156,7 @@ def test_partitioned_dataset(tempdir):
     pq.write_table(table, path / "output.parquet")
 
 
[email protected]
 def test_dataset_read_dictionary(tempdir):
     path = tempdir / "ARROW-3325-dataset"
     t1 = pa.table([[util.rands(10) for i in range(5)] * 10], names=['f0'])

Review Comment:
   Related to my comment just above, I assume we can quite easily rewrite 
`util.rands()` slightly to not use numpy.
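   For illustration, a numpy-free variant could look like this (a sketch; the exact character set used by the real `util.rands` may differ):

```python
import random
import string

def rands(nchars):
    # hypothetical numpy-free rewrite of util.rands(): a random
    # string of ASCII letters and digits
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choice(alphabet) for _ in range(nchars))
```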



##########
python/pyarrow/tests/test_compute.py:
##########
@@ -1153,6 +1161,7 @@ def test_take_indices_types():
             arr.take(indices)
 
 
[email protected]
 def test_take_on_chunked_array():

Review Comment:
   Change `indices = np.array([0, 5, 1, 6, 9, 2])` to `indices = pa.array([0, 5, 1, 6, 9, 2])` instead?



##########
python/pyarrow/tests/test_convert_builtin.py:
##########
@@ -23,8 +23,12 @@
 import re
 
 import hypothesis as h
-import numpy as np
 import pytest
+try:
+    import numpy as np
+except ImportError:
+    pytest.skip(reason="Failures on test collection due to numpy NOT enabled",
+                allow_module_level=True)

Review Comment:
   Given that this file tests the plain Python <-> Arrow conversion (not the numpy code path), I think it would be good to use explicit markers for the subset of functions that do use numpy. (I didn't look at the file in detail to check how feasible this is, and it can certainly also be done in a follow-up.)



##########
python/pyarrow/tests/test_compute.py:
##########
@@ -87,6 +69,30 @@
 ]
 
 
[email protected]
+def all_array_types():
+    return [
+        ('bool', [True, False, False, True, True]),
+        ('uint8', np.arange(5)),

Review Comment:
   I think it is fine to change this into a fixture anyway, but it would be quite easy to change those `np.arange` calls so the fixture doesn't require numpy



##########
python/pyarrow/tests/test_csv.py:
##########
@@ -415,6 +418,7 @@ def test_header_skip_rows(self):
             "kl": [],
         }
 
+    @pytest.mark.numpy
     def test_skip_rows_after_names(self):

Review Comment:
   As mentioned earlier in the discussion, I think `make_random_csv` can relatively easily be changed to not use numpy; AFAICT this file would then not use numpy at all



##########
python/pyarrow/tests/parquet/test_basic.py:
##########
@@ -22,7 +22,10 @@
 from shutil import copytree
 from decimal import Decimal
 
-import numpy as np
+try:
+    import numpy as np
+except ImportError:
+    np = None

Review Comment:
   (although then, if someone later added an actual `pytest.mark.numpy`-marked test, that wouldn't work ...)
   Let's leave it as is then, but we could still move the try/except below, just to group the different try/except blocks together



##########
python/pyarrow/pandas_compat.py:
##########
@@ -85,27 +89,32 @@ def get_logical_type(arrow_type):
         return 'object'
 
 
-_numpy_logical_type_map = {
-    np.bool_: 'bool',
-    np.int8: 'int8',
-    np.int16: 'int16',
-    np.int32: 'int32',
-    np.int64: 'int64',
-    np.uint8: 'uint8',
-    np.uint16: 'uint16',
-    np.uint32: 'uint32',
-    np.uint64: 'uint64',
-    np.float32: 'float32',
-    np.float64: 'float64',
-    'datetime64[D]': 'date',
-    np.str_: 'string',
-    np.bytes_: 'bytes',
-}
+def get_numpy_logical_type_map():
+    global _numpy_logical_type_map
+    if not _numpy_logical_type_map:
+        _numpy_logical_type_map.update({
+            np.bool_: 'bool',
+            np.int8: 'int8',
+            np.int16: 'int16',
+            np.int32: 'int32',
+            np.int64: 'int64',
+            np.uint8: 'uint8',
+            np.uint16: 'uint16',
+            np.uint32: 'uint32',
+            np.uint64: 'uint64',
+            np.float32: 'float32',
+            np.float64: 'float64',
+            'datetime64[D]': 'date',
+            np.str_: 'string',
+            np.bytes_: 'bytes',
+        })
+    return _numpy_logical_type_map
 
 
 def get_logical_type_from_numpy(pandas_collection):
+    numpy_logical_type_map = get_numpy_logical_type_map()
     try:
-        return _numpy_logical_type_map[pandas_collection.dtype.type]
+        return numpy_logical_type_map[pandas_collection.dtype.type]

Review Comment:
   I think all functions in this file essentially assume pandas and numpy are available, without proper error checking, because this code is specifically for converting to/from pandas objects, so pandas and numpy are known to be available when it gets called.
   
   The functions from this file that actually get called from elsewhere in the pyarrow code should maybe get a better error message, but those two helpers are only used here.



##########
python/pyarrow/tests/test_table.py:
##########
@@ -972,65 +977,67 @@ def check_tensors(tensor, expected_tensor, type, size):
     assert tensor.strides == expected_tensor.strides
 
 
[email protected]('typ', [
-    np.uint8, np.uint16, np.uint32, np.uint64,
-    np.int8, np.int16, np.int32, np.int64,
-    np.float32, np.float64,
-])

Review Comment:
   If we want to keep this parametrization, we could easily use strings instead (and then convert them inside the function with `np.dtype(..)` or so).
   But no strong opinion, iterating over the options inside the test is also fine.
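   The string-based variant could look roughly like this (a sketch, assuming numpy and pytest are available; the test name and body are illustrative):

```python
import numpy as np
import pytest

# sketch: parametrize over dtype names so the module-level list does
# not touch numpy at collection time, then convert inside the test
@pytest.mark.parametrize("typ", [
    "uint8", "uint16", "uint32", "uint64",
    "int8", "int16", "int32", "int64",
    "float32", "float64",
])
def test_dtype_roundtrip(typ):
    assert np.dtype(typ).name == typ
```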



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
