Re: [PR] GH-41910: [Python] Add support for Pyodide [arrow]

via GitHub Fri, 07 Jun 2024 07:27:40 -0700


jorisvandenbossche commented on code in PR #37822:
URL: https://github.com/apache/arrow/pull/37822#discussion_r1631252322



##########
python/pyarrow/lib.pyx:
##########
@@ -79,6 +79,15 @@ def set_cpu_count(int count):
     check_status(SetCpuThreadPoolCapacity(count))
 
 
+def is_threading_enabled() -> bool:
+    """
+    Returns true if threading is enabled in libarrow. If it isn't enabled, 
then python shouldn't

Review Comment:
   ```suggestion
       Return True if threading is enabled in pyarrow.
   
       If it isn't enabled, then python shouldn't
   ```
   
   And a styling nitpick, but can you limit the line length a bit when you 
reflow the second paragraph?



##########
python/setup.py:
##########
@@ -40,6 +40,12 @@
 # Check if we're running 64-bit Python
 is_64_bit = sys.maxsize > 2**32
 
+is_emscripten = (
+    sysconfig.get_config_var("SOABI")
+    and sysconfig.get_config_var("SOABI").find("emscripten") != -1
+)

Review Comment:
   Is there a specific reason we use this check here, but `sys.platform == 
'emscripten':` in other places?



##########
python/scripts/run_emscripten_tests.py:
##########


Review Comment:
   I haven't looked in detail at this file, but so certainly would second your 
comment
   
   > 2\. The script (in `python/scripts` which runs the tests is non-trivial; 
at some point it would be worth upstreaming this code to somewhere pyodide 
related, but I can't work out where right now. I'm talking to pyodide people 
about it, but for now it needs to live in arrow still.
   
   Just a question about this: is there a reason this is more complicated for 
us? (I see that eg numpy is just calling pytest from the github action workflow)
   



##########
python/pyarrow/io.pxi:
##########
@@ -788,11 +847,28 @@ cdef class NativeFile(_Weakrefable):
                 write_queue.put_nowait(buf)
         finally:
             done = True
-
         writer_thread.join()
         if exc_info is not None:
             raise exc_info[0], exc_info[1], exc_info[2]
 
+    def _upload_nothreads(self, stream, buffer_size=None):
+        """
+            Internal method to do an upload without separate threads, queues 
etc.
+            Called by upload above if is_threading_enabled() == False

Review Comment:
   ```suggestion
           Internal method to do an upload without separate threads, queues etc.
           Called by upload above if is_threading_enabled() == False
   ```



##########
python/pyarrow/conftest.py:
##########
@@ -76,14 +80,21 @@
     'pandas': False,
     'parquet': False,
     'parquet_encryption': False,
+    'processes': True,
     'requires_testing_data': True,
     's3': False,
     'slow': False,
     'snappy': Codec.is_available('snappy'),
     'substrait': False,
+    'threading': is_threading_enabled(),
     'zstd': Codec.is_available('zstd'),
 }
 
+if sys.platform == "emscripten":
+    # emscripten doesn't support multiprocessing or gdb

Review Comment:
   ```suggestion
       # emscripten doesn't support multiprocessing/subprocess or gdb
   ```
   
   From seeing the tests that are marked with `processes`, it's more the use of 
`subprocess` module than `multiprocessing`? (which in practice has the same 
effect of spawning multiple processes, but I found the mention of 
multiprocessing (thinking about the stdlib module) here confusing for a moment)



##########
python/pyarrow/error.pxi:
##########
@@ -217,7 +218,10 @@ cdef class SignalStopHandler:
                 maybe_source.status().Warn()
             else:
                 self._stop_token.init(deref(maybe_source).token())
-                self._enabled = True
+                if not is_threading_enabled():
+                    self._enabled = False

Review Comment:
   The signal handler (and so ctrl-c?) just doesn't work on emscripten, or this 
is actually threading related?
   
   (potentially dumb question, I am not familiar with this part of our code 
base; this might be obvious, but it could maybe also use a clarifying code 
comment)



##########
python/pyarrow/_emscripten_timezones.py:
##########
@@ -0,0 +1,48 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from importlib import resources
+from pathlib import Path
+from shutil import copytree
+
+try:
+    import js
+
+    def setup_emscripten_timezone_database():
+        """Setup the timezone database for use on browser based
+        emscripten environments.
+        """
+        if not Path("/usr/share/zoneinfo").exists():
+            Path("/usr/share/").mkdir(parents=True, exist_ok=True)
+            try:
+                tzpath = resources.files("tzdata").joinpath("zoneinfo")
+                copytree(tzpath, "/usr/share/zoneinfo", dirs_exist_ok=True)
+                localtime_path = Path("/etc/localtime")
+                if not localtime_path.exists():
+                    # get local timezone from browser js object
+                    timezone = 
js.Intl.DateTimeFormat().resolvedOptions().timeZone
+                    if timezone and str(timezone) != "":
+                        timezone = str(timezone)
+                        # make symbolic link to local time
+                        Path("/etc/").mkdir(parents=True, exist_ok=True)
+                        localtime_path.symlink_to(tzpath / timezone)
+            except (ImportError, IOError):
+                print("Arrow couldn't install timezone db to 
/usr/share/zoneinfo")

Review Comment:
   Should we maybe use a warning.warn instead of print?



##########
python/pyarrow/io.pxi:
##########
@@ -733,11 +734,66 @@ cdef class NativeFile(_Weakrefable):
         finally:
             free(buf)
             done = True
-
         writer_thread.join()
         if exc_info is not None:
             raise exc_info[0], exc_info[1], exc_info[2]
 
+    def _download_nothreads(self, stream_or_path, buffer_size=None):
+        """
+            Internal method to do a download without separate threads, queues 
etc.
+            Called by download above if is_threading_enabled() == False

Review Comment:
   ```suggestion
           Internal method to do a download without separate threads, queues 
etc.
           Called by download above if is_threading_enabled() == False
   ```



##########
python/pyarrow/tests/test_compute.py:
##########
@@ -2481,6 +2493,10 @@ def _check_temporal_rounding(ts, values, unit):
 
 @pytest.mark.skipif(sys.platform == "win32" and not util.windows_has_tzdata(),
                     reason="Timezone database is not installed on Windows")
[email protected](
+    sys.platform == "emscripten",
+    reason="Emscripten datetime is implemented in Javascript and works 
differently"
+)

Review Comment:
   Out of curiosity, do you remember what exactly was failing here? Because 
this test is rounding timestamps, which I would think is mostly implemented by 
our own code and not relying on something platform specific



##########
python/pyarrow/tests/test_memory.py:
##########
@@ -118,6 +119,7 @@ def test_release_unused():
     pool.release_unused()
 
 
[email protected]

Review Comment:
   You added the file top-level mark, so those test specific ones can be 
removed?



##########
python/pyarrow/tests/test_array.py:
##########
@@ -34,6 +34,9 @@
 from pyarrow.vendored.version import Version
 
 
[email protected](
+    sys.platform == "emscripten", reason="Emscripten can't run processes"
+)

Review Comment:
   You added a `processes` marker I saw in contest.py. Could/should that be 
used here instead?



##########
ci/docker/conda-python-emscripten.dockerfile:
##########


Review Comment:
   You are adding this docker file, but where is it used? Because the changes 
in docker-compose.yml (still) seem to be based on an ubuntu image? 



##########
python/pyarrow/tests/test_io.py:
##########
@@ -1073,6 +1073,8 @@ def _try_delete(path):
 
 
 def test_memory_map_writer(tmpdir):
+    if sys.platform == "emscripten":
+        pytest.xfail("Multiple memory maps to same file don't work on 
emscripten")

Review Comment:
   Use a marker like in other places?



##########
python/pyarrow/tests/test_io.py:
##########
@@ -1073,6 +1073,8 @@ def _try_delete(path):
 
 
 def test_memory_map_writer(tmpdir):
+    if sys.platform == "emscripten":
+        pytest.xfail("Multiple memory maps to same file don't work on 
emscripten")

Review Comment:
   And can maybe also be a skip instead of xfail (or are there prospects that 
this would someday work?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] GH-41910: [Python] Add support for Pyodide [arrow]

Reply via email to