jorisvandenbossche commented on code in PR #37822:
URL: https://github.com/apache/arrow/pull/37822#discussion_r1631252322
##########
python/pyarrow/lib.pyx:
##########
@@ -79,6 +79,15 @@ def set_cpu_count(int count):
check_status(SetCpuThreadPoolCapacity(count))
+def is_threading_enabled() -> bool:
+ """
+ Returns true if threading is enabled in libarrow. If it isn't enabled,
then python shouldn't
Review Comment:
```suggestion
Return True if threading is enabled in pyarrow.
If it isn't enabled, then python shouldn't
```
And a styling nitpick, but can you limit the line length a bit when you
reflow the second paragraph?
##########
python/setup.py:
##########
@@ -40,6 +40,12 @@
# Check if we're running 64-bit Python
is_64_bit = sys.maxsize > 2**32
+is_emscripten = (
+ sysconfig.get_config_var("SOABI")
+ and sysconfig.get_config_var("SOABI").find("emscripten") != -1
+)
Review Comment:
Is there a specific reason we use this check here, but `sys.platform ==
'emscripten':` in other places?
##########
python/scripts/run_emscripten_tests.py:
##########
Review Comment:
I haven't looked in detail at this file, but so certainly would second your
comment
> 2\. The script (in `python/scripts` which runs the tests is non-trivial;
at some point it would be worth upstreaming this code to somewhere pyodide
related, but I can't work out where right now. I'm talking to pyodide people
about it, but for now it needs to live in arrow still.
Just a question about this: is there a reason this is more complicated for
us? (I see that eg numpy is just calling pytest from the github action workflow)
##########
python/pyarrow/io.pxi:
##########
@@ -788,11 +847,28 @@ cdef class NativeFile(_Weakrefable):
write_queue.put_nowait(buf)
finally:
done = True
-
writer_thread.join()
if exc_info is not None:
raise exc_info[0], exc_info[1], exc_info[2]
+ def _upload_nothreads(self, stream, buffer_size=None):
+ """
+ Internal method to do an upload without separate threads, queues
etc.
+ Called by upload above if is_threading_enabled() == False
Review Comment:
```suggestion
Internal method to do an upload without separate threads, queues etc.
Called by upload above if is_threading_enabled() == False
```
##########
python/pyarrow/conftest.py:
##########
@@ -76,14 +80,21 @@
'pandas': False,
'parquet': False,
'parquet_encryption': False,
+ 'processes': True,
'requires_testing_data': True,
's3': False,
'slow': False,
'snappy': Codec.is_available('snappy'),
'substrait': False,
+ 'threading': is_threading_enabled(),
'zstd': Codec.is_available('zstd'),
}
+if sys.platform == "emscripten":
+ # emscripten doesn't support multiprocessing or gdb
Review Comment:
```suggestion
# emscripten doesn't support multiprocessing/subprocess or gdb
```
From seeing the tests that are marked with `processes`, it's more the use of
`subprocess` module than `multiprocessing`? (which in practice has the same
effect of spawning multiple processes, but I found the mention of
multiprocessing (thinking about the stdlib module) here confusing for a moment)
##########
python/pyarrow/error.pxi:
##########
@@ -217,7 +218,10 @@ cdef class SignalStopHandler:
maybe_source.status().Warn()
else:
self._stop_token.init(deref(maybe_source).token())
- self._enabled = True
+ if not is_threading_enabled():
+ self._enabled = False
Review Comment:
The signal handler (and so ctrl-c?) just doesn't work on emscripten, or this
is actually threading related?
(potentially dumb question, I am not familiar with this part of our code
base; this might be obvious, but it could maybe also use a clarifying code
comment)
##########
python/pyarrow/_emscripten_timezones.py:
##########
@@ -0,0 +1,48 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from importlib import resources
+from pathlib import Path
+from shutil import copytree
+
+try:
+ import js
+
+ def setup_emscripten_timezone_database():
+ """Setup the timezone database for use on browser based
+ emscripten environments.
+ """
+ if not Path("/usr/share/zoneinfo").exists():
+ Path("/usr/share/").mkdir(parents=True, exist_ok=True)
+ try:
+ tzpath = resources.files("tzdata").joinpath("zoneinfo")
+ copytree(tzpath, "/usr/share/zoneinfo", dirs_exist_ok=True)
+ localtime_path = Path("/etc/localtime")
+ if not localtime_path.exists():
+ # get local timezone from browser js object
+ timezone =
js.Intl.DateTimeFormat().resolvedOptions().timeZone
+ if timezone and str(timezone) != "":
+ timezone = str(timezone)
+ # make symbolic link to local time
+ Path("/etc/").mkdir(parents=True, exist_ok=True)
+ localtime_path.symlink_to(tzpath / timezone)
+ except (ImportError, IOError):
+ print("Arrow couldn't install timezone db to
/usr/share/zoneinfo")
Review Comment:
Should we maybe use a warning.warn instead of print?
##########
python/pyarrow/io.pxi:
##########
@@ -733,11 +734,66 @@ cdef class NativeFile(_Weakrefable):
finally:
free(buf)
done = True
-
writer_thread.join()
if exc_info is not None:
raise exc_info[0], exc_info[1], exc_info[2]
+ def _download_nothreads(self, stream_or_path, buffer_size=None):
+ """
+ Internal method to do a download without separate threads, queues
etc.
+ Called by download above if is_threading_enabled() == False
Review Comment:
```suggestion
Internal method to do a download without separate threads, queues
etc.
Called by download above if is_threading_enabled() == False
```
##########
python/pyarrow/tests/test_compute.py:
##########
@@ -2481,6 +2493,10 @@ def _check_temporal_rounding(ts, values, unit):
@pytest.mark.skipif(sys.platform == "win32" and not util.windows_has_tzdata(),
reason="Timezone database is not installed on Windows")
[email protected](
+ sys.platform == "emscripten",
+ reason="Emscripten datetime is implemented in Javascript and works
differently"
+)
Review Comment:
Out of curiosity, do you remember what exactly was failing here? Because
this test is rounding timestamps, which I would think is mostly implemented by
our own code and not relying on something platform specific
##########
python/pyarrow/tests/test_memory.py:
##########
@@ -118,6 +119,7 @@ def test_release_unused():
pool.release_unused()
[email protected]
Review Comment:
You added the file top-level mark, so those test specific ones can be
removed?
##########
python/pyarrow/tests/test_array.py:
##########
@@ -34,6 +34,9 @@
from pyarrow.vendored.version import Version
[email protected](
+ sys.platform == "emscripten", reason="Emscripten can't run processes"
+)
Review Comment:
You added a `processes` marker I saw in contest.py. Could/should that be
used here instead?
##########
ci/docker/conda-python-emscripten.dockerfile:
##########
Review Comment:
You are adding this docker file, but where is it used? Because the changes
in docker-compose.yml (still) seem to be based on an ubuntu image?
##########
python/pyarrow/tests/test_io.py:
##########
@@ -1073,6 +1073,8 @@ def _try_delete(path):
def test_memory_map_writer(tmpdir):
+ if sys.platform == "emscripten":
+ pytest.xfail("Multiple memory maps to same file don't work on
emscripten")
Review Comment:
Use a marker like in other places?
##########
python/pyarrow/tests/test_io.py:
##########
@@ -1073,6 +1073,8 @@ def _try_delete(path):
def test_memory_map_writer(tmpdir):
+ if sys.platform == "emscripten":
+ pytest.xfail("Multiple memory maps to same file don't work on
emscripten")
Review Comment:
And can maybe also be a skip instead of xfail (or are there prospects that
this would someday work?)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]