jorisvandenbossche commented on code in PR #13821:
URL: https://github.com/apache/arrow/pull/13821#discussion_r942413069
##########
python/pyarrow/tests/parquet/test_parquet_file.py:
##########
@@ -277,3 +278,77 @@ def test_pre_buffer(pre_buffer):
buf.seek(0)
pf = pq.ParquetFile(buf, pre_buffer=pre_buffer)
assert pf.read().num_rows == N
+
+
+def test_parquet_file_explicitly_closed(tmpdir):
+ """
+ Unopened files should be closed explicitly after use,
+ and previously opened files should be left open.
+ Applies to read_table, ParquetDataset, and ParquetFile
+ """
+ # create test parquet file
+ df = pd.DataFrame([{'col1': 0, 'col2': 0}, {'col1': 1, 'col2': 1}])
Review Comment:
```suggestion
table = pa.table({'col1': [0, 1], 'col2': [0, 1]})
```
We can create a pyarrow table directly instead of going through a pandas
DataFrame here.
Because pandas is an optional dependency, we also allow running the tests
without it (through a marker), so in practice every test that uses pandas is
marked with `@pytest.mark.pandas`. That also means that when pandas can
easily be avoided (as in this case, where the pandas-free version is no more
code), it's best to do so.
##########
python/pyarrow/tests/parquet/test_parquet_file.py:
##########
@@ -277,3 +278,77 @@ def test_pre_buffer(pre_buffer):
buf.seek(0)
pf = pq.ParquetFile(buf, pre_buffer=pre_buffer)
assert pf.read().num_rows == N
+
+
+def test_parquet_file_explicitly_closed(tmpdir):
+ """
+ Unopened files should be closed explicitly after use,
+ and previously opened files should be left open.
+ Applies to read_table, ParquetDataset, and ParquetFile
+ """
+ # create test parquet file
+ df = pd.DataFrame([{'col1': 0, 'col2': 0}, {'col1': 1, 'col2': 1}])
+ fn = str(tmpdir.join('file.parquet'))
+ df.to_parquet(fn)
+
+ pytest.importorskip('fsspec')
Review Comment:
Although, with the mock approach, could we also mock `NativeFile.close` to
check that it gets called? (That's the file object that is created when
using the built-in filesystems.) In that case we maybe wouldn't need the
LocalTempFile / TestFileSystem at all.
##########
python/pyarrow/tests/parquet/test_parquet_file.py:
##########
@@ -277,3 +278,77 @@ def test_pre_buffer(pre_buffer):
buf.seek(0)
pf = pq.ParquetFile(buf, pre_buffer=pre_buffer)
assert pf.read().num_rows == N
+
+
+def test_parquet_file_explicitly_closed(tmpdir):
Review Comment:
```suggestion
def test_parquet_file_explicitly_closed(tempdir):
```
We should probably update this at some point (pytest now has the better
`tmp_path` fixture), but at the moment we have an internal `tempdir` fixture
that wraps `tmpdir` in a `pathlib.Path`.
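For illustration, a minimal sketch of what such a `tempdir` fixture looks like (the actual one lives in the test suite's conftest; this version is an assumption about its shape, not a copy):

```python
import pathlib

import pytest


@pytest.fixture
def tempdir(tmpdir):
    # Wrap pytest's legacy py.path-based tmpdir in a pathlib.Path,
    # so tests can use the standard pathlib API.
    return pathlib.Path(tmpdir)


def test_parquet_file_explicitly_closed(tempdir):
    # pathlib-style joining instead of tmpdir.join(...)
    fn = tempdir / 'file.parquet'
    ...
```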
##########
python/pyarrow/tests/parquet/test_parquet_file.py:
##########
@@ -277,3 +278,77 @@ def test_pre_buffer(pre_buffer):
buf.seek(0)
pf = pq.ParquetFile(buf, pre_buffer=pre_buffer)
assert pf.read().num_rows == N
+
+
+def test_parquet_file_explicitly_closed(tmpdir):
+ """
+ Unopened files should be closed explicitly after use,
+ and previously opened files should be left open.
+ Applies to read_table, ParquetDataset, and ParquetFile
+ """
+ # create test parquet file
+ df = pd.DataFrame([{'col1': 0, 'col2': 0}, {'col1': 1, 'col2': 1}])
+ fn = str(tmpdir.join('file.parquet'))
+ df.to_parquet(fn)
+
+ pytest.importorskip('fsspec')
Review Comment:
Similarly for fsspec: that's also an optional test dependency. You properly
skipped the test when it's not available, which is fine for tests that need
it, but it might also be possible to test this slightly differently, with a
similar pattern that uses only tools from pyarrow.
For example, as prior art, we have an `open_logging_fs` fixture at
https://github.com/apache/arrow/blob/74f221c925688a1fab05c0394818256d816ccfc1/python/pyarrow/tests/test_dataset.py#L125-L154
(this logs which files are opened).
I think we can do something similar here, monkeypatching the filesystem's
`open_input_file` to return the below LocalTempFile
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]