jorisvandenbossche commented on code in PR #13821:
URL: https://github.com/apache/arrow/pull/13821#discussion_r942413069
##########
python/pyarrow/tests/parquet/test_parquet_file.py:
##########
@@ -277,3 +278,77 @@ def test_pre_buffer(pre_buffer):
buf.seek(0)
pf = pq.ParquetFile(buf, pre_buffer=pre_buffer)
assert pf.read().num_rows == N
+
+
+def test_parquet_file_explicitly_closed(tmpdir):
+ """
+ Unopened files should be closed explicitly after use,
+ and previously opened files should be left open.
+ Applies to read_table, ParquetDataset, and ParquetFile
+ """
+ # create test parquet file
+ df = pd.DataFrame([{'col1': 0, 'col2': 0}, {'col1': 1, 'col2': 1}])
Review Comment:
```suggestion
table = pa.table({'col1': [0, 1], 'col2': [0, 1]})
```
We can create a pyarrow table directly instead of going through a pandas
DataFrame here.
Because pandas is an optional dependency, we also allow running the tests
without it (through a marker), so in practice every test that uses pandas is
marked with `@pytest.mark.pandas`. That also means that when pandas can
easily be avoided (as in this case, where the pandas-free version is no more
code), it's best to do so.
##########
python/pyarrow/tests/parquet/test_parquet_file.py:
##########
@@ -277,3 +278,77 @@ def test_pre_buffer(pre_buffer):
buf.seek(0)
pf = pq.ParquetFile(buf, pre_buffer=pre_buffer)
assert pf.read().num_rows == N
+
+
+def test_parquet_file_explicitly_closed(tmpdir):
+ """
+ Unopened files should be closed explicitly after use,
+ and previously opened files should be left open.
+ Applies to read_table, ParquetDataset, and ParquetFile
+ """
+ # create test parquet file
+ df = pd.DataFrame([{'col1': 0, 'col2': 0}, {'col1': 1, 'col2': 1}])
+ fn = str(tmpdir.join('file.parquet'))
+ df.to_parquet(fn)
+
+ pytest.importorskip('fsspec')
Review Comment:
Although, with the mock approach, could we also mock `NativeFile.close` to
check that it gets called? (That's the file object that is created when
using the built-in filesystems.) In that case we maybe wouldn't need the
LocalTempFile / TestFileSystem at all.
##########
python/pyarrow/tests/parquet/test_parquet_file.py:
##########
@@ -277,3 +278,77 @@ def test_pre_buffer(pre_buffer):
buf.seek(0)
pf = pq.ParquetFile(buf, pre_buffer=pre_buffer)
assert pf.read().num_rows == N
+
+
+def test_parquet_file_explicitly_closed(tmpdir):
Review Comment:
```suggestion
def test_parquet_file_explicitly_closed(tempdir):
```
We should probably update this at some point (pytest now has the better
`tmp_path` fixture), but at the moment we have an internal `tempdir` fixture
that wraps `tmpdir` in a `pathlib.Path`.
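For illustration, a minimal sketch of what such a `tempdir` fixture looks like (the actual one lives in the test suite's conftest; this version is an assumption about its shape, not a copy):

```python
import pathlib

import pytest


@pytest.fixture
def tempdir(tmpdir):
    # Wrap pytest's legacy py.path-based tmpdir in a pathlib.Path,
    # so tests can use the standard pathlib API.
    return pathlib.Path(tmpdir)


def test_parquet_file_explicitly_closed(tempdir):
    # pathlib-style joining instead of tmpdir.join(...)
    fn = tempdir / 'file.parquet'
    ...
```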
##########
python/pyarrow/tests/parquet/test_parquet_file.py:
##########
@@ -277,3 +278,77 @@ def test_pre_buffer(pre_buffer):
buf.seek(0)
pf = pq.ParquetFile(buf, pre_buffer=pre_buffer)
assert pf.read().num_rows == N
+
+
+def test_parquet_file_explicitly_closed(tmpdir):
+ """
+ Unopened files should be closed explicitly after use,
+ and previously opened files should be left open.
+ Applies to read_table, ParquetDataset, and ParquetFile
+ """
+ # create test parquet file
+ df = pd.DataFrame([{'col1': 0, 'col2': 0}, {'col1': 1, 'col2': 1}])
+ fn = str(tmpdir.join('file.parquet'))
+ df.to_parquet(fn)
+
+ pytest.importorskip('fsspec')
Review Comment:
Similarly for fsspec: that's also an optional test dependency. You properly
skipped the test when it's not available, which is fine for tests that need
it, but it might also be possible to test this slightly differently, with a
similar pattern that uses only tools from pyarrow.
For example, as prior art, we have an `open_logging_fs` fixture at
https://github.com/apache/arrow/blob/74f221c925688a1fab05c0394818256d816ccfc1/python/pyarrow/tests/test_dataset.py#L125-L154
(this logs which files are opened).
I think we can do something similar here, monkeypatching the filesystem's
`open_input_file` to return the below LocalTempFile
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]