[
https://issues.apache.org/jira/browse/ARROW-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David McGuire updated ARROW-10029:
----------------------------------
Description:
@martindurant good news (for you): I have a repro test case that is 100%
{{pyarrow}}, so it looks like {{s3fs}} is not involved.
@jorisvandenbossche how should I follow up with this, based on
{{pyarrow.filesystem.LocalFileSystem}}?
{quote}
import pyarrow.parquet as pq
import pyarrow.filesystem as fs
class LoggingLocalFileSystem(fs.LocalFileSystem):
    def walk(self, path):
        print(path)
        return super().walk(path)
fs = LoggingLocalFileSystem()
dataset_url = "dataset"
# Viewing the File System *directories* as a tree, one thread is required for
# every non-leaf node, in order to avoid deadlock
# 1) dataset
# 2) dataset/foo=1
# 3) dataset/foo=1/bar=2
# 4) dataset/foo=1/bar=2/baz=0
# 5) dataset/foo=1/bar=2/baz=1
# 6) dataset/foo=1/bar=2/baz=2
# *) dataset/foo=1/bar=2/baz=0/qux=false
# *) dataset/foo=1/bar=2/baz=1/qux=false
# *) dataset/foo=1/bar=2/baz=1/qux=true
# *) dataset/foo=1/bar=2/baz=0/qux=true
# *) dataset/foo=1/bar=2/baz=2/qux=false
# *) dataset/foo=1/bar=2/baz=2/qux=true
# This completes
threads = 6
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False,
                            metadata_nthreads=threads)
print(len(dataset.pieces))
# This hangs indefinitely
threads = 5
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False,
                            metadata_nthreads=threads)
print(len(dataset.pieces))
{quote}
{quote}
$ python repro.py
dataset
dataset/foo=1
dataset/foo=1/bar=2
dataset/foo=1/bar=2/baz=0
dataset/foo=1/bar=2/baz=1
dataset/foo=1/bar=2/baz=2
dataset/foo=1/bar=2/baz=0/qux=false
dataset/foo=1/bar=2/baz=0/qux=true
dataset/foo=1/bar=2/baz=1/qux=false
dataset/foo=1/bar=2/baz=1/qux=true
dataset/foo=1/bar=2/baz=2/qux=false
dataset/foo=1/bar=2/baz=2/qux=true
6
dataset
dataset/foo=1
dataset/foo=1/bar=2
dataset/foo=1/bar=2/baz=0
dataset/foo=1/bar=2/baz=1
dataset/foo=1/bar=2/baz=2
^C
...
KeyboardInterrupt
^C
...
KeyboardInterrupt
{quote}
**NOTE:** this *also* happens with the un-decorated {{LocalFileSystem}}, and
when omitting the {{filesystem}} argument.
was:
@martindurant good news (for you): I have a repro test case that is 100%
`pyarrow`, so it looks like `s3fs` is not involved.
@jorisvandenbossche how should I follow up with this, based on
`pyarrow.filesystem.LocalFileSystem`?
```python
import pyarrow.parquet as pq
import pyarrow.filesystem as fs
class LoggingLocalFileSystem(fs.LocalFileSystem):
    def walk(self, path):
        print(path)
        return super().walk(path)
fs = LoggingLocalFileSystem()
dataset_url = "dataset"
# Viewing the File System *directories* as a tree, one thread is required for
# every non-leaf node, in order to avoid deadlock
# 1) dataset
# 2) dataset/foo=1
# 3) dataset/foo=1/bar=2
# 4) dataset/foo=1/bar=2/baz=0
# 5) dataset/foo=1/bar=2/baz=1
# 6) dataset/foo=1/bar=2/baz=2
# *) dataset/foo=1/bar=2/baz=0/qux=false
# *) dataset/foo=1/bar=2/baz=1/qux=false
# *) dataset/foo=1/bar=2/baz=1/qux=true
# *) dataset/foo=1/bar=2/baz=0/qux=true
# *) dataset/foo=1/bar=2/baz=2/qux=false
# *) dataset/foo=1/bar=2/baz=2/qux=true
# This completes
threads = 6
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False,
metadata_nthreads=threads)
print(len(dataset.pieces))
# This hangs indefinitely
threads = 5
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False,
metadata_nthreads=threads)
print(len(dataset.pieces))
```
```bash
$ python repro.py
dataset
dataset/foo=1
dataset/foo=1/bar=2
dataset/foo=1/bar=2/baz=0
dataset/foo=1/bar=2/baz=1
dataset/foo=1/bar=2/baz=2
dataset/foo=1/bar=2/baz=0/qux=false
dataset/foo=1/bar=2/baz=0/qux=true
dataset/foo=1/bar=2/baz=1/qux=false
dataset/foo=1/bar=2/baz=1/qux=true
dataset/foo=1/bar=2/baz=2/qux=false
dataset/foo=1/bar=2/baz=2/qux=true
6
dataset
dataset/foo=1
dataset/foo=1/bar=2
dataset/foo=1/bar=2/baz=0
dataset/foo=1/bar=2/baz=1
dataset/foo=1/bar=2/baz=2
^C
...
KeyboardInterrupt
^C
...
KeyboardInterrupt
```
**NOTE:** this *also* happens with the un-decorated `LocalFileSystem`, and when
omitting the `filesystem` argument.
> Deadlock in the interaction of pyarrow FileSystem and ParquetDataset
> --------------------------------------------------------------------
>
> Key: ARROW-10029
> URL: https://issues.apache.org/jira/browse/ARROW-10029
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.1
> Reporter: David McGuire
> Priority: Major
>