[ https://issues.apache.org/jira/browse/ARROW-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David McGuire updated ARROW-10029:
----------------------------------
    Description:
@martindurant good news (for you): I have a repro test case that is 100% {{pyarrow}}, so it looks like {{s3fs}} is not involved.

@jorisvandenbossche how should I follow up with this, based on {{pyarrow.filesystem.LocalFileSystem}}?

Viewing the File System *directories* as a tree, one thread is required for every non-leaf node, in order to avoid deadlock.

1) dataset
2) dataset/foo=1
3) dataset/foo=1/bar=2
4) dataset/foo=1/bar=2/baz=0
5) dataset/foo=1/bar=2/baz=1
6) dataset/foo=1/bar=2/baz=2
*) dataset/foo=1/bar=2/baz=0/qux=false
*) dataset/foo=1/bar=2/baz=1/qux=false
*) dataset/foo=1/bar=2/baz=1/qux=true
*) dataset/foo=1/bar=2/baz=0/qux=true
*) dataset/foo=1/bar=2/baz=2/qux=false
*) dataset/foo=1/bar=2/baz=2/qux=true

{quote}
import pyarrow.parquet as pq
import pyarrow.filesystem as fs

class LoggingLocalFileSystem(fs.LocalFileSystem):
    def walk(self, path):
        print(path)
        return super().walk(path)

fs = LoggingLocalFileSystem()
dataset_url = "dataset"

threads = 6
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False, metadata_nthreads=threads)
print(len(dataset.pieces))

threads = 5
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False, metadata_nthreads=threads)
print(len(dataset.pieces))
{quote}

*_
Call with 6 threads completes.
Call with 5 threads hangs indefinitely._*

{quote}
$ python repro.py
dataset
dataset/foo=1
dataset/foo=1/bar=2
dataset/foo=1/bar=2/baz=0
dataset/foo=1/bar=2/baz=1
dataset/foo=1/bar=2/baz=2
dataset/foo=1/bar=2/baz=0/qux=false
dataset/foo=1/bar=2/baz=0/qux=true
dataset/foo=1/bar=2/baz=1/qux=false
dataset/foo=1/bar=2/baz=1/qux=true
dataset/foo=1/bar=2/baz=2/qux=false
dataset/foo=1/bar=2/baz=2/qux=true
6
dataset
dataset/foo=1
dataset/foo=1/bar=2
dataset/foo=1/bar=2/baz=0
dataset/foo=1/bar=2/baz=1
dataset/foo=1/bar=2/baz=2
^C
...
KeyboardInterrupt
^C
...
KeyboardInterrupt
{quote}

**NOTE:** this *also* happens with the un-decorated {{LocalFileSystem}}, and when omitting the {{filesystem}} argument.

  was:
@martindurant good news (for you): I have a repro test case that is 100% {{pyarrow}}, so it looks like {{s3fs}} is not involved.

@jorisvandenbossche how should I follow up with this, based on {{pyarrow.filesystem.LocalFileSystem}}?

{quote}
import pyarrow.parquet as pq
import pyarrow.filesystem as fs

class LoggingLocalFileSystem(fs.LocalFileSystem):
    def walk(self, path):
        print(path)
        return super().walk(path)

fs = LoggingLocalFileSystem()
dataset_url = "dataset"

# Viewing the File System *directories* as a tree, one thread is required for every non-leaf node,
# in order to avoid deadlock
# 1) dataset
# 2) dataset/foo=1
# 3) dataset/foo=1/bar=2
# 4) dataset/foo=1/bar=2/baz=0
# 5) dataset/foo=1/bar=2/baz=1
# 6) dataset/foo=1/bar=2/baz=2
# *) dataset/foo=1/bar=2/baz=0/qux=false
# *) dataset/foo=1/bar=2/baz=1/qux=false
# *) dataset/foo=1/bar=2/baz=1/qux=true
# *) dataset/foo=1/bar=2/baz=0/qux=true
# *) dataset/foo=1/bar=2/baz=2/qux=false
# *) dataset/foo=1/bar=2/baz=2/qux=true

# This completes
threads = 6
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False, metadata_nthreads=threads)
print(len(dataset.pieces))

# This hangs indefinitely
threads = 5
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False, metadata_nthreads=threads)
print(len(dataset.pieces))
{quote}

{quote}
$ python repro.py
dataset
dataset/foo=1
dataset/foo=1/bar=2
dataset/foo=1/bar=2/baz=0
dataset/foo=1/bar=2/baz=1
dataset/foo=1/bar=2/baz=2
dataset/foo=1/bar=2/baz=0/qux=false
dataset/foo=1/bar=2/baz=0/qux=true
dataset/foo=1/bar=2/baz=1/qux=false
dataset/foo=1/bar=2/baz=1/qux=true
dataset/foo=1/bar=2/baz=2/qux=false
dataset/foo=1/bar=2/baz=2/qux=true
6
dataset
dataset/foo=1
dataset/foo=1/bar=2
dataset/foo=1/bar=2/baz=0
dataset/foo=1/bar=2/baz=1
dataset/foo=1/bar=2/baz=2
^C
...
KeyboardInterrupt
^C
...
KeyboardInterrupt
{quote}

**NOTE:** this *also* happens with the un-decorated {{LocalFileSystem}}, and when omitting the {{filesystem}} argument.


> Deadlock in the interaction of pyarrow FileSystem and ParquetDataset
> --------------------------------------------------------------------
>
>                 Key: ARROW-10029
>                 URL: https://issues.apache.org/jira/browse/ARROW-10029
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: David McGuire
>            Priority: Major
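
The thread-count rule in the description (one thread per non-leaf directory) is consistent with a common thread-pool pattern: a task submits its children to the same bounded pool and then blocks until they finish, so every blocked parent holds a worker that the leaves would need. The following is a minimal standard-library sketch of that pattern only. It is not pyarrow's implementation, and the report does not confirm that this is the cause. The in-memory {{TREE}}, {{visit}}, {{count_directories}}, {{PoolStarved}}, the submit lock, and the timeout used to escape the hang are all hypothetical names introduced for illustration, and the sketch assumes the root directory is visited on the calling thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical in-memory stand-in for the directory tree in the report:
# each non-leaf path maps to its child directories.
TREE = {"dataset": ["dataset/foo=1"], "dataset/foo=1": ["dataset/foo=1/bar=2"]}
BAR = "dataset/foo=1/bar=2"
TREE[BAR] = [f"{BAR}/baz={i}" for i in range(3)]
for baz in TREE[BAR]:
    TREE[baz] = [f"{baz}/qux={q}" for q in ("false", "true")]


class PoolStarved(RuntimeError):
    """A task gave up waiting for children that never got a worker."""


# Assumption for the demo only: serialising submissions keeps the queue
# order (and therefore the outcome) reproducible from run to run.
_submit_lock = threading.Lock()


def visit(pool, path, timeout):
    """Count `path` plus its descendants, visiting children on the shared pool."""
    with _submit_lock:
        children = [pool.submit(visit, pool, child, timeout)
                    for child in TREE.get(path, [])]
    # Blocking here keeps the current worker occupied until all children finish.
    done, pending = wait(children, timeout=timeout)
    if pending:
        # A real hang would wait forever; the timeout lets the demo exit.
        for future in pending:
            future.cancel()
        raise PoolStarved(path)
    return 1 + sum(future.result() for future in done)


def count_directories(threads, timeout=1.0):
    """Return the number of directories, or None if the walk starved the pool."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        try:
            # The root is visited on the calling thread (an assumption).
            return visit(pool, "dataset", timeout)
        except PoolStarved:
            return None


if __name__ == "__main__":
    print(count_directories(6))  # 12: one worker stays free for the leaves
    print(count_directories(5))  # None: all five workers block on children
```

In this sketch the five non-leaf directories below the root each hold one worker while waiting, so a sixth worker is needed to run the leaf visits. That matches the 6-versus-5 threshold observed above, but only under the stated assumptions.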