Taylor Johnson created ARROW-4633:
-------------------------------------

             Summary: ParquetFile.read(use_threads=False) creates ThreadPool anyway
                 Key: ARROW-4633
                 URL: https://issues.apache.org/jira/browse/ARROW-4633
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.12.0, 0.11.1
         Environment: Linux, Python 3.7.1, pyarrow.__version__ = 0.12.0
            Reporter: Taylor Johnson
The following code suggests that ParquetFile.read(use_threads=False) still creates a ThreadPool. The same behavior is observed in ParquetFile.read_row_group(use_threads=False). This does not appear to be a problem in pyarrow.Table.to_pandas(use_threads=False).

I've tried tracing the error. Starting in python/pyarrow/parquet.py, both ParquetReader.read_all() and ParquetReader.read_row_group() pass the use_threads input along to self.reader, which is a ParquetReader imported from _parquet.pyx.

Following the calls into python/pyarrow/_parquet.pyx, we see that ParquetReader.read_all() and ParquetReader.read_row_group() contain the following code, which seems a bit suspicious:

{quote}if use_threads:
    self.set_use_threads(use_threads){quote}

Why not just always call self.set_use_threads(use_threads)? ParquetReader.set_use_threads simply calls self.reader.get().set_use_threads(use_threads), where self.reader is a unique_ptr[FileReader]. I think this points to cpp/src/parquet/arrow/reader.cc, but I'm not sure about that. The FileReader::Impl::ReadRowGroup logic looks OK, as ::arrow::internal::GetCpuThreadPool() is only called if use_threads is true. The same is true for ReadTable. So when is the ThreadPool getting created?
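If that guard is indeed the culprit, a toy Python model illustrates the suspected mechanism. All class and attribute names here are hypothetical (the real code lives in Cython and C++); the sketch only assumes that the underlying reader defaults to threaded reads, so a use_threads=False that is never forwarded leaves the default in effect:

{quote}class ToyReader:
    """Toy model of the suspected flow: _parquet.pyx forwards
    use_threads to a C++ FileReader whose default is threaded reads."""

    def __init__(self):
        self.use_threads = True   # models the C++ reader's default
        self.pool_created = False

    def set_use_threads(self, use_threads):
        self.use_threads = use_threads

    def read_all(self, use_threads=True):
        # The suspicious guard: False is never forwarded to the reader.
        if use_threads:
            self.set_use_threads(use_threads)
        if self.use_threads:          # models the check in reader.cc
            self.pool_created = True  # models GetCpuThreadPool()

    def read_all_fixed(self, use_threads=True):
        # Proposed change: always forward the flag.
        self.set_use_threads(use_threads)
        if self.use_threads:
            self.pool_created = True

r = ToyReader()
r.read_all(use_threads=False)
print(r.pool_created)   # True: pool created despite use_threads=False

r2 = ToyReader()
r2.read_all_fixed(use_threads=False)
print(r2.pool_created)  # False: no pool once the flag is forwarded{quote}

Under this model, dropping the if and calling set_use_threads(use_threads) unconditionally would be enough to fix the reported behavior.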
Example code:
--------------------------------------------------
{quote}import pandas as pd
import psutil
import pyarrow as pa
import pyarrow.parquet as pq

use_threads = False
p = psutil.Process()
print('Starting with {} threads'.format(p.num_threads()))

df = pd.DataFrame(\{'x': [0]})
table = pa.Table.from_pandas(df)
print('After table creation, {} threads'.format(p.num_threads()))

df = table.to_pandas(use_threads=use_threads)
print('table.to_pandas(use_threads={}), {} threads'.format(use_threads, p.num_threads()))

writer = pq.ParquetWriter('tmp.parquet', table.schema)
writer.write_table(table)
writer.close()
print('After writing parquet file, {} threads'.format(p.num_threads()))

pf = pq.ParquetFile('tmp.parquet')
print('After ParquetFile, {} threads'.format(p.num_threads()))

df = pf.read(use_threads=use_threads).to_pandas()
print('After pf.read(use_threads={}), {} threads'.format(use_threads, p.num_threads())){quote}

-----------------------------------------------------------------------
$ python pyarrow_test.py
Starting with 1 threads
After table creation, 1 threads
table.to_pandas(use_threads=False), 1 threads
After writing parquet file, 1 threads
After ParquetFile, 1 threads
After pf.read(use_threads=False), 5 threads

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)