V Luong created ARROW-6796:
------------------------------

             Summary: Certain moderately-sized (~100MB) default-Snappy-compressed Parquet files take enormous memory and a long time to load via pyarrow.parquet.read_table
                 Key: ARROW-6796
                 URL: https://issues.apache.org/jira/browse/ARROW-6796
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 0.14.1
            Reporter: V Luong
My Spark workloads produce small-to-moderately-sized Parquet files with typical on-disk sizes on the order of 100-300MB, and I use PyArrow to process these files further. Surprisingly, I find that similarly-sized Parquet files sometimes take extremely different amounts of memory and time to load using pyarrow.parquet.read_table. For illustration, I've uploaded 2 such Parquet files to s3://public-parquet-test-data/fast.snappy.parquet and s3://public-parquet-test-data/slow.snappy.parquet.

Both files have about 1.2 million rows and 450 columns and occupy 100-120MB on disk. But when they are loaded by read_table:

* `fast.snappy.parquet` takes 10-15GB of memory and 5-8s to load
* `slow.snappy.parquet` takes up to 300GB (!!) of memory and 45-60s to load

Since I have been using the default Snappy compression in all my Spark jobs, it is unlikely that the files differ in their compression levels. That the on-disk sizes are similar also suggests that they are similarly compressed. So the fact that `slow.snappy.parquet` takes 10-20x the resources to read is very surprising (a metadata-inspection sketch for verifying this follows the benchmark below).

My benchmarking code snippet is below. I'd appreciate your help troubleshooting this matter.

```{python}
from pyarrow.parquet import read_metadata, read_table
from time import time
from tqdm import tqdm

FAST_PARQUET_TMP_PATH = '/tmp/fast.parquet'
SLOW_PARQUET_TMP_PATH = '/tmp/slow.parquet'

fast_parquet_metadata = read_metadata(FAST_PARQUET_TMP_PATH)
print('Fast Parquet Metadata: {}\n'.format(fast_parquet_metadata))

# Time 3 consecutive full reads of the fast file
durations = []
for _ in tqdm(range(3)):
    tic = time()
    tbl = read_table(
        source=FAST_PARQUET_TMP_PATH,
        columns=None,
        use_threads=True,
        metadata=None,
        use_pandas_metadata=False,
        memory_map=False,
        filesystem=None,
        filters=None)
    toc = time()
    durations.append(toc - tic)
print('Fast Parquet READ_TABLE(...) Durations: {}\n'
      .format(', '.join('{:.0f}s'.format(duration) for duration in durations)))

slow_parquet_metadata = read_metadata(SLOW_PARQUET_TMP_PATH)
print('Slow Parquet Metadata: {}\n'.format(slow_parquet_metadata))

# Time 3 consecutive full reads of the slow file
durations = []
for _ in tqdm(range(3)):
    tic = time()
    tbl = read_table(
        source=SLOW_PARQUET_TMP_PATH,
        columns=None,
        use_threads=True,
        metadata=None,
        use_pandas_metadata=False,
        memory_map=False,
        filesystem=None,
        filters=None)
    toc = time()
    durations.append(toc - tic)
print('Slow Parquet READ_TABLE(...) Durations: {}\n'
      .format(', '.join('{:.0f}s'.format(duration) for duration in durations)))
```
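To double-check the compression hypothesis, the two files' physical layouts can be compared via the Parquet footer metadata. Below is a minimal sketch using pyarrow's standard metadata API (`summarize_layout` is a hypothetical helper name; the `/tmp` paths are the same ones used in the benchmark) that prints row-group counts and per-row-group compression ratios:

```{python}
from pyarrow.parquet import ParquetFile

def summarize_layout(path):
    # Hypothetical helper: print row-group structure and compression
    # ratios from the Parquet footer metadata (no data pages are read).
    meta = ParquetFile(path).metadata
    print('{}: {} rows, {} columns, {} row groups'.format(
        path, meta.num_rows, meta.num_columns, meta.num_row_groups))
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        compressed = sum(rg.column(j).total_compressed_size
                         for j in range(rg.num_columns))
        uncompressed = sum(rg.column(j).total_uncompressed_size
                           for j in range(rg.num_columns))
        print('  row group {}: {} rows, {:.0f}MB compressed, '
              '{:.0f}MB uncompressed ({:.1f}x)'.format(
                  i, rg.num_rows, compressed / 1e6, uncompressed / 1e6,
                  uncompressed / compressed))

summarize_layout('/tmp/fast.parquet')
summarize_layout('/tmp/slow.parquet')
```

If `slow.snappy.parquet` turns out to have a much larger uncompressed footprint or a very different row-group structure, that would point at the file layout rather than at the reader.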
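And to make the memory comparison reproducible in-process, a rough sketch, assuming a Linux system (where `resource` reports ru_maxrss in kilobytes; on macOS it is in bytes): `pa.total_allocated_bytes()` reports Arrow's own live allocations, while peak RSS growth also captures transient decompression buffers.

```{python}
import resource

import pyarrow as pa
from pyarrow.parquet import read_table

SLOW_PARQUET_TMP_PATH = '/tmp/slow.parquet'

# Peak resident set size before and after one full read.
rss_before_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
tbl = read_table(SLOW_PARQUET_TMP_PATH)
rss_after_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print('Arrow-allocated: {:.1f}GB'.format(pa.total_allocated_bytes() / 1e9))
print('Peak RSS growth: {:.1f}GB'.format(
    (rss_after_kb - rss_before_kb) * 1024 / 1e9))
```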