[ 
https://issues.apache.org/jira/browse/ARROW-6796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6796.
-------------------------------
    Resolution: Duplicate

Dup of ARROW-6060

> Certain moderately-sized (~100MB) default-Snappy-compressed Parquet files 
> take enormous memory and long time to load by pyarrow.parquet.read_table
> --------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-6796
>                 URL: https://issues.apache.org/jira/browse/ARROW-6796
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 0.14.1
>            Reporter: V Luong
>            Priority: Critical
>             Fix For: 0.15.0
>
>
> My Spark workloads produce small-to-moderately-sized Parquet files with 
> typical on-disk sizes in the order of 100-300MB, and I use PyArrow to process 
> these files further.
> Surprisingly, I find that similarly-sized Parquet files sometimes take vastly 
> different amounts of memory and time to load using 
> pyarrow.parquet.read_table. For illustration, I've uploaded 2 such parquet 
> files to s3://public-parquet-test-data/fast.snappy.parquet and 
> s3://public-parquet-test-data/slow.snappy.parquet.
> Both files have about 1.2 million rows and 450 columns and occupy 100-120MB 
> on disk. But when they are loaded by read_table:
>  * fast.snappy.parquet takes 10-15GB of memory and 5-8s to load
>  * slow.snappy.parquet takes up to 300GB (!!) of memory and 45-60s to load
> Since I have been using the default Snappy compression in all my Spark jobs, 
> it is unlikely that the files differ in the their compression levels. That 
> the on-disk sizes are similar suggests that they are similarly compressed. So 
> the fact that slow.snappy.parquet takes 10-20x amounts of resources to read 
> is very surprising.
> My benchmarking code snippet is below. I'd appreciate your help to 
> troubleshoot this matter.
> ```{python}
> from pyarrow.parquet import read_metadata, read_table
> from time import time
> from tqdm import tqdm
> ​
> ​
> FAST_PARQUET_TMP_PATH = '/tmp/fast.snappy.parquet'
> SLOW_PARQUET_TMP_PATH = '/tmp/slow.snappy.parquet'
> ​
> ​
> fast_parquet_metadata = read_metadata(FAST_PARQUET_TMP_PATH)
> print('Fast Parquet Metadata: {}\n'.format(fast_parquet_metadata))
> ​
> durations = []
> for _ in tqdm(range(3)):
>     tic = time()
>     tbl = read_table(
>             source=FAST_PARQUET_TMP_PATH,
>             columns=None,
>             use_threads=True,
>             metadata=None,
>             use_pandas_metadata=False,
>             memory_map=False,
>             filesystem=None,
>             filters=None)
>     toc = time()
>     durations.append(toc-tic)
> print('Fast Parquet READ_TABLE(...) Durations: {}\n'
>       .format(', '.join('{:.0f}s'.format(duration) for duration in 
> durations)))
> ​
> ​
> slow_parquet_metadata = read_metadata(SLOW_PARQUET_TMP_PATH)
> print('Slow Parquet Metadata: {}\n'.format(slow_parquet_metadata))
> ​
> durations = []
> for _ in tqdm(range(3)):
>     tic = time()
>     tbl = read_table(
>             source=SLOW_PARQUET_TMP_PATH,
>             columns=None,
>             use_threads=True,
>             metadata=None,
>             use_pandas_metadata=False,
>             memory_map=False,
>             filesystem=None,
>             filters=None)
>     toc = time()
>     durations.append(toc - tic)
> print('Slow Parquet READ_TABLE(...) Durations: {}\n'
>       .format(', '.join('{:.0f}s'.format(duration) for duration in 
> durations)))
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to