Jim Pivarski created PARQUET-1084:
-------------------------------------

             Summary: Parquet-C++ doesn't selectively read columns
                 Key: PARQUET-1084
                 URL: https://issues.apache.org/jira/browse/PARQUET-1084
             Project: Parquet
          Issue Type: Bug
          Components: parquet-cpp
    Affects Versions: cpp-1.2.0, cpp-1.0.0
            Reporter: Jim Pivarski


I first saw this reported in a [review of file formats for C++](https://indico.cern.ch/event/567550/contributions/2628878/attachments/1511966/2358123/hep-file-formats.pdf), which showed that an attempt to read two columns from a Parquet file in C++ resulted in the whole file (all 26 columns) being read (page 18 of the PDF, labeled "15 / 25" in the bottom-right corner). That test used Parquet-C++ version 1.2.0.

To check this, I pip-installed pyarrow (version 0.6.0), which comes with Parquet-C++ version 1.0.0. I used [vmtouch](https://hoytech.com/vmtouch/) to identify the fraction of the file's pages touched, and double-checked by measuring the time-to-load. Since the disk is slow, the load time makes it obvious whether one column or all columns are being read.
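
For concreteness, here is a minimal sketch of that kind of cache check. It assumes vmtouch is installed and on the PATH; the exact script is illustrative, not the code I ran, and eviction with {{-e}} may need appropriate permissions:

{code:python}
# Sketch: check how much of a file is resident in the VM (page) cache,
# and evict it so the next read comes from disk.  Assumes vmtouch is
# installed; eviction (-e) may need appropriate permissions.
import subprocess

path = "data/B2HHH-inflated.parquet"

# "vmtouch FILE" prints a "Resident Pages: N/M ..." summary for the file.
print(subprocess.check_output(["vmtouch", path]).decode())

# "vmtouch -e FILE" evicts the file's pages from the page cache.
subprocess.check_call(["vmtouch", "-e", path])
{code}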

I'm using the same files as the presenter of that talk: [B2HHH.parquet-inflated](https://cernbox.cern.ch/index.php/s/ub43DwvQIFwxfxs/download?path=%2F&files=B2HHH.parquet-inflated) and [B2HHH.parquet-deflated](https://cernbox.cern.ch/index.php/s/ub43DwvQIFwxfxs/download?path=%2F&files=B2HHH.parquet-deflated). They have 20 double-precision columns and 6 int32 columns with no nesting: 500 rows per group * 17113 row groups = 8556118 rows, or 1.5 GB for the inflated (uncompressed) file. Each column within a row group should be 4000 or 2000 bytes, so reading one column should touch one or two 4k disk pages per row group out of 769 disk pages per row group, depending on alignment; granularity should not be a problem, as it would be if the row groups were too small.
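
The layout can also be confirmed from the Parquet footer itself. A sketch using pyarrow's metadata accessors (these accessors exist in current pyarrow; whether all of them are available in 0.6.0 is an assumption):

{code:python}
# Sketch: read the Parquet footer and report the row-group / column-chunk
# layout, to confirm the sizes estimated above.
import pyarrow.parquet as pq

meta = pq.ParquetFile("data/B2HHH-inflated.parquet").metadata
print("row groups:", meta.num_row_groups,
      "columns:", meta.num_columns,
      "rows:", meta.num_rows)

rg = meta.row_group(0)
print("rows in first row group:", rg.num_rows,
      "total bytes:", rg.total_byte_size)

col = rg.column(0)
print("first column chunk:", col.path_in_schema,
      "uncompressed bytes:", col.total_uncompressed_size)
{code}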

*Procedure:*
# I evicted the uncompressed file from VM cache to force reads to come from 
disk.
# I imported {{pyarrow.parquet}} in Python and called 
{{read_table("data/B2HHH-inflated.parquet", ["h1_px"])}} (one column).
# I checked how much of the file had been loaded into VM cache.
# I also checked the time-to-load of one column from cold cache versus all columns from cold cache (a sketch of this comparison follows the list).
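
The timing comparison itself is just two cold reads of the same file. A sketch of what I mean (the eviction step assumes vmtouch as above, and "h1_px" is one of the file's 26 columns; any other way of dropping the file from the page cache works as well):

{code:python}
# Sketch: compare time-to-load for one column vs. all columns, each from a
# cold cache.  Evicting with "vmtouch -e" is an assumption about the setup.
import subprocess
import time

import pyarrow.parquet as pq

path = "data/B2HHH-inflated.parquet"

def cold_read(columns):
    subprocess.check_call(["vmtouch", "-e", path])   # drop pages from cache
    start = time.time()
    table = pq.read_table(path, columns=columns)     # columns=None reads all
    return time.time() - start

print("one column:  %.1f s" % cold_read(["h1_px"]))
print("all columns: %.1f s" % cold_read(None))
{code}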

The result is that the entire file gets loaded into VM cache and the file takes 14.6 seconds to read, regardless of whether I read one column or the whole file. (From a warm cache it takes 4.7 seconds, so we're clearly seeing the effect of disk speed.) Both checks agree that the file is _not_ being selectively read, as I think it should be.

Is there a setting that the presenter of the talk (using Parquet-C++ version 
1.2.0 in C++) and I (using pyarrow with Parquet-C++ 1.0.0 in Python) are both 
missing? Is this a future feature? I would consider it to be a performance bug, 
since a major reason for having a columnar data format is to read columns 
selectively.



