[jira] [Issue Comment Deleted] (PARQUET-1084) Parquet-C++ doesn't selectively read columns

Jim Pivarski (JIRA) Thu, 28 Sep 2017 09:01:21 -0700

     [ 
https://issues.apache.org/jira/browse/PARQUET-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jim Pivarski updated PARQUET-1084:
----------------------------------
    Comment: was deleted

(was: If the file is opened as a memory map (I don't know that I initiated 
this, but perhaps it's the default), then it would be useful to know an 
affected operating system. Here's mine:

{{% uname -a
Linux localhost 3.18.0-14875-g438cb8ab27c6 #1 SMP PREEMPT Tue Sep 12 13:55:56 
PDT 2017 x86_64 x86_64 x86_64 GNU/Linux

% lsb_release -a
LSB Version:    
core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch:core-4.1-amd64:core-4.1-noarch:cxx-3.0-amd64:cxx-3.0-noarch:cxx-3.1-amd64:cxx-3.1-noarch:cxx-3.2-amd64:cxx-3.2-noarch:cxx-4.0-amd64:cxx-4.0-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-3.1-amd64:desktop-3.1-noarch:desktop-3.2-amd64:desktop-3.2-noarch:desktop-4.0-amd64:desktop-4.0-noarch:desktop-4.1-amd64:desktop-4.1-noarch:graphics-2.0-amd64:graphics-2.0-noarch:graphics-3.0-amd64:graphics-3.0-noarch:graphics-3.1-amd64:graphics-3.1-noarch:graphics-3.2-amd64:graphics-3.2-noarch:graphics-4.0-amd64:graphics-4.0-noarch:graphics-4.1-amd64:graphics-4.1-noarch:languages-3.2-amd64:languages-3.2-noarch:languages-4.0-amd64:languages-4.0-noarch:languages-4.1-amd64:languages-4.1-noarch:multimedia-3.2-amd64:multimedia-3.2-noarch:multimedia-4.0-amd64:multimedia-4.0-noarch:multimedia-4.1-amd64:multimedia-4.1-noarch:printing-3.2-amd64:printing-3.2-noarch:printing-4.0-amd64:printing-4.0-noarch:printing-4.1-amd64:printing-4.1-noarch:qt4-3.1-amd64:qt4-3.1-noarch:security-4.0-amd64:security-4.0-noarch:security-4.1-amd64:security-4.1-noarch
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:        14.04
Codename:       trusty}}
)

> Parquet-C++ doesn't selectively read columns
> --------------------------------------------
>
>                 Key: PARQUET-1084
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1084
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>    Affects Versions: cpp-1.0.0, cpp-1.2.0
>            Reporter: Jim Pivarski
>              Labels: performance
>             Fix For: cpp-1.3.0
>
>
> I first saw this reported in a [review of file formats for 
> C++](https://indico.cern.ch/event/567550/contributions/2628878/attachments/1511966/2358123/hep-file-formats.pdf),
>  which showed that an attempt to read two columns from a Parquet file in C++ 
> resulted in the whole file, all 26 columns, being read (18th page of the PDF, 
> "15 / 25" in the bottom-right corner). That test used Parquet-C++ version 1.2.0.
> To check this, I pip-installed pyarrow (version 0.6.0), which comes with 
> Parquet-C++ version 1.0.0. I used [vmtouch](https://hoytech.com/vmtouch/) to 
> identify the fraction of pages touched, and double-checked by measuring the 
> time-to-load. The fact that it's a slow disk makes it obvious whether it's 
> reading one column or all columns.
> I'm using the same files as the presenter of that talk: 
> [B2HHH.parquet-inflated](https://cernbox.cern.ch/index.php/s/ub43DwvQIFwxfxs/download?path=%2F&files=B2HHH.parquet-inflated)
>  and 
> [B2HHH.parquet-deflated](https://cernbox.cern.ch/index.php/s/ub43DwvQIFwxfxs/download?path=%2F&files=B2HHH.parquet-deflated).
>  They have 20 double-precision columns and 6 int32 columns with no nesting, 
> 500 rows per group * 17113 row groups (8556118 rows in total) = 1.5 GB for 
> the inflated (uncompressed) file. Each column within a row group should be 
> 4000 or 2000 bytes, so reading one column should touch one or two 4k disk 
> pages per row group out of 769 disk pages per row group, depending on 
> alignment. Granularity should not be a problem, as it would be if the row 
> groups were too small.
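
The size estimates in the paragraph above can be reproduced with a few lines of arithmetic. This is only a rough sketch: it assumes plain, uncompressed encoding with no Parquet metadata or page-header overhead, a 4096-byte disk page, and that every row group except the last holds exactly 500 rows. The constant and function names are made up for illustration.

```python
# Figures quoted in the issue text.
ROWS_PER_GROUP = 500
ROW_GROUPS = 17113
TOTAL_ROWS = 8556118
N_DOUBLE_COLUMNS = 20   # 8 bytes per value
N_INT32_COLUMNS = 6     # 4 bytes per value
PAGE_SIZE = 4096        # assumption: 4 KiB disk pages

# Bytes occupied by one column inside one full row group.
double_chunk = ROWS_PER_GROUP * 8
int32_chunk = ROWS_PER_GROUP * 4

# Raw data size of the whole file, ignoring metadata and headers.
bytes_per_row = N_DOUBLE_COLUMNS * 8 + N_INT32_COLUMNS * 4
total_bytes = TOTAL_ROWS * bytes_per_row

# Rows left for the final row group if all earlier groups are full
# (an assumption, not something the issue states).
last_group_rows = TOTAL_ROWS - (ROW_GROUPS - 1) * ROWS_PER_GROUP


def pages_spanned(offset, length, page_size=PAGE_SIZE):
    """Number of disk pages touched by `length` bytes starting at `offset`."""
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return last - first + 1


print(double_chunk, int32_chunk)        # bytes per column chunk
print(total_bytes / 1e9)                # raw data size in GB
print(pages_spanned(0, double_chunk))   # page-aligned chunk
print(pages_spanned(200, double_chunk)) # chunk straddling a page boundary
```

The raw data size comes out near the 1.5 GB quoted for the inflated file, and a 4000-byte chunk touches one page when aligned and two when it straddles a boundary, which is the "one or two pages depending on alignment" estimate.
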
> *Procedure:*
> # I evicted the uncompressed file from VM cache to force reads to come from 
> disk.
> # I imported {{pyarrow.parquet}} in Python and called 
> {{read_table("data/B2HHH-inflated.parquet", ["h1_px"])}} (one column).
> # I checked to see how much of the file has been loaded into VM cache.
> # I also checked the time-to-load of one column from cold cache versus all 
> columns from cold cache.
> The result is that the entire file gets loaded into VM cache and the file 
> takes 14.6 seconds to read regardless of whether I read one column or the 
> whole file. (From warm cache it takes 4.7 seconds, so we're clearly seeing 
> the effect of disk speed.) Both methods agree that the file is _not_ being 
> selectively read, as I think it should be.
> Is there a setting that the presenter of the talk (using Parquet-C++ version 
> 1.2.0 in C++) and I (using pyarrow with Parquet-C++ 1.0.0 in Python) are both 
> missing? Is this a future feature? I would consider it to be a performance 
> bug, since a major reason for having a columnar data format is to read 
> columns selectively.
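
The procedure described in the issue could be scripted roughly as follows. This is a sketch rather than the reporter's actual commands: it assumes `vmtouch` is on the `PATH` (`vmtouch -e` evicts a file from the page cache, plain `vmtouch` reports residency) and that `pyarrow` is installed. The helper names `evict`, `report_residency`, `time_read`, and `compare` are hypothetical.

```python
import subprocess
import time


def evict(path):
    """Evict the file from the page cache so the next read hits the disk."""
    subprocess.run(["vmtouch", "-e", path], check=True)


def report_residency(path):
    """Print vmtouch's summary of how many of the file's pages are cached."""
    subprocess.run(["vmtouch", path], check=True)


def time_read(read_fn, path, columns=None):
    """Call read_fn(path, columns=columns) and return (result, seconds)."""
    start = time.perf_counter()
    result = read_fn(path, columns=columns)
    return result, time.perf_counter() - start


def compare(path, column="h1_px"):
    """Time a cold-cache read of one column against a cold-cache read of all."""
    import pyarrow.parquet as pq  # imported here so the helpers work without it

    evict(path)
    _, one_column = time_read(pq.read_table, path, columns=[column])
    report_residency(path)  # ideally only a small fraction is resident

    evict(path)
    _, all_columns = time_read(pq.read_table, path)  # columns=None reads all
    report_residency(path)

    print(f"one column: {one_column:.1f} s, all columns: {all_columns:.1f} s")
```

Calling `compare("data/B2HHH-inflated.parquet")` would then print both timings. If column selection reached the I/O layer, the first timing and the first residency report should be much smaller than the second.
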



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
