[
https://issues.apache.org/jira/browse/PARQUET-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jim Pivarski updated PARQUET-1084:
----------------------------------
Comment: was deleted
(was: If the file is opened as a memory map (I don't know that I initiated
this, but perhaps it's the default), then it would be useful to know an
affected operating system. Here's mine:
{{% uname -a
Linux localhost 3.18.0-14875-g438cb8ab27c6 #1 SMP PREEMPT Tue Sep 12 13:55:56
PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
% lsb_release -a
LSB Version:
core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch:core-4.1-amd64:core-4.1-noarch:cxx-3.0-amd64:cxx-3.0-noarch:cxx-3.1-amd64:cxx-3.1-noarch:cxx-3.2-amd64:cxx-3.2-noarch:cxx-4.0-amd64:cxx-4.0-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-3.1-amd64:desktop-3.1-noarch:desktop-3.2-amd64:desktop-3.2-noarch:desktop-4.0-amd64:desktop-4.0-noarch:desktop-4.1-amd64:desktop-4.1-noarch:graphics-2.0-amd64:graphics-2.0-noarch:graphics-3.0-amd64:graphics-3.0-noarch:graphics-3.1-amd64:graphics-3.1-noarch:graphics-3.2-amd64:graphics-3.2-noarch:graphics-4.0-amd64:graphics-4.0-noarch:graphics-4.1-amd64:graphics-4.1-noarch:languages-3.2-amd64:languages-3.2-noarch:languages-4.0-amd64:languages-4.0-noarch:languages-4.1-amd64:languages-4.1-noarch:multimedia-3.2-amd64:multimedia-3.2-noarch:multimedia-4.0-amd64:multimedia-4.0-noarch:multimedia-4.1-amd64:multimedia-4.1-noarch:printing-3.2-amd64:printing-3.2-noarch:printing-4.0-amd64:printing-4.0-noarch:printing-4.1-amd64:printing-4.1-noarch:qt4-3.1-amd64:qt4-3.1-noarch:security-4.0-amd64:security-4.0-noarch:security-4.1-amd64:security-4.1-noarch
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty}}
)
> Parquet-C++ doesn't selectively read columns
> --------------------------------------------
>
> Key: PARQUET-1084
> URL: https://issues.apache.org/jira/browse/PARQUET-1084
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-1.0.0, cpp-1.2.0
> Reporter: Jim Pivarski
> Labels: performance
> Fix For: cpp-1.3.0
>
>
> I first saw this reported in a [review of file formats for
> C++](https://indico.cern.ch/event/567550/contributions/2628878/attachments/1511966/2358123/hep-file-formats.pdf),
> which showed that an attempt to read two columns from a Parquet file in C++
> resulted in the whole file, all 26 columns, being read (18th page of the PDF, "15
> / 25" in the bottom-right corner). That test used Parquet-C++ version 1.2.0.
> To check this, I pip-installed pyarrow (version 0.6.0), which comes with
> Parquet-C++ version 1.0.0. I used [vmtouch](https://hoytech.com/vmtouch/) to
> identify the fraction of pages touched, and double-checked by measuring the
> time-to-load. The fact that it's a slow disk makes it obvious whether it's
> reading one column or all columns.
> I'm using the same files as the presenter of that talk:
> [B2HHH.parquet-inflated](https://cernbox.cern.ch/index.php/s/ub43DwvQIFwxfxs/download?path=%2F&files=B2HHH.parquet-inflated)
> and
> [B2HHH.parquet-deflated](https://cernbox.cern.ch/index.php/s/ub43DwvQIFwxfxs/download?path=%2F&files=B2HHH.parquet-deflated).
> They have 20 double-precision columns and 6 int32 columns with no nesting,
> about 500 rows per group * 17113 row groups = 8556118 rows = 1.5 GB for the
> inflated (uncompressed) file. Each column within a row group should be 4000
> or 2000 bytes, so reading one column should be one or two 4k disk pages per
> row group out of 769 disk pages per row group, depending on alignment.
> Granularity should not be a problem, as it would be if the row groups were
> too small.
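
The size estimates above can be checked with a little arithmetic. This is only a rough sketch: it assumes plain (unencoded, uncompressed) values with no page headers or file metadata, and a 4096-byte disk page. The helper name `pages_touched` is made up for illustration.

```python
# Back-of-the-envelope check of the sizes quoted in the issue.
# Assumptions (not from the issue): raw 8-byte doubles and 4-byte int32s,
# no Parquet page headers or metadata, and 4096-byte disk pages.

ROWS = 8556118
ROW_GROUPS = 17113
N_DOUBLE, N_INT32 = 20, 6
ROWS_PER_GROUP = 500          # nominal; the last row group holds fewer rows
PAGE_SIZE = 4096              # assumed disk page size in bytes

double_chunk_bytes = ROWS_PER_GROUP * 8   # one double column in one row group
int32_chunk_bytes = ROWS_PER_GROUP * 4    # one int32 column in one row group

# Raw payload of the whole file, ignoring all format overhead.
total_bytes = ROWS * (N_DOUBLE * 8 + N_INT32 * 4)


def pages_touched(offset, length, page_size=PAGE_SIZE):
    """Number of disk pages overlapped by `length` bytes starting at `offset`."""
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return last - first + 1


if __name__ == "__main__":
    print("bytes per double column chunk:", double_chunk_bytes)
    print("bytes per int32 column chunk:", int32_chunk_bytes)
    print("raw file payload in GB:", total_bytes / 1e9)
    print("pages for an aligned double chunk:", pages_touched(0, double_chunk_bytes))
    print("pages for an unaligned double chunk:", pages_touched(100, double_chunk_bytes))
```

Depending on where a column chunk starts relative to a page boundary, a 4000-byte chunk overlaps one or two pages, which is the "one or two 4k disk pages" figure quoted above.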
> *Procedure:*
> # I evicted the uncompressed file from VM cache to force reads to come from
> disk.
> # I imported {{pyarrow.parquet}} in Python and called
> {{read_table("data/B2HHH-inflated.parquet", ["h1_px"])}} (one column).
> # I checked to see how much of the file had been loaded into VM cache.
> # I also checked the time-to-load of one column from cold cache versus all
> columns from cold cache.
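
The timing part of the procedure could be scripted roughly as follows. This is a minimal sketch, not the exact commands used in the report: `timed_read` and `compare_reads` are hypothetical helper names, and the reader and cache-eviction steps are passed in as callables so the pieces can be swapped.

```python
import time


def timed_read(read_fn, path, columns=None):
    """Call read_fn(path, columns=columns) and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = read_fn(path, columns=columns)
    return time.perf_counter() - start, result


def compare_reads(read_fn, evict_fn, path, column):
    """Time a one-column read and a full read, evicting the cache before each.

    read_fn  -- callable like pyarrow.parquet.read_table (path, columns=...)
    evict_fn -- callable that drops `path` from the OS page cache
    Returns (one_column_seconds, all_columns_seconds).
    """
    evict_fn(path)
    one_s, _ = timed_read(read_fn, path, [column])
    evict_fn(path)
    all_s, _ = timed_read(read_fn, path, None)
    return one_s, all_s


# Example wiring (assumes pyarrow and vmtouch are installed, and that
# `vmtouch -e` is permitted to evict the file on this system):
#
#   import subprocess
#   import pyarrow.parquet as pq
#
#   def evict(path):
#       subprocess.run(["vmtouch", "-e", path], check=True)
#
#   print(compare_reads(pq.read_table, evict,
#                       "data/B2HHH-inflated.parquet", "h1_px"))
```

If columns were being read selectively, the first number would be expected to come out much smaller than the second on a slow disk; running `vmtouch` on the file after the one-column read gives the resident fraction for the independent check.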
> The result is that the entire file gets loaded into VM cache and the file
> takes 14.6 seconds to read regardless of whether I read one column or the
> whole file. (From warm cache it takes 4.7 seconds, so we're clearly seeing
> the effect of disk speed.) Both methods agree that the file is _not_ being
> selectively read, as I think it should be.
> Is there a setting that the presenter of the talk (using Parquet-C++ version
> 1.2.0 in C++) and I (using pyarrow with Parquet-C++ 1.0.0 in Python) are both
> missing? Is this a future feature? I would consider it to be a performance
> bug, since a major reason for having a columnar data format is to read
> columns selectively.