hi all,

Just to add some of my perspective (I would like to write up some
longer-form thoughts at some point -- I've been collaborating and talking
with the Nimble and Lance folks, and as a result I know a lot about the
details of Nimble, BtrBlocks, and also the recent Bullion research format
from UMD/ByteDance -- and I've been consulting/advising on some of the
research work that's been referenced).

Firstly, I 100% agree that documenting implementation support, details,
and cross-compatibility is essential. It would have been better for Parquet
to have had integration tests from day one between Impala and parquet-mr,
but that never happened, and so there was some impedance mismatch between
the two halves of the initial Parquet community going back to the early
days. When I started working on Parquet in 2015, the motivation was mainly
to fill the urgent need to be able to read these files from C++ for use in
Python (and eventually R and other C++-consuming languages).

As for the issues in Parquet:

- The all-or-nothing footer decoding for datasets with large schemas or
many row groups has always been problematic (I've been asked to present
quantitative evidence to support this "problematic" statement, so I will
try to produce some!). So I think any work that does not make it much
cheaper to read a single column from a single row group is very nearly
dead on arrival (there is a short pyarrow sketch after these bullets
illustrating the footer cost). I am not sure how you fully make this
problem go away in generality without doing away with Thrift at the footer
level, but at that point you are making such a disruptive change that why
not try to fix some other problems as well? If you go down that rabbit
hole, you have created a new file format that is no longer Parquet, and so
calling it ParquetV3 is probably misleading.

- Parquet's data page format has worked well over time, but aside from
fixing the metadata overhead issue, the data page itself needs to be
extensible. There is DATA_PAGE_V2, but structurally it is the same as
DATA_PAGE{_V1} with the repetition and definition levels kept outside of
the compressed portion. You can kind of think of Parquet's data page
structure as one possible choice of options in a general-purpose nested
encoding scheme (most implementations do dictionary+RLE and fall back on
plain encoding when the dictionary exceeds a certain size -- the second
sketch after these bullets shows this fallback in practice). We could
create a DATA_PAGE_V3 that allows for a whole alternate -- and even
pluggable -- encoding scheme without changing the metadata, and this would
be valuable to the Parquet community, even if most mainstream Parquet
users (e.g. Spark) will opt not to use it for a period of some years for
compatibility reasons.

- Another problem that I haven't seen mentioned (but maybe I just missed
it) is that Parquet is very painful to decode on accelerators like GPUs.
RAPIDS has created a CUDA implementation of Parquet decoding (including
decoding the Thrift data page headers on the GPU), but there are two
primary problems: 1) metadata that is necessary for host-side control flow
lives within the ColumnChunk in the row group, and 2) the format does not
provide sufficient memory preallocation hints -- i.e. how much memory you
need to allocate to fully decode a data page. This is also discussed in
https://github.com/facebookincubator/nimble/discussions/50
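
To make the footer point above concrete, here is a minimal pyarrow sketch
(file and column names are made up); the key thing is that the entire
Thrift footer must be decoded before any targeted read can happen:

    import pyarrow.parquet as pq

    # Opening the file forces a full Thrift decode of the footer: metadata
    # for every column chunk of every row group is materialized, even
    # though we only want one column from one row group below.
    pf = pq.ParquetFile("example.parquet")  # hypothetical file
    md = pf.metadata
    print(md.num_row_groups, md.num_columns, md.serialized_size)

    # The read itself is narrow, but the footer cost above was already
    # paid, and it grows roughly with num_row_groups x num_columns.
    table = pf.read_row_group(0, columns=["a"])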
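
And to show the dictionary-with-plain-fallback behavior mentioned in the
data page bullet, another pyarrow sketch (again with made-up file/column
names): forcing a tiny dictionary page limit makes the fallback visible in
the column chunk metadata:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Write a high-cardinality string column with a deliberately tiny
    # dictionary page limit so the writer falls back from dictionary+RLE
    # to PLAIN encoding partway through the column chunk.
    table = pa.table({"s": [f"value-{i}" for i in range(100_000)]})
    pq.write_table(table, "fallback.parquet",
                   use_dictionary=True,
                   dictionary_pagesize_limit=1024)

    # The column chunk metadata records both encodings, exposing the
    # fallback.
    col = pq.ParquetFile("fallback.parquet").metadata.row_group(0).column(0)
    print(col.encodings)  # e.g. ('PLAIN', 'RLE', 'RLE_DICTIONARY')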

Personally, I struggle to see how the metadata issues are fixable within
the current format -- at least in a satisfactory fashion where we could
get behind calling something ParquetV3, when it would basically be a new
file format masquerading as a major version of an existing one. It also
adds a lot of implementation complexity for anyone setting out to support
"Parquet".

I think there is significant value in developing + researching accelerated
"codecs" (basically, new data page formats -- think about how h.264 and
h.265 have superseded MPEG-2 in video encoding) and finding a way to
incorporate them into Parquet, e.g. with a new DATA_PAGE_V3 page type or
similar. It would be ideal for Parquet and its implementations to continue
to improve.

That said, it's unclear that Parquet as a file container for encoded data
can be evolved to satisfactorily resolve all of the above issues, and I
don't think it needs to. It seems inevitable that we will end up with new
file containers and implementations, but the ideal scenario would be to
develop reusable "codec" libraries (like the nested encoding scheme in
Nimble or in BtrBlocks -- they're very similar) and then use them in
multiple places.

Anyway, it's good to see many opinions on this and I look forward to
continued dialogue.

Thanks
Wes

On Wed, May 15, 2024 at 7:56 AM Steve Loughran <ste...@cloudera.com.invalid>
wrote:

> On Tue, 14 May 2024 at 17:48, Julien Le Dem <jul...@apache.org> wrote:
>
> > +1 on Micah starting a doc and following up by commenting in it.
> >
>
> +maybe some conf call where people of interest can talk about it.
>
>
>
> >
> > @Raphael, Wish Maple: agreed that changing the metadata representation is
> > less important. Most engines can externalize and index metadata in some
> > way.
>
>
> works if queries against specific tables are always routed to those
> servers, the indices fit in memory and the servers stay up. once things
> become more agile that doesn't hold any more.
>
> This is why I've not investigated the idea of having the filesystem
> connector (s3a, abfs...) cache footers to local fs across multiple
> streams/between opening files, even as they now all move to support some
> form of footer caching to boost ORC/Parquet performance for apps which seek
> to the end repeatedly. The larger the worker pool: lower probability of
> reuse; the more files you have the more space any caching takes up.
>
>
> > It is an option to propose a standard way to do it without changing
> > the format.
>
>
> +1
>
>
> > Adding new encodings or make existing encodings more
> > parallelizable is something that needs to be in the format and more
> useful.
> >
>
> One of the things I'd like to see from Micah's work is some list of what
> new data types and encodings people think are needed.
>
>
>
>
>
> >
> > On Tue, May 14, 2024 at 9:26 AM Antoine Pitrou <anto...@python.org>
> wrote:
> >
> > > On Mon, 13 May 2024 16:10:24 +0100
> > > Raphael Taylor-Davies
> > > <r.taylordav...@googlemail.com.INVALID>
> > > wrote:
> > > >
> > > > I guess I wonder if rather than having a parquet format version 2, or
> > > > even a parquet format version 3, we could just document what
> features a
> > > > given parquet implementation actually supports. I believe Andrew
> > intends
> > > > to pick up on where previous efforts here left off.
> > >
> > > I also believe documenting implementation status is strongly desirable,
> > > regardless of whether the discussion on "V3" goes anywhere.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> >
>
