Thanks Micah for driving the effort!

One thing I didn't see in your 3 points is a discussion about
extensibility and, depending on how extensible we want to be, a way to
specify which features are used by a Parquet file.

Somewhere down the discussion, it was mentioned that we could just use a
bit set covering all possible features and types, so any reader can do a
quick comparison to check whether a file might use features it does not
support.

This approach would break down if we enabled extensibility, as the list of
all possible features would then no longer be known statically. We would
therefore need a mechanism by which new extended features can be marked
without clashing with each other (e.g., a UUID per feature).
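To make the two options concrete, here is a rough sketch (Python, with
made-up names; purely illustrative, not a proposal for actual metadata):

```python
import uuid

# Hypothetical sketch: two ways a reader could check whether a file uses
# features it does not support.

# (a) Closed world: every feature known to the spec gets a fixed bit.
#     The reader ANDs the file's mask with the complement of its own.
def uses_unknown_features(file_mask: int, supported_mask: int) -> bool:
    return bool(file_mask & ~supported_mask)

# (b) Extensible world: features are identified by UUIDs, so independently
#     developed extensions cannot clash; the file lists the UUIDs it uses.
def unsupported(used: set, supported: set) -> set:
    return used - supported

# File uses feature bits 0 and 2; reader supports bits 0 and 1.
assert uses_unknown_features(0b101, 0b011)       # bit 2 is unknown
assert not uses_unknown_features(0b001, 0b011)   # all bits supported

ext_feature = uuid.uuid4()                       # some third-party feature
assert unsupported({ext_feature}, set()) == {ext_feature}
```

The catch, as noted above, is that (a) only works while the feature list is
closed; (b) trades the cheap bit-mask comparison for clash-free extension.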

However, this whole discussion is moot if we decide in the end that we do
not want extensibility in the first place. So, do we want it? And if so,
which extension points do we want? Should we create a separate discussion
thread for this as well?

Cheers
Jan

On Tue, May 28, 2024 at 7:32 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Everyone,
> Just to follow up, conversations on the summary doc [1] have largely slowed
> down.  In my mind I think of roughly three different tracks, and I'll start
> threads to get a sense of who is interested (please be on the lookout for
> discussion threads).  I think as those conversations branch naturally, the
> different tracks can figure out a way to schedule in person syncs as
> necessary.
>
> 1.  Infrastructure/Release Issues.  Goals of this track:
>     - Define how we can get newer features adopted (hopefully some variant
> of what is proposed in the doc).  One main blocker here seems to be
> finding volunteers to do releases at the right cadence.
>     - Improved testing infrastructure.  Both cross implementation and for
> downstream consumers.  I think Apache Arrow has done a lot of quality work
> in this space so hopefully that can provide some inspiration.
>     - Improved documentation on feature matrix.
>
> 2. Improved footer metadata for wide schemas and many row groups.  There
> have already been a few proposals (linked in the doc); I will post these
> in the new thread rather than here, to keep the discussion in one place.
>
> 3. Improved encodings.  I think there are a few areas to think about here:
>    - Improved heuristics for selecting encodings (most people still only
> use V1 encodings).
>    - Improved encodings (an evaluation framework is necessary)
>    - Better supporting indexed/point lookups.
>
> Thanks everyone for a great conversation so far.  Hopefully, we can
> continue to make progress.
>
> -Micah
>
>
> [1]
>
> https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit
>
> On Wed, May 22, 2024 at 2:32 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> > > *A row group can consist of one or many column chunks per column*
> > > (while before it could consist of only one)*.*
> >
> > Yes, that would work.  It's not a breaking change and makes good sense as
> > an evolution.
> >
> > I was talking with Jacques Nadeau earlier and he mentioned another
> > solution, which is to have support for a 2-pass writer.  First you write
> > the data with row groups.  Then, in the second pass, you read the data
> > one column at a time, and write a file with only 1 row group.  This
> > works too but I prefer your solution as a two-pass write might be an
> > inconvenience for some.
> >
> > > Multiple concurrent decoders can share the same page
> >
> > Yes, this is pretty much required.  Coming from Arrow I'm pretty used to
> > this kind of reference counting concept :).  That being said, your
> > overall point (a lance reader requires a lot of complexity to get
> > parallelism) is very true but I don't know that this is a bad thing.
> >
> > > Page sharing assumes that you have a fast (hopefully O(1), at worst
> > > I'd say O(log n)) way to only read parts of a page. Whether this is
> > > possible depends on the encoding. E.g., Parquet has encodings where
> > > this is possible (PLAIN) and ones where this is not possible (RLE,
> > > DELTA_*, etc.). With these encodings, two parallel read threads would
> > > need to do duplicate work to read only parts of a page.
> >
> > For plain / bit packed / ... pages this is easy, as you mention.
> >
> > RLE is a bit trickier (disclaimer: I have not implemented it yet) but
> > should be doable.  Lance has a decode loop that goes through each column
> > and figures out which pages (parts of pages) are needed for the next
> > batch (more about this in [1]).  The page decoders are stateful (they
> > are created "full" by the scheduler and then slowly drained) and so they
> > keep a pointer into the page (e.g. a pointer into the runs so we know
> > which run we are currently decoding and how many values we've already
> > taken from that run) and so it should still be O(1) in the RLE case.
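If I understand the stateful-decoder idea correctly, it could be sketched
roughly like this (hypothetical Python, not Lance's actual code; names are
made up):

```python
class RleDecoder:
    """Stateful decoder over a page of (value, run_length) pairs.

    Created "full" by the scheduler and then slowly drained: it keeps a
    cursor into the runs, so each take() resumes in O(1) where the last
    one stopped instead of rescanning the page from the start.
    """

    def __init__(self, runs):
        self.runs = runs          # list of (value, run_length)
        self.run_idx = 0          # which run we are currently decoding
        self.offset_in_run = 0    # values already taken from that run

    def take(self, n):
        out = []
        while n > 0 and self.run_idx < len(self.runs):
            value, length = self.runs[self.run_idx]
            available = length - self.offset_in_run
            step = min(n, available)
            out.extend([value] * step)
            self.offset_in_run += step
            n -= step
            if self.offset_in_run == length:   # run exhausted, advance
                self.run_idx += 1
                self.offset_in_run = 0
        return out

dec = RleDecoder([("a", 3), ("b", 2)])
assert dec.take(2) == ["a", "a"]
assert dec.take(2) == ["a", "b"]   # resumes mid-run without rescanning
assert dec.take(5) == ["b"]        # drained; only one value left
```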
> >
> > General compression / block compression, on the other hand, is pretty
> > annoying :).  I can't easily split the decode of a compressed buffer
> > into two different tasks (shoutout to niyue who's been looking into
> > compression and helped me realize this).  My current thinking for
> > compression (and encryption) is to make them a layer in between the I/O
> > and the decoder.  So basically, I have a "decompression / decrypting"
> > thread pool and I have `schedule -> I/O -> decrypt/decompress ->
> > decode`.  One benefit here is that we effectively parallelize
> > decompression / decryption across columns.  One downside is that the
> > thread that does the decryption / decompression is probably not the same
> > thread that is going to do the decoding and so we can't fuse
> > decompression, decryption, decode, and first-query-pipeline into a
> > single pipeline (we would have decryption / decode as one pipeline and
> > decode + first-query-pipeline as the second).
> >
> > [1] https://blog.lancedb.com/splitting-scheduling-from-decoding/
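For illustration, the `schedule -> I/O -> decrypt/decompress -> decode`
layering described above could be sketched like this (Python, with stand-in
functions; zlib plays the role of the block compression):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Sketch of the layering: decompression runs on its own pool between I/O
# and decode, so it is parallelized across columns even though a single
# compressed buffer cannot be split into two decode tasks.

def io_read(page_bytes):            # stand-in for the I/O stage
    return page_bytes

def decode(raw):                    # stand-in for the per-column decoder
    return raw.decode("utf-8").split(",")

def read_columns(compressed_pages):
    with ThreadPoolExecutor(max_workers=4) as pool:
        # schedule -> I/O -> decompress (on the shared pool) -> decode
        futures = [pool.submit(zlib.decompress, io_read(p))
                   for p in compressed_pages]
        return [decode(f.result()) for f in futures]

pages = [zlib.compress(b"1,2,3"), zlib.compress(b"4,5,6")]
assert read_columns(pages) == [["1", "2", "3"], ["4", "5", "6"]]
```

Note the downside mentioned above is visible here too: `decode` runs on the
calling thread, not on the pool thread that decompressed the buffer.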
> >
> > On Wed, May 22, 2024 at 12:42 PM Jan Finis <jpfi...@gmail.com> wrote:
> >
> > > Thanks Weston for the answers, very insightful! I appreciate your input
> > > very much.
> > >
> > > Some follow ups :):
> > >
> > > > My point is that one of these (compression chunk size) is a file
> > > > format concern and the other one (encoding size) is an encoding
> > > > concern.
> > >
> > >
> > > The fact that Lance V2 can still have multiple grains is interesting
> > > and so far was not clear in our discussion, thanks for the
> > > clarification! As you describe, Lance V2 allows for further "sub
> > > pages" inside a page. Parquet does not have that. Thus, if we now just
> > > drop the restriction that Parquet pages (which - at least as of today
> > > - do not allow nesting of smaller sub pages) need to be contiguous in
> > > Parquet, we would end up with something that is less powerful than
> > > Lance V2. I'm wondering how we can best arrive at something that is as
> > > powerful as the concept in Lance V2 without throwing all Parquet
> > > concepts out of the window. Here is my first draft of what this could
> > > look like:
> > >
> > > When you think it through in detail, a Lance V2 page is actually the
> > > equivalent of a Parquet column chunk,* not of a Parquet page,* as the
> > > former is the (preferred) unit of I/O while the latter is the unit of
> > > encoding. And as you describe, a Lance V2 page is the unit of I/O, not
> > the
> > > unit of encoding. That is an important point to keep in mind in our
> > > discussion, as this might otherwise lead to a lot of misunderstandings.
> > >
> > > Thus, to make Parquet column chunks as powerful as Lance V2 pages, we
> > > must allow *multiple column chunks* per column inside a Parquet row
> > > group. This should give us the desired equivalent of the advantages of
> > > Lance V2. Thus, if we want this (and I think we do), we should do the
> > > following change for Parquet V3:
> > >
> > > *A row group can consist of one or many column chunks per column*
> > > (while before it could consist of only one)*.*
> > >
> > > We should not say that Parquet pages do not need to be contiguous, as
> > > Parquet pages are *not* the grain of I/O but the grain of encoding.
> > >
> > > Do you agree with this, or did I get something wrong?
> > >
> > > (I see that Lance V2 is even more flexible than that. But I guess this
> > > is a good in-between solution that allows us to leverage the
> > > advantages of "non-contiguity" without needing to make all of
> > > Parquet's concepts pluggable, which could be too big of a change.)
> > >
> > >
> > > We do parallelize inside a file.
> > >
> > >
> > > Interesting, but this comes with challenges. How do you solve these?
> > >
> > > As page boundaries are not necessarily aligned between columns, this
> > > means that mini batches can start and end in the middle of pages. This
> > > comes with the following challenges:
> > >
> > >    - Multiple concurrent decoders can share the same page (e.g., one
> > >    decodes the first part of it and the other the last part). Thus,
> > >    either we request it two times from storage, wasting I/O bandwidth,
> > >    or we make readers share I/O requests. So we would need some kind
> > >    of reference counting or a similar technique to check how many
> > >    decoders still need a page that has been fetched from I/O before we
> > >    can free the read memory. This is possible, but it increases the
> > >    complexity of the I/O + decode stack considerably. Does the Lance
> > >    V2 format subsume a reader that is this clever? (This is not a real
> > >    downside, as I am more than happy to implement complex I/O sharing
> > >    code if it leads to a more efficient format; I just wanted to
> > >    mention this trade-off)
> > >    - Page sharing assumes that you have a fast (hopefully O(1), at
> > >    worst I'd say O(log n)) way to only read parts of a page. Whether
> > >    this is possible depends on the encoding. E.g., Parquet has
> > >    encodings where this is possible (PLAIN) and ones where this is not
> > >    possible (RLE, DELTA_*, etc.). With these encodings, two parallel
> > >    read threads would need to do duplicate work to read only parts of
> > >    a page. Also black box compression (LZ4, etc.) can only decompress
> > >    a page as a whole. How do you solve this challenge in LanceDB?
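The reference-counting idea in the first bullet could be sketched like this
(hypothetical Python; a real reader would tie this into its I/O stack):

```python
# A fetched page is freed only after every decoder that was scheduled
# against it has released it. Names are made up for illustration.

class SharedPage:
    def __init__(self, data, num_readers):
        self.data = data
        self.refcount = num_readers
        self.freed = False

    def release(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.data = None      # return the buffer to the I/O pool
            self.freed = True

page = SharedPage(data=b"...page bytes...", num_readers=2)
# decoder 1 reads rows [0, 500), decoder 2 reads rows [500, 1000)
page.release()
assert not page.freed             # decoder 2 still needs the page
page.release()
assert page.freed                 # last reader released it; memory freed
```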
> > >
> > >
> > > @Steve
> > >
> > > > Jan, assuming you are one of the authors of "Get Real: How
> > > > Benchmarks Fail to Represent the Real World", can I get a preprint
> > > > copy?  As it's relevant here. My acm library access is gone until I
> > > > upload another 10 Banksy mural photos to wikimedia.
> > > >
> > >
> > > This link should work without any library access:
> > > https://dl.acm.org/doi/pdf/10.1145/3209950.3209952 (I tried from my
> > > phone, which definitely doesn't have any ACM access). You might need
> > > to press the "PDF" button. Let me know if it doesn't work, then I'll
> > > try to find a pre-print.
> > >
> > > Cheers,
> > > Jan
> > >
> > > On Wed, May 22, 2024 at 4:10 PM Steve Loughran
> > > <ste...@cloudera.com.invalid> wrote:
> > >
> > > > On Tue, 21 May 2024 at 22:40, Jan Finis <jpfi...@gmail.com> wrote:
> > > >
> > > > > Thanks Weston for posting here!
> > > > >
> > > > > I appreciate this a lot, as it gives us the opportunity to discuss
> > > > > modern formats in depth with the authors themselves, who probably
> > > > > know the design trade-offs they took best and thus can give us a
> > > > > deeper understanding of what certain features would mean for
> > > > > Parquet.
> > > > >
> > > >
> > > > This is becoming an interesting question.
> > > >
> > > > >
> > > > > I read both your linked posts. I read them with the mindset as if
> > > > > they were the documentation for a file format that I myself would
> > > > > need to add to our engine, so I always double-checked whether I
> > > > > would agree with your reasoning and where I would see problems in
> > > > > the implementation.
> > > > >
> > > > > I ended up with some points where I cannot follow your reasoning
> > > > > yet, or where I feel clarification would be good. It would be nice
> > > > > if you could go a bit into detail here:
> > > > >
> > > > > Regarding your "parallelism without row groups" post [2]:
> > > > >
> > > > > 1. Do I understand correctly that you basically replace row groups
> > > > > with files? Thus, the task for reading row groups in parallel
> > > > > boils down to reading files in parallel. Your post does *not*
> > > > > claim that the new format would be able to parallelize *inside* a
> > > > > row group/file, correct?
> > > > >
> > > > > 2. I do not fully understand what the proposed parallelism has to
> > > > > do with the file format. As you mention yourself, files and row
> > > > > groups are basically the same thing. As such, couldn't you do the
> > > > > same "Decode Based Parallelism" also with Parquet as it is today?
> > > > > E.g., the file reader in our engine looks basically exactly like
> > > > > what you propose, employing what you call Mini Batches and not
> > > > > reading a row group as a whole (which could lead to out of memory
> > > > > in case a row group contains an insane amount of rows, so it is a
> > > > > big no-no anyway for us). It seems that the shortcomings of the
> > > > > code listed in "Our First Parallel File Reader" are solely
> > > > > shortcomings of that code, not of the underlying format.
> > > > >
> > > > > Regarding [1]:
> > > > >
> > > > > 3. This one is mostly about understanding your rationales:
> > > > >
> > > > > As one main argument for abolishing row groups, you mention that
> > > > > sizing them well is hard (I fully agree!). But since you replace
> > > > > row groups with files, don't you have the same problem for the
> > > > > file again? Small row groups/files are bad due to small I/O
> > > > > requests and metadata explosion, agree! So let's use bigger ones.
> > > > > Here you argue that Parquet readers will load the whole row group
> > > > > into memory and therefore suffer memory issues. This is a strawman
> > > > > IMHO, as this is just a shortcoming of the reader, not of the
> > > > > format. Nothing in the Parquet spec forces a reader to read a row
> > > > > group at once (and in fact, our implementation doesn't do this for
> > > > > exactly the reasons you mentioned). Just like in Lance V2, Parquet
> > > > > readers can opt to read only a few pages ahead of the decoding.
> > > > >
> > > >
> > > > Parquet-mr now supports parallel GET requests on different ranges on
> > > > S3; it'd be even better if object stores supported multiple range
> > > > requests in a single GET. Doing all this reading in a single file is
> > > > more efficient client-side than having different files open.
> > > >
> > > >
> > > > > On the writing side, I see your point that a Lance V2 writer never
> > > > > has to buffer more than a page and this is great! However, this
> > > > > seems to be just a result of allowing pages to not be contiguous,
> > > > > not of the fact that row groups were abolished. You could still
> > > > > support multiple row groups with non-contiguous pages and reap all
> > > > > the benefits you mention. Your post intermingles the two design
> > > > > choices "contiguous pages yes/no" and "row groups as horizontal
> > > > > partitions within a file yes/no". I would argue that the two
> > > > > features are basically fully orthogonal. You can have one without
> > > > > the other and vice versa.
> > > > >
> > > > > So all in all, do I see correctly that your main argument here
> > > > > basically is "don't force pages to be contiguous!"? Doing away
> > > > > with row groups is just an added bonus for easier maintenance, as
> > > > > you can just use files instead of row groups.
> > > > >
> > > > >
> > > > > 4. Considering contiguous pages and I/O granularity:
> > > > >
> > > > > The format basically proposes to have pages as the only
> > > > > granularity below a file (+ metadata & footer), while Parquet has
> > > > > two granularities: row group, or rather column chunk, and page.
> > > > > You argue that a page in Lance V2 should basically be as big as is
> > > > > necessary for good I/O performance (say, 8 MiB for Amazon S3).
> > > >
> > > >
> > > > Way bigger, please. Small files are so expensive on S3, whenever
> > > > you start doing any directory operations, bulk copies etc. Same for
> > > > the other stores.
> > > >
> > > > Jan, assuming you are one of the authors of "Get Real: How
> > > > Benchmarks Fail to Represent the Real World", can I get a preprint
> > > > copy?  As it's relevant here. My acm library access is gone until I
> > > > upload another 10 Banksy mural photos to wikimedia.
> > > >
> > > > Steve
> > > >
> > >
> >
>
