> My point is that one of these (compression chunk size) is a file format concern and the other one (encoding size) is an encoding concern.
Slight typo :) I mean "page size" is a file format concern and "compression chunk size" is an encoding concern.

On Tue, May 21, 2024 at 6:40 PM Weston Pace <weston.p...@gmail.com> wrote:

> Thank you for your questions! I think your understanding is very solid.
>
> > Do I understand correctly that you basically replace row groups with files. Thus, the task for reading row groups in parallel boils down to reading files in parallel.
>
> Partly. I recommend files for inter-process parallelism (e.g. distributing a query across multiple worker processes).
>
> > Your post does *not* claim that the new format would be able to parallelize *inside* a row group/file, correct?
>
> We do parallelize inside a file. Imagine a simple file with 3 int32 columns, 2 million rows per column, and one 8MiB page per column.
>
> First we would issue 3 parallel I/O requests to read those 3 pages. Once those three requests complete we then create decode tasks. I tend to think about things in "query engine" terms. Query engines often process data in relatively small batches. 10k rows is a common batch size (probably because 10k TPC-H rows fit nicely inside CPU caches). For example, duckdb uses either 10k or 2k (I think 10k for parquet and 2k for its internal format, or maybe it breaks the 10k into 2k-sized chunks). Datafusion uses 8k chunks. Acero uses 32k, but that is probably too big.
>
> Since we have 2 million rows we can create 200 thread tasks, each with 10k rows, which gives us plenty of parallelism. The first thing these thread tasks do is decode the data. Larger files with more complex columns and many pages follow pretty much the same strategy (but it gets quite complicated).
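To make that scheduling concrete, here is a minimal sketch of the decode-based parallelism described above, using only the Rust standard library. read_page and decode_batch are stand-ins invented for illustration, not actual Lance or Parquet reader APIs, and a real reader would hand the tasks to a thread pool rather than spawning raw threads:

use std::sync::Arc;
use std::thread;

const ROWS_PER_COLUMN: usize = 2_000_000;
const BATCH_SIZE: usize = 10_000; // "query engine" sized batches -> 200 decode tasks

// Stand-in for reading one ~8MiB page of plain-encoded int32 values from storage.
fn read_page(_column: usize) -> Vec<u8> {
    vec![0u8; ROWS_PER_COLUMN * 4]
}

// Decode BATCH_SIZE rows starting at `row_offset` from each column's page.
fn decode_batch(pages: &[Vec<u8>], row_offset: usize) -> Vec<Vec<i32>> {
    pages
        .iter()
        .map(|page| {
            page[row_offset * 4..(row_offset + BATCH_SIZE) * 4]
                .chunks_exact(4)
                .map(|b| i32::from_le_bytes([b[0], b[1], b[2], b[3]]))
                .collect()
        })
        .collect()
}

fn main() {
    // Step 1: issue the three page reads in parallel, one per column.
    let io_handles: Vec<_> = (0..3).map(|col| thread::spawn(move || read_page(col))).collect();
    let pages: Arc<Vec<Vec<u8>>> = Arc::new(io_handles.into_iter().map(|h| h.join().unwrap()).collect());

    // Step 2: once the pages have arrived, fan out 200 decode tasks of 10k rows each.
    let decode_handles: Vec<_> = (0..ROWS_PER_COLUMN / BATCH_SIZE)
        .map(|batch| {
            let pages = Arc::clone(&pages);
            thread::spawn(move || decode_batch(&pages, batch * BATCH_SIZE))
        })
        .collect();

    for handle in decode_handles {
        let _decoded_batch = handle.join().unwrap(); // each batch would flow into the query engine here
    }
}

The file layout only has to guarantee that each column's page can be fetched in one large request; after that, the available parallelism comes from the decode step, not from row group boundaries.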
>
> > As such, couldn't you do the same "Decode Based Parallelism" also with Parquet as it is today?
>
> Yes, absolutely, and I would recommend it.
>
> > I do not fully understand what the proposed parallelism has to do with the file format.
>
> Nothing, you are correct. In my blog posts I am generally discussing "designing a high performing file reader" and not "designing a new file format". My concerns with the parquet format are only the A/B/C points I mentioned in my previous email.
>
> > So all in all, do I see correctly that your main argument here basically is "don't force pages to be contiguous!". Doing away with row groups is just an added bonus for easier maintenance, as you can just use files instead of row groups.
>
> Absolutely. As long as pages don't have to be contiguous then I am happy. I'd simply always create parquet files with 1 row group.
>
> > As such, it seems that the two grains that Parquet has benefit us, as they give us flexibility of both being able to scan with large requests and doing point accesses without too much read amplification by using small single-page requests.
>
> Again, you are spot on. Both grains are needed. My point is that one of these (compression chunk size) is a file format concern and the other one (encoding size) is an encoding concern. In fact, there is a third grain, which is the size used for zone maps in pushdown indices. There may be many other grains too (e.g. run end encoded arrays need skip tables for point lookups). Parquet forces all of these to be the same value (page size).
>
> The solution to this problem in Lance (if you want to do some kind of block compression) is to create a single 8MB page. That page would consist of many compressed chunks stored contiguously. If the compression has variable sized blocks (not sure if this is a thing, I don't know compression well) then you need to store block sizes in a column metadata buffer. If the blocks are fixed size then you need to store how many rows are in each block in a column metadata buffer. A compression encoder can then pick whatever compression chunk size makes the most sense.
>
> So, for example, you could have 64KB compression chunks, zone maps for every 1024 rows, and 8MB pages for I/O.
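A sketch of the per-column metadata that the "one large page, many compression chunks" idea implies; the struct and field names are invented for illustration and are not the actual Lance column metadata layout:

/// Column metadata buffer for a single large (e.g. 8MB) page that holds many
/// independently compressed chunks laid out back to back.
struct ChunkedPageMeta {
    /// Byte offset of each chunk within the page (supports variable-sized
    /// compressed blocks); chunk i spans chunk_offsets[i]..chunk_offsets[i + 1].
    chunk_offsets: Vec<u64>,
    /// Number of rows stored in each chunk.
    chunk_row_counts: Vec<u64>,
}

impl ChunkedPageMeta {
    /// For a point lookup, find the byte range to decompress and the row's
    /// position inside that chunk, without touching the rest of the page.
    fn chunk_for_row(&self, row: u64) -> Option<(std::ops::Range<u64>, u64)> {
        let mut rows_seen = 0u64;
        for (i, rows_in_chunk) in self.chunk_row_counts.iter().enumerate() {
            if row < rows_seen + rows_in_chunk {
                let bytes = self.chunk_offsets[i]..self.chunk_offsets[i + 1];
                return Some((bytes, row - rows_seen));
            }
            rows_seen += rows_in_chunk;
        }
        None
    }
}

fn main() {
    // Two 64KB chunks of 16,384 int32 rows each.
    let meta = ChunkedPageMeta {
        chunk_offsets: vec![0, 65_536, 131_072],
        chunk_row_counts: vec![16_384, 16_384],
    };
    // Row 20,000 lives in the second chunk: decompress bytes 65_536..131_072 only.
    assert_eq!(meta.chunk_for_row(20_000), Some((65_536..131_072, 3_616)));
}

A scan still fetches the whole 8MB page with a single request, while a point lookup only has to decompress the one ~64KB chunk that contains the target row.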
>
> On Tue, May 21, 2024 at 2:40 PM Jan Finis <jpfi...@gmail.com> wrote:
>
>> Thanks Weston for posting here!
>>
>> I appreciate this a lot, as it gives us the opportunity to discuss modern formats in depth with the authors themselves, who probably know the design trade-offs they took best and thus can give us a deeper understanding of what certain features would mean for Parquet.
>>
>> I read both your linked posts. I read them with the mindset as if they were the documentation for a file format that I myself would need to add to our engine, so I always double-checked whether I would agree with your reasoning and where I would see problems in the implementation.
>>
>> I ended up with some points where I cannot yet follow your reasoning, or where I feel clarification would be good. It would be nice if you could go a bit into detail here:
>>
>> Regarding your "parallelism without row groups" post [2]:
>>
>> 1. Do I understand correctly that you basically replace row groups with files. Thus, the task for reading row groups in parallel boils down to reading files in parallel. Your post does *not* claim that the new format would be able to parallelize *inside* a row group/file, correct?
>>
>> 2. I do not fully understand what the proposed parallelism has to do with the file format. As you mention yourself, files and row groups are basically the same thing. As such, couldn't you do the same "Decode Based Parallelism" also with Parquet as it is today? E.g., the file reader in our engine looks basically exactly like what you propose, employing what you call Mini Batches and not reading a whole row group at once (which could lead to out of memory in case a row group contains an insane amount of rows, so it is a big no-no anyway for us). It seems that the shortcomings of the code listed in "Our First Parallel File Reader" are solely shortcomings of that code, not of the underlying format.
>>
>> Regarding [1]:
>>
>> 3. This one is mostly about understanding your rationales:
>>
>> As one main argument for abolishing row groups, you mention that sizing them well is hard (I fully agree!). But since you replace row groups with files, don't you have the same problem for the file again? Small row groups/files are bad due to small I/O requests and metadata explosion, agree! So let's use bigger ones. Here you argue that Parquet readers will load the whole row group into memory and therefore suffer memory issues. This is a strawman IMHO, as this is just a shortcoming of the reader, not of the format. Nothing in the Parquet spec forces a reader to read a row group at once (and in fact, our implementation doesn't do this for exactly the reasons you mentioned). Just like in LanceV2, Parquet readers can opt to read only a few pages ahead of the decoding.
>>
>> On the writing side, I see your point that a Lance V2 writer never has to buffer more than a page and this is great! However, this seems to be just a result of allowing pages to not be contiguous, not of the fact that row groups were abolished. You could still support multiple row groups with non-contiguous pages and reap all the benefits you mention. Your post intermingles the two design choices "contiguous pages yes/no" and "row groups as horizontal partitions within a file yes/no". I would argue that the two features are basically fully orthogonal. You can have one without the other and vice versa.
>>
>> So all in all, do I see correctly that your main argument here basically is "don't force pages to be contiguous!". Doing away with row groups is just an added bonus for easier maintenance, as you can just use files instead of row groups.
>>
>> 4. Considering contiguous pages and I/O granularity:
>>
>> The format basically proposes to have pages as the only granularity below a file (+ metadata & footer), while Parquet has two granularities: Row group, or rather Column Chunk, and Page. You argue that a page in Lance V2 should basically be as big as is necessary for good I/O performance (say, 8 MiB for Amazon S3). Thus, the Parquet counterpart of a Lance V2 page would actually be - at least in terms of I/O efficiency - a Parquet Column Chunk. A Parquet page can instead be quite small, as it does not need to be the grain of the I/O but just the grain of the encoding.
>>
>> The fact that Parquet has these two grains has advantages when considering a scan vs. a point look-up. When doing a scan, we can load whole column chunks at once, having large I/O requests to not overwhelm the I/O with too many requests. When doing a point access, we can use the page & offset index to find and load only the one page (per column) in which the row we are looking for is located.
>>
>> As such, it seems that the two grains that Parquet has benefit us, as they give us flexibility of both being able to scan with large requests and doing point accesses without too much read amplification by using small single-page requests. With Lance V2, either I make large pages to make scans take fewer I/O requests (e.g., 8 MiB), but then I will have large read amplification for point accesses, or I make my pages quite small to benefit point accesses, but then scans will need to emit tons of I/O operations, which is what you are trying to avoid. How does Lance V2 solve this challenge? Or did I understand the format wrong here?
>>
>> Cheers,
>> Jan
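The trade-off in point 4 can be made concrete with the offset index. The sketch below uses a hand-rolled stand-in for the per-page locations (Parquet's offset index records the same offset / compressed size / first row information, but the real types live in the Thrift metadata and the reader APIs): a scan turns into one large request per column chunk, a point lookup into a single small one.

// `PageLoc` mirrors the information in Parquet's offset index (page offset,
// compressed size, first row index) but is a hand-rolled stand-in, not the
// parquet-rs / parquet-cpp API.
struct PageLoc {
    offset: u64,
    compressed_page_size: u64,
    first_row_index: u64,
}

// A scan reads the whole column chunk with one large request.
fn scan_range(pages: &[PageLoc]) -> std::ops::Range<u64> {
    let first = pages.first().expect("column chunk has at least one page");
    let last = pages.last().unwrap();
    first.offset..last.offset + last.compressed_page_size
}

// A point access reads only the single page that contains `row`
// (assumes `row` is not smaller than the first page's first_row_index).
fn point_lookup_range(pages: &[PageLoc], row: u64) -> std::ops::Range<u64> {
    let idx = pages.partition_point(|p| p.first_row_index <= row) - 1;
    let page = &pages[idx];
    page.offset..page.offset + page.compressed_page_size
}

fn main() {
    // A toy column chunk: four 64KB pages of 10k rows each, starting at byte 4096.
    let pages: Vec<PageLoc> = (0..4u64)
        .map(|i| PageLoc {
            offset: 4_096 + i * 65_536,
            compressed_page_size: 65_536,
            first_row_index: i * 10_000,
        })
        .collect();

    assert_eq!(scan_range(&pages), 4_096..266_240);                   // one ~256KB request
    assert_eq!(point_lookup_range(&pages, 25_000), 135_168..200_704); // one 64KB request
}

Weston's reply above keeps the large page as the I/O grain and moves the smaller grains (compression chunks, zone-map intervals) into per-column metadata buffers, which is how the single-grain design aims to serve both access patterns.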
>>
>> On Tue, May 21, 2024 at 6:07 PM Weston Pace <weston.p...@gmail.com> wrote:
>>
>> > As the author of one of these new formats I'll chime in. The main issues I have with parquet are:
>> >
>> > A. Pages in a column chunk must be contiguous (this is Lance's biggest issue with parquet)
>> > B. Encodings should be extensible
>> > C. Flexibility in what is considered data / metadata
>> >
>> > I outline my reasoning for these in [1] and so I'll avoid repeating that here. I think B has been discussed pretty thoroughly in this thread.
>> >
>> > As for C, a format should be flexible, and then it is pretty straightforward. If a file is likely to be used in "search" (very selective filters, ability to cache, etc.) then lots of data should be put in the column metadata. If the file is mostly for cold full scans then almost nothing should go in column metadata (either don't write the metadata at all or, I guess, you can put it in the data pages). The format shouldn't force a choice.
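As a rough illustration of point C, this flexibility amounts to a per-file (or per-column) writer choice like the one sketched below; these option names are hypothetical and do not come from an existing Lance or Parquet API:

/// Where optional per-column statistics (zone maps, bloom filters, ...) end up.
enum StatsPlacement {
    /// "Search" files: keep stats in the column metadata so a reader can prune
    /// pages after one small metadata read.
    ColumnMetadata,
    /// Cold-full-scan files: keep the metadata tiny and inline any stats next
    /// to the data pages.
    WithDataPages,
    /// Don't write stats at all.
    Omitted,
}

struct ColumnWriteOptions {
    page_bytes: usize,                   // the I/O grain, e.g. 8 * 1024 * 1024
    compression_chunk_bytes: usize,      // the encoding grain, e.g. 64 * 1024
    zone_map_interval_rows: Option<u64>, // the pushdown grain, e.g. 1024
    stats_placement: StatsPlacement,
}

fn main() {
    // The same schema written for two very different workloads.
    let _search = ColumnWriteOptions {
        page_bytes: 8 * 1024 * 1024,
        compression_chunk_bytes: 64 * 1024,
        zone_map_interval_rows: Some(1024),
        stats_placement: StatsPlacement::ColumnMetadata,
    };
    let _cold_scan = ColumnWriteOptions {
        page_bytes: 8 * 1024 * 1024,
        compression_chunk_bytes: 64 * 1024,
        zone_map_interval_rows: None,
        stats_placement: StatsPlacement::Omitted,
    };
}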
>> >
>> > Personally, I am more excited about A than I am about B & C (though I do think both B & C should be addressed if going through the trouble of a new format). Addressing A lets us get rid of row groups, allows for APIs such as "array-at-a-time writing", lets us make large data pages, and generally leads to more foolproof files.
>> >
>> > I agree with Andrew that any discussion of B & C is usually based on assumptions rather than concrete measurements of reader performance. In the scattered profiling I've done of parquet-cpp and parquet-rs I've found that poor parquet reader performance typically has very little to do with B & C. Actually, I would guess that the most widespread (though not necessarily most important) obstacle to parquet has been user knowledge. To get the best performance from a reader, users need to be familiar not just with the format but also with the features available in a particular reader. I think simplifying the user experience should be a secondary goal for any new changes.
>> >
>> > At the risk of arrogant self-promotion I would recommend people read [1] for inspiration if nothing else. I'm also hoping to detail design decisions and tradeoffs that we come across (starting in [2] and continuing throughout the summer).
>> >
>> > [1] https://blog.lancedb.com/lance-v2/
>> > [2] https://blog.lancedb.com/file-readers-in-depth-parallelism-without-row-groups/
>> >
>> > On Mon, May 20, 2024 at 11:06 AM Parth Chandra <par...@apache.org> wrote:
>> >
>> > > Hi Parquet team,
>> > >
>> > > It is very exciting to see this effort. Thanks Micah for starting this.
>> > >
>> > > For most use cases that our team sees, the broad areas for improvement appear to be:
>> > > 1) Optimizing for cloud storage (latency is high, seeks are expensive)
>> > > 2) Optimized metadata reading - we've seen 30% (sometimes more) of Spark's scan operator time spent in reading footers.
>> > > 3) Anything that improves support for data lakes.
>> > >
>> > > Also I'll be happy to help wherever I can.
>> > >
>> > > Parth
>> > >
>> > > On Sun, May 19, 2024 at 10:59 AM Xinli shang <sha...@uber.com.invalid> wrote:
>> > >
>> > > > Sorry I am late to the party! It's great to see this discussion! Thank you everyone for the many good points and thank you, Micah, for starting the discussion and putting it together into a document, which is very helpful! I agree with most of the points we discussed above, and we need to improve Parquet and sometimes even speed up to catch up with industry changes.
>> > > >
>> > > > With that said, we need people to work on it, as Julien mentioned. The document <https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit> that Micah created covers pretty much everything we discussed here. I encourage all of us to contribute by raising questions, providing suggestions, adding missing functionality, etc. Once we reach a consensus on each topic, we can create different tracks and working streams to kick off the implementations.
>> > > >
>> > > > I believe continuously improving Parquet would benefit the industry more than creating a new format, which could add friction. These improvement ideas are exciting opportunities. If you, your team members, or friends have time and interest, please encourage them to contribute.
>> > > >
>> > > > Our Parquet community meeting is next week, on May 28, 2024. We can have discussions there if you can join. Currently, it is scheduled for 7:00 am PDT, but I can change it according to the majority's availability.
>> > > >
>> > > > On Fri, May 17, 2024 at 3:58 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > I've discussed with my colleagues and we would dedicate two engineers for 4-6 months on tasks related to implementing the format changes. We're already active in design discussions and can help with C++, Rust and C# implementations. I thought it'd be good to state this explicitly FWIW.
>> > > > >
>> > > > > Our main areas of interest are efficient reads for tables with wide schemas and faster random rowgroup access [1].
>> > > > >
>> > > > > To work around the wide schemas issue we actually implemented an internal tool [2] for storing index information into a separate file, which allows for reading only the necessary subset of metadata. We would offer this for consideration as a possible approach to solve the wide schema problem.
>> > > > >
>> > > > > [1] https://github.com/apache/arrow/issues/39676
>> > > > > [2] https://github.com/G-Research/PalletJack
>> > > > >
>> > > > > Rok
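A toy illustration of that side-car metadata index idea (the general shape only, not PalletJack's actual layout): record where each column chunk's metadata lives so that a reader with a narrow projection deserializes just those byte ranges instead of the whole footer.

use std::collections::HashMap;
use std::ops::Range;

// Maps each column chunk to the byte range of its metadata inside the footer.
struct MetadataIndex {
    // (row group, column name) -> byte range of that column's metadata.
    column_meta: HashMap<(usize, String), Range<u64>>,
}

impl MetadataIndex {
    /// Byte ranges of footer metadata needed for `columns` in `row_group`.
    fn ranges_for(&self, row_group: usize, columns: &[&str]) -> Vec<Range<u64>> {
        columns
            .iter()
            .filter_map(|c| self.column_meta.get(&(row_group, c.to_string())).cloned())
            .collect()
    }
}

fn main() {
    // Imagine a 10,000-column schema with a multi-megabyte footer; only two
    // index entries are shown here.
    let mut column_meta = HashMap::new();
    column_meta.insert((0, "col_0042".to_string()), 1_000u64..1_250);
    column_meta.insert((0, "col_7731".to_string()), 940_000..940_250);
    let index = MetadataIndex { column_meta };

    // A two-column projection touches ~500 bytes of metadata, not the whole footer.
    let ranges = index.ranges_for(0, &["col_0042", "col_7731"]);
    assert_eq!(ranges.len(), 2);
}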
>> > > > >
>> > > > > On Sun, May 12, 2024 at 12:59 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> > > > >
>> > > > > > Hi Parquet Dev,
>> > > > > > I wanted to start a conversation within the community about working on a new revision of Parquet. For context there have been a bunch of new formats [1][2][3] that show there is decent room for improvement across data encodings and how metadata is organized.
>> > > > > >
>> > > > > > Specifically, in a new format revision I think we should be thinking about the following areas for improvements:
>> > > > > > 1. More efficient encodings that allow for data skipping and SIMD optimizations.
>> > > > > > 2. More efficient metadata handling for deserialization and projection to address areas where metadata deserialization time is not trivial [4].
>> > > > > > 3. Possibly thinking about different encodings instead of repetition/definition for repeated and nested fields.
>> > > > > > 4. Support for optimizing semi-structured data (e.g. JSON or Variant type) that can shred elements into individual columns (a recent thread in Iceberg mentions doing this at the metadata level [5]).
>> > > > > >
>> > > > > > I think the goals of V3 would be to provide existing API compatibility as broadly as possible (possibly with some performance loss) and expose new API surface areas where appropriate to make use of new elements. New encodings could be backported so they can be made use of without metadata changes. I think unfortunately that for points 2 and 3 we would want to break file level compatibility. More thought would be needed to consider whether 4 could be backported effectively.
>> > > > > >
>> > > > > > This is a non-trivial amount of work to get good coverage across implementations, so before putting together a more formal proposal it would be nice to know:
>> > > > > >
>> > > > > > 1. If there is an appetite in the general community to consider these changes
>> > > > > > 2. If anybody from the community is interested in collaborating on proposals/implementation in this area.
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Micah
>> > > > > >
>> > > > > > [1] https://github.com/maxi-k/btrblocks
>> > > > > > [2] https://github.com/facebookincubator/nimble
>> > > > > > [3] https://blog.lancedb.com/lance-v2/
>> > > > > > [4] https://github.com/apache/arrow/issues/39676
>> > > > > > [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
>> > > >
>> > > > --
>> > > > Xinli Shang