etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318565796
##########
src/main/thrift/parquet.thrift:
##########
@@ -191,6 +191,73 @@ enum FieldRepetitionType {
REPEATED = 2;
}
+/**
+ * A histogram of repetition and definition levels for either a page or column
+ * chunk.
+ *
+ * This is useful for:
+ * 1. Estimating the size of the data when materialized in memory.
+ *
+ * 2. Filter push-down on nulls at various levels of nested structures
+ *    and list lengths.
+ */
+struct RepetitionDefinitionLevelHistogram {
+ /**
+ * When present, there is expected to be one element corresponding to each
+ * repetition level (i.e. size = max_repetition_level + 1) where each element
+ * represents the number of times the repetition level was observed in the
+ * data.
+ *
+ * This field may be omitted if max_repetition_level is 0.
+ **/
+ 1: optional list<i64> repetition_level_histogram;
+ /**
+ * Same as repetition_level_histogram except for definition levels.
+ *
+ * This field may be omitted if max_definition_level is 0 or 1.
+ **/
+ 2: optional list<i64> definition_level_histogram;
+ }
+
+/**
+ * A structure for capturing metadata for estimating the unencoded,
+ * uncompressed size of data written. This is useful for readers to estimate
+ * how much memory is needed to reconstruct data in their memory model and for
+ * fine grained filter pushdown on nested structures (the histogram contained
+ * in this structure can help determine the number of nulls at a particular
+ * nesting level).
+ *
+ * Writers should populate all fields in this struct except for the exceptions
+ * listed per field.
+ */
+struct SizeStatistics {
+ /**
+ * The number of physical bytes stored for BYTE_ARRAY data values assuming
+ * no encoding. This is exclusive of the bytes needed to store the length of
+ * each byte array. In other words, this field is equivalent to the `(size
+ * of PLAIN-ENCODING the byte array values) - (4 bytes * number of values
+ * written)`. To determine unencoded sizes of other types readers can use
+ * schema information multiplied by the number of non-null and null values.
+ * The number of null/non-null values can be inferred from the histograms
+ * below.
+ *
+ * For example, if a column chunk is dictionary-encoded with dictionary
+ * ["a", "bc", "cde"], and a data page contains the indices [0, 0, 1, 2],
+ * then this value for that data page should be 7 (1 + 1 + 2 + 3).
+ *
+ * This field should only be set for types that use BYTE_ARRAY as their
+ * physical type.
+ */
+ 1: optional i64 unencoded_byte_array_data_bytes;
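
To make the two definitions above concrete, here is a small Python sketch (not
part of the PR; the function and variable names are illustrative) that derives
null/non-null counts from a definition-level histogram for a flat column and
reproduces the dictionary-page example from the field comment (unencoded size
1 + 1 + 2 + 3 = 7):

```python
# Sketch: interpreting the proposed fields for a flat (non-nested) column.
# Assumes max_definition_level = 1, so definition level 0 means "null" and
# 1 means "non-null". Names here are illustrative, not from the spec.

def null_counts_from_histogram(definition_level_histogram, max_definition_level):
    """Return (nulls, non_nulls) for a flat column from its histogram."""
    non_nulls = definition_level_histogram[max_definition_level]
    nulls = sum(definition_level_histogram[:max_definition_level])
    return nulls, non_nulls

def unencoded_byte_array_bytes(dictionary, indices):
    """Unencoded BYTE_ARRAY size of a dictionary-encoded data page:
    the sum of the value lengths, excluding 4-byte length prefixes."""
    return sum(len(dictionary[i]) for i in indices)

# Example from the spec comment: dictionary ["a", "bc", "cde"] and
# page indices [0, 0, 1, 2] -> 1 + 1 + 2 + 3 = 7 bytes.
size = unencoded_byte_array_bytes(["a", "bc", "cde"], [0, 0, 1, 2])

# A histogram [2, 8] with max_definition_level = 1 means 2 nulls, 8 values.
nulls, non_nulls = null_counts_from_histogram([2, 8], 1)
```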
Review Comment:
> Only if you intend to read the entire page in one go? Are there readers
that do this?
@tustvold Yes, there are :) I am currently using a library called libcudf to
process parquet files on a GPU. To make efficient use of the GPU's resources,
you find yourself processing at the scale of thousands of pages at a time. As
pages are decoded, the data is written to GPU-RAM-resident dataframes patterned
on Arrow. Due to the random order in which pages might be processed, the
storage for these dataframes needs to be allocated up front, and the page data
written to the correct location. For fixed-width data this isn't such a heavy
lift, but for byte arrays you need to know exactly how much data you'll have
when fully materialized in order to do this.

When I first started using the library, the smallest unit of work was the row
group. Needless to say, it is easy for a row group to explode in size,
rendering it impossible to fully materialize in the constrained environment of
a GPU. A batched reader has recently been added to the library which will read
row ranges that are known to fit in RAM, but as currently implemented it needs
to do a pass through the data to get precise per-page sizing information to be
able to figure out the row boundaries that will fit.

For dictionary-encoded byte arrays, it is necessary to fully decode the
indices to find the length of each record in the page and sum these lengths to
get the unencoded page size. For DELTA_BYTE_ARRAY one needs to decode the
prefix and suffix lengths fully to get the same information. For PLAIN
encoding, assuming you need to read the entire page, you can simply subtract
`4 * num_values` from the page size to get the materialized size, but to read
a partial page you have to step through encoded lengths, skipping the inline
byte data.

In addition to this, I am using libcudf to perform compactions on sets of
sorted parquet files that comprise a log-structured merge tree. Here too, we
cannot fit all of the files in GPU RAM for processing, so we have to process
each of the input files in stripes, where each stripe is a range on the keying
column(s). Knowing the unencoded page sizes up front is a must to determine
the most efficient key ranges to operate over.
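
The per-encoding bookkeeping described above can be sketched roughly as
follows for the PLAIN case (helper names are made up for illustration; a
PLAIN-encoded BYTE_ARRAY value is a 4-byte little-endian length followed by
the bytes):

```python
import struct

# Rough sketch of what a reader must do today to learn a PLAIN-encoded
# page's unencoded BYTE_ARRAY size; names here are illustrative.

def plain_page_unencoded_size(page_bytes, num_values):
    """If the whole page is read: total page size minus the 4-byte
    length prefix stored before each value."""
    return len(page_bytes) - 4 * num_values

def plain_walk_lengths(page_bytes, num_values):
    """To size a partial page you must instead step through each
    <4-byte length><bytes> pair, skipping the inline byte data."""
    offset, total = 0, 0
    for _ in range(num_values):
        (length,) = struct.unpack_from("<i", page_bytes, offset)
        total += length
        offset += 4 + length
    return total

# Build a PLAIN-encoded page for the values ["a", "bc", "cde"].
page = b"".join(struct.pack("<i", len(v)) + v.encode() for v in ["a", "bc", "cde"])

full = plain_page_unencoded_size(page, 3)  # 18 bytes - 3 * 4 prefixes = 6
walked = plain_walk_lengths(page, 3)       # 1 + 2 + 3 = 6
```

With the proposed `unencoded_byte_array_data_bytes` field, both of these passes
become a single metadata lookup.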
All this is to say that in my case, having the
`unencoded_byte_array_data_bytes` field available at the page level is a
lifesaver in terms of reducing the amount of preprocessing work that needs to
be done to efficiently read parquet on the GPU.

As an aside, I'll mention that the first thing I wanted to add to libcudf when
I became aware of it was the ability to write the Column and Offset indexes,
since my project makes heavy use of them. Paradoxically, when I tried to use
the indexes to do page pruning, I found that while I could reduce the amount
of file data read dramatically, my overall execution times did not decrease
much...this was due to the fact that it takes roughly as much time to decode
1000 pages as it does a few dozen. Much of that is due to the fact that with a
20000 row limit on data page size, you can wind up with data pages in the 10s
of kB range, but dictionaries in the megabytes. You can decompress a lot of
small data pages in the time it takes to handle one large dictionary page.
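
For what it's worth, once per-page materialized sizes are known up front, the
batching decision described above reduces to a simple greedy scan. A sketch
with made-up sizes and budget, not libcudf's actual implementation:

```python
# Sketch: split a run of pages into batches whose materialized size fits a
# memory budget. Sizes and budget are illustrative; this is not libcudf code.

def batch_pages(page_sizes, budget):
    """Greedily group consecutive page indices so each batch's total stays
    within the budget; an oversized single page gets its own batch."""
    batches, current, current_size = [], [], 0
    for i, size in enumerate(page_sizes):
        if current and current_size + size > budget:
            batches.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        batches.append(current)
    return batches

batches = batch_pages([40, 30, 50, 20, 60], budget=100)
```

Without size statistics in the footer, the `page_sizes` input to a scan like
this has to be recovered by decoding every page first.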
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]