etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318565796
##########
src/main/thrift/parquet.thrift:
##########
@@ -191,6 +191,73 @@ enum FieldRepetitionType {
REPEATED = 2;
}
+/**
+ * A histogram of repetition and definition levels for either a page or column
+ * chunk.
+ *
+ * This is useful for:
+ * 1. Estimating the size of the data when materialized in memory.
+ *
+ * 2. Filter push-down on nulls at various levels of nested structures
+ *    and list lengths.
+ */
+struct RepetitionDefinitionLevelHistogram {
+ /**
+ * When present, there is expected to be one element corresponding to each
+ * repetition level (i.e. size = max_repetition_level + 1) where each element
+ * represents the number of times the repetition level was observed in the
+ * data.
+ *
+ * This field may be omitted if max_repetition_level is 0.
+ **/
+ 1: optional list<i64> repetition_level_histogram;
+ /**
+ * Same as repetition_level_histogram except for definition levels.
+ *
+ * This field may be omitted if max_definition_level is 0 or 1.
+ **/
+ 2: optional list<i64> definition_level_histogram;
+ }
+
+/**
+ * A structure for capturing metadata for estimating the unencoded,
+ * uncompressed size of data written. This is useful for readers to estimate
+ * how much memory is needed to reconstruct data in their memory model and for
+ * fine grained filter pushdown on nested structures (the histogram contained
+ * in this structure can help determine the number of nulls at a particular
+ * nesting level).
+ *
+ * Writers should populate all fields in this struct except for the exceptions
+ * listed per field.
+ */
+struct SizeStatistics {
+ /**
+ * The number of physical bytes stored for BYTE_ARRAY data values assuming
+ * no encoding. This is exclusive of the bytes needed to store the length of
+ * each byte array. In other words, this field is equivalent to the `(size
+ * of PLAIN-ENCODING the byte array values) - (4 bytes * number of values
+ * written)`. To determine unencoded sizes of other types readers can use
+ * schema information multiplied by the number of non-null and null values.
+ * The number of null/non-null values can be inferred from the histograms
+ * below.
+ *
+ * For example, if a column chunk is dictionary-encoded with dictionary
+ * ["a", "bc", "cde"], and a data page contains the indices [0, 0, 1, 2],
+ * then this value for that data page should be 7 (1 + 1 + 2 + 3).
+ *
+ * This field should only be set for types that use BYTE_ARRAY as their
+ * physical type.
+ */
+ 1: optional i64 unencoded_byte_array_data_bytes;
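
To make the two definitions above concrete, here is a small Python sketch (not
part of the PR; the function and variable names are illustrative) that derives
null/non-null counts from a definition-level histogram for a flat column and
reproduces the dictionary-page example from the field comment (unencoded size
1 + 1 + 2 + 3 = 7):

```python
# Sketch: interpreting the proposed fields for a flat (non-nested) column.
# Assumes max_definition_level = 1, so definition level 0 means "null" and
# 1 means "non-null". Names here are illustrative, not from the spec.

def null_counts_from_histogram(definition_level_histogram, max_definition_level):
    """Return (nulls, non_nulls) for a flat column from its histogram."""
    non_nulls = definition_level_histogram[max_definition_level]
    nulls = sum(definition_level_histogram[:max_definition_level])
    return nulls, non_nulls

def unencoded_byte_array_bytes(dictionary, indices):
    """Unencoded BYTE_ARRAY size of a dictionary-encoded data page:
    the sum of the value lengths, excluding 4-byte length prefixes."""
    return sum(len(dictionary[i]) for i in indices)

# Example from the spec comment: dictionary ["a", "bc", "cde"] and
# page indices [0, 0, 1, 2] -> 1 + 1 + 2 + 3 = 7 bytes.
size = unencoded_byte_array_bytes(["a", "bc", "cde"], [0, 0, 1, 2])

# A histogram [2, 8] with max_definition_level = 1 means 2 nulls, 8 values.
nulls, non_nulls = null_counts_from_histogram([2, 8], 1)
```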
Review Comment:
> Only if you intend to read the entire page in one go? Are there readers
that do this?
@tustvold Yes, there are :) I am currently using a library called libcudf to
process parquet files on a GPU. To make efficient use of the GPU's resources,
you find yourself processing at the scale of thousands of pages at a time. As
pages are decoded, the data is written to GPU-RAM-resident dataframes patterned
on Arrow. Due to the random order in which pages might be processed, the
storage for these dataframes needs to be allocated up front, and the page data
written to the correct location. For fixed-width data this isn't such a heavy
lift, but for byte arrays you need to know exactly how much data you'll have
when fully materialized in order to do this.

When I first started using the library, the smallest unit of work was the row
group. Needless to say, it is easy for a row group to explode in size,
rendering it impossible to fully materialize in the constrained environment of
a GPU. A batched reader has recently been added to the library which will read
row ranges that are known to fit in RAM, but as currently implemented it needs
to do a pass through the data to get precise per-page sizing information to be
able to figure out the row boundaries that will fit.

For dictionary-encoded byte arrays, it is necessary to fully decode the
indices to find the length of each record in the page and sum these lengths to
get the unencoded page size. For DELTA_BYTE_ARRAY one needs to decode the
prefix and suffix lengths fully to get the same information. For PLAIN
encoding, assuming you need to read the entire page, you can simply subtract
`4 * num_values` from the page size to get the materialized size, but to read
a partial page you have to step through encoded lengths, skipping the inline
byte data.

In addition to this, I am using libcudf to perform compactions on sets of
sorted parquet files that comprise a log-structured merge tree. Here too, we
cannot fit all of the files in GPU RAM for processing, so we have to process
each of the input files in stripes, where each stripe is a range on the keying
column(s). Knowing the unencoded page sizes up front is a must to determine
the most efficient key ranges to operate over.
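
The per-encoding bookkeeping described above can be sketched roughly as
follows for the PLAIN case (helper names are made up for illustration; a
PLAIN-encoded BYTE_ARRAY value is a 4-byte little-endian length followed by
the bytes):

```python
import struct

# Rough sketch of what a reader must do today to learn a PLAIN-encoded
# page's unencoded BYTE_ARRAY size; names here are illustrative.

def plain_page_unencoded_size(page_bytes, num_values):
    """If the whole page is read: total page size minus the 4-byte
    length prefix stored before each value."""
    return len(page_bytes) - 4 * num_values

def plain_walk_lengths(page_bytes, num_values):
    """To size a partial page you must instead step through each
    <4-byte length><bytes> pair, skipping the inline byte data."""
    offset, total = 0, 0
    for _ in range(num_values):
        (length,) = struct.unpack_from("<i", page_bytes, offset)
        total += length
        offset += 4 + length
    return total

# Build a PLAIN-encoded page for the values ["a", "bc", "cde"].
page = b"".join(struct.pack("<i", len(v)) + v.encode() for v in ["a", "bc", "cde"])

full = plain_page_unencoded_size(page, 3)  # 18 bytes - 3 * 4 prefixes = 6
walked = plain_walk_lengths(page, 3)       # 1 + 2 + 3 = 6
```

With the proposed `unencoded_byte_array_data_bytes` field, both of these passes
become a single metadata lookup.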
All this is to say that in my case, having the
`unencoded_byte_array_data_bytes` field available at the page level is a
lifesaver in terms of reducing the amount of preprocessing work that needs to
be done to efficiently read parquet on the GPU.

As an aside, I'll mention that the first thing I wanted to add to libcudf when
I became aware of it was the ability to write the Column and Offset indexes,
since my project makes heavy use of them. Paradoxically, when I tried to use
the indexes to do page pruning, I found that while I could reduce the amount
of file data read dramatically, my overall execution times did not decrease
much...this was due to the fact that it takes roughly as much time to decode
1000 pages as it does a few dozen. Much of that is due to the fact that with a
20000 row limit on data page size, you can wind up with data pages in the 10s
of kB range, but dictionaries in the megabytes. You can decompress a lot of
small data pages in the time it takes to handle one large dictionary page.
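
For what it's worth, once per-page materialized sizes are known up front, the
batching decision described above reduces to a simple greedy scan. A sketch
with made-up sizes and budget, not libcudf's actual implementation:

```python
# Sketch: split a run of pages into batches whose materialized size fits a
# memory budget. Sizes and budget are illustrative; this is not libcudf code.

def batch_pages(page_sizes, budget):
    """Greedily group consecutive page indices so each batch's total stays
    within the budget; an oversized single page gets its own batch."""
    batches, current, current_size = [], [], 0
    for i, size in enumerate(page_sizes):
        if current and current_size + size > budget:
            batches.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        batches.append(current)
    return batches

batches = batch_pages([40, 30, 50, 20, 60], budget=100)
```

Without size statistics in the footer, the `page_sizes` input to a scan like
this has to be recovered by decoding every page first.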
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]