etseidl commented on code in PR #9700:
URL: https://github.com/apache/arrow-rs/pull/9700#discussion_r3096368085


##########
parquet/src/arrow/arrow_writer/byte_array.rs:
##########
@@ -582,7 +603,15 @@ where
     }
 
     match &mut encoder.dict_encoder {
-        Some(dict_encoder) => dict_encoder.encode(values, indices),
+        Some(dict_encoder) => {
+            dict_encoder.encode(values, indices);
+            if let Some(counter) = encoder.plain_data_size_counter.as_mut() {
+                for idx in indices {

Review Comment:
   I'm a little worried about performance here. It would be nice if after we've 
collected enough samples and decided on dict vs fallback, we stop gathering 
these statistics.



##########
parquet/src/column/writer/mod.rs:
##########
@@ -746,18 +755,35 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, 
E> {
 
     /// Returns true if we need to fall back to non-dictionary encoding.
     ///
-    /// We can only fall back if dictionary encoder is set and we have 
exceeded dictionary
-    /// size.
-    #[inline]
+    /// The behavior is governed by the `dictionary_fallback` column property.
     fn should_dict_fallback(&self) -> bool {
-        match self.encoder.estimated_dict_page_size() {
-            Some(size) => {
-                size >= self
-                    .props
-                    .column_dictionary_page_size_limit(self.descr.path())
+        let dict_size = match self.encoder.estimated_dict_page_size() {
+            Some(size) => size,
+            None => return false,
+        };
+
+        // First check: dictionary size exceeds limit
+        if dict_size
+            >= self
+                .props
+                .column_dictionary_page_size_limit(self.descr.path())
+        {
+            return true;
+        }
+
+        // Second check, if enabled: the compression heuristic.
+        // For similar logic in parquet-java,
+        // see DictionaryValuesWriter.isCompressionSatisfying
+        if self.encoder.num_values() >= self.dict_fallback_sample_len {

Review Comment:
   `encoder.num_values()` will only be how many values are currently buffered, 
and will be reset when the page is flushed. I'm afraid a large value for the 
sample len will result in this never evaluating true.
   
   I think instead much of the data should live within the plain counter. It 
should know the cumulative size of plain encoded data, as it currently does, 
but it should also maintain the count of values added, and a cumulative value 
for number of bytes of RLE encoded data (perhaps update this at page flush 
time). Then feed in the current size of the dictionary to get make the 
determination.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to