lyang24 commented on code in PR #9317:
URL: https://github.com/apache/arrow-rs/pull/9317#discussion_r2762727115


##########
parquet/src/arrow/array_reader/byte_view_array.rs:
##########
@@ -696,10 +701,18 @@ impl ByteViewArrayDecoderDelta {
                 Ok(())
             })?
         } else {
-            // utf8 validation buffer has only short strings. These short
-            // strings are inlined into the views but we copy them into a
-            // contiguous buffer to accelerate validation.
-            let mut utf8_validation_buffer = Vec::with_capacity(4096);
+            // Reuse the UTF-8 validation buffer - clear it but keep capacity.
+            // Short strings (≤12 bytes) are inlined into views but copied here
+            // for batch validation to accelerate UTF-8 checking.
+            //
+            // Apply a soft cap to prevent unbounded memory retention if a
+            // pathological batch caused excessive growth. 64KB is generous
+            // for short string validation while preventing memory pinning.
+            if self.utf8_validation_buffer.capacity() > 64 * 1024 {
+                self.utf8_validation_buffer = Vec::new();
+            } else {
+                self.utf8_validation_buffer.clear();
+            }

Review Comment:
   Good catch. I removed all the defensive code and instead preallocate the
buffer with a capacity of 4096 at the beginning, so the vec will never have
less than 4096 capacity.
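
   For reference, a minimal sketch of the pattern described above, not the actual change in this PR. `ShortStringValidator`, `new`, and `validate_batch` are hypothetical names used only for illustration; only the field name `utf8_validation_buffer` and the 4096 figure come from the diff. It relies on `Vec::clear` dropping the contents while keeping the allocation.

```rust
/// Hypothetical stand-in for the decoder; only the buffer handling is shown.
struct ShortStringValidator {
    /// Scratch buffer reused across batches.
    utf8_validation_buffer: Vec<u8>,
}

impl ShortStringValidator {
    fn new() -> Self {
        Self {
            // Preallocate once, so the capacity starts at no less than 4096.
            utf8_validation_buffer: Vec::with_capacity(4096),
        }
    }

    /// Copies the short strings into one contiguous buffer and validates
    /// that buffer in a single pass.
    ///
    /// Assumption of this sketch: the concatenated bytes are checked as a
    /// whole, so a code point split across two adjacent strings would not
    /// be detected here.
    fn validate_batch(&mut self, short_strings: &[&[u8]]) -> Result<(), std::str::Utf8Error> {
        // `clear` sets the length to zero but keeps the allocated capacity.
        self.utf8_validation_buffer.clear();
        for s in short_strings {
            self.utf8_validation_buffer.extend_from_slice(s);
        }
        std::str::from_utf8(&self.utf8_validation_buffer).map(|_| ())
    }
}
```

   Calling `validate_batch` repeatedly on the same value reuses the one allocation; the buffer only reallocates if a batch needs more room than the current capacity.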



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.