raulcd commented on PR #88:
URL: https://github.com/apache/parquet-testing/pull/88#issuecomment-3035467268

   That's a good question. It seems there's no truncation in those cases:
   ```
   $ java -jar parquet-cli/target/parquet-cli-1.16.0-SNAPSHOT-runtime.jar cat /home/raulcd/code/parquet_truncate_file_generator/binary_truncated_min_max.parquet
   {"a": "Blart Versenwald III", "b": "Blart Versenwald III"}
   {"a": "Alice Johnson", "b": "Alice Johnson"}
   {"a": "Bob Smith", "b": "Bob Smith"}
   {"a": "Charlie Brown", "b": "Charlie Brown"}
   {"a": "Diana Prince", "b": "Diana Prince"}
   {"a": "Edward Norton", "b": "Edward Norton"}
   {"a": "Fiona Apple", "b": "Fiona Apple"}
   {"a": "George Lucas", "b": "George Lucas"}
   {"a": "Helen Keller", "b": "Helen Keller"}
   {"a": "Ivan Drago", "b": "Ivan Drago"}
   {"a": "Julia Roberts", "b": "Julia Roberts"}
   {"a": "🚀Kevin Bacon", "b": "ð\u009F\u009A\u0080Kevin Bacon"}
   {"a": "FFFF_binary", "b": "ÿÿ\u0001\u0002"}
   ```
   and
   ```
   $ java -jar parquet-cli/target/parquet-cli-1.16.0-SNAPSHOT-runtime.jar meta /home/raulcd/code/parquet_truncate_file_generator/binary_truncated_min_max.parquet
   
   File path:  /home/raulcd/code/parquet_truncate_file_generator/binary_truncated_min_max.parquet
   Created by: parquet-rs version 55.1.0
   Properties:
     ARROW:schema: /////5QAAAAQAAAAAAAKAAwACgAJAAQACgAAABAAAAAAAQQACAAIAAAABAAIAAAABAAAAAIAAAA8AAAABAAAANz///8UAAAADAAAAAAAAAQMAAAAAAAAAMz///8BAAAAYgAAABAAFAAQAAAADwAEAAAACAAQAAAAGAAAAAwAAAAAAAAFEAAAAAAAAAAEAAQABAAAAAEAAABhAAAA
   Schema:
   message arrow_schema {
     required binary a (STRING);
     required binary b;
   }
   
   
   Row group 0:  count: 13  40.62 B records  start: 4  total(compressed): 528 B total(uncompressed):528 B 
   
--------------------------------------------------------------------------------
      type      encodings count     avg size   nulls   min / max
   a  BINARY    _ BB_     13        21.00 B    0       "Al" / "🚀Kevin Bacon"
   b  BINARY    _ BB_     13        19.62 B    0       "0x416C" / "0xFFFF0102"
   
   ```
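
To illustrate why the min shows up as "Al" / 0x416C while both max values are left whole, here is a minimal sketch of prefix-based min/max truncation. It is not the parquet-rs implementation: the function names are hypothetical, and the 2-byte limit is an assumption inferred from the "Al" min above.

```rust
/// Truncate a min value: any prefix of the value is still a valid lower bound.
fn truncate_min(value: &[u8], len: usize) -> Vec<u8> {
    value[..value.len().min(len)].to_vec()
}

/// Truncate a max value: take the prefix, then increment it so it remains an
/// upper bound. Returns None when every byte of the prefix is 0xFF, because
/// no shorter upper bound exists in that case.
fn truncate_max(value: &[u8], len: usize) -> Option<Vec<u8>> {
    if value.len() <= len {
        return Some(value.to_vec());
    }
    let mut prefix = value[..len].to_vec();
    // Drop trailing 0xFF bytes, then bump the last byte that can be incremented.
    while let Some(&last) = prefix.last() {
        if last == 0xFF {
            prefix.pop();
        } else {
            *prefix.last_mut().unwrap() += 1;
            return Some(prefix);
        }
    }
    None
}

/// For STRING columns the cut has to land on a UTF-8 character boundary.
/// Returns None when no non-empty prefix of at most `len` bytes is valid UTF-8.
fn truncate_utf8_prefix(s: &str, len: usize) -> Option<&str> {
    let mut end = len.min(s.len());
    while end > 0 && !s.is_char_boundary(end) {
        end -= 1;
    }
    if end == 0 { None } else { Some(&s[..end]) }
}

fn main() {
    // Assumed truncation length of 2 bytes.
    println!("{:?}", truncate_min(b"Alice Johnson", 2));
    println!("{:?}", truncate_max(&[0xFF, 0xFF, 0x01, 0x02], 2));
    println!("{:?}", truncate_utf8_prefix("🚀Kevin Bacon", 2));
}
```

Under these assumptions the binary max 0xFFFF0102 cannot be shortened because its 2-byte prefix cannot be incremented, and the string max cannot be shortened because the rocket emoji occupies 4 bytes, so no character boundary falls within the first 2 bytes.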
   The only change to the Rust file-generation snippet for this result was:
   ```
   $ git diff
   diff --git a/src/main.rs b/src/main.rs
   index 7bfbee1..7d58d53 100644
   --- a/src/main.rs
   +++ b/src/main.rs
   @@ -26,12 +26,15 @@ fn main() -> Result<(), Box<dyn std::error::Error>> {
            "Helen Keller",
            "Ivan Drago",
            "Julia Roberts",
   -        "Kevin Bacon"
   +        "🚀Kevin Bacon"
        ];
   -    let raw_binary_values: Vec<Vec<u8>> = raw_string_values
   +    let mut raw_binary_values: Vec<Vec<u8>> = raw_string_values
            .iter()
            .map(|s| s.as_bytes().to_vec())
            .collect();
   +
   +    // Add binary data starting with 0xFFFF
   +    raw_binary_values.push(vec![0xFF, 0xFF, 0x01, 0x02]);
        let raw_binary_value_refs = raw_binary_values
            .iter()
            .map(|x| x.as_slice())
   ```
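
As a side note on the `cat` output above, column `b` has no STRING annotation, so its bytes appear to be rendered one byte per character rather than decoded as UTF-8. That is an inference from the output, not something confirmed against the parquet-cli source. A minimal sketch of that rendering, with a hypothetical helper name:

```rust
/// Render raw bytes one byte per character (ISO-8859-1 style), which is
/// assumed here to match how the unannotated binary column is displayed.
fn bytes_as_latin1(bytes: &[u8]) -> String {
    // `u8 as char` maps each byte to the Unicode code point of the same value.
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    // The rocket emoji is the 4 UTF-8 bytes F0 9F 9A 80.
    let rocket = "🚀Kevin Bacon".as_bytes();
    println!("{:?}", bytes_as_latin1(rocket));

    // The extra binary value pushed in the diff above.
    let raw: [u8; 4] = [0xFF, 0xFF, 0x01, 0x02];
    println!("{:?}", bytes_as_latin1(&raw));
}
```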


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

