tustvold commented on issue #2444:
URL: https://github.com/apache/arrow-rs/issues/2444#issuecomment-1214405621

   Thank you for the example. I can reproduce this. The parquet file was produced with the following base64-encoded arrow schema
   
   ```
   
"/////zABAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAYAAAAAAASABgACAAGAAcADAAAABAAFAASAAAAAAABDxQAAADUAAAACAAAABQAAAAAAAAABAAAAHV1aWQAAAAAAgAAAFQAAAAEAAAAvP///wgAAAAgAAAAFAAAAEFSUk9XOmV4dGVuc2lvbjpuYW1lAAAAABcAAABhcnJvdy5weV9leHRlbnNpb25fdHlwZQAIAAwABAAIAAgAAAAIAAAAJAAAABgAAABBUlJPVzpleHRlbnNpb246bWV0YWRhdGEAAAAAJwAAAIAElRwAAAAAAAAAjAhfX21haW5fX5SMCFV1aWRUeXBllJOUKVKULgAAAAYACAAEAAYAAAAQAAAA"
   ```
   
   I took this and converted it to its raw data with
   
   ```
   echo "..." | base64 --decode > /tmp/output.bin
   ```
   
   I trimmed the first 8 bytes (as they're arrow specific)
   
   ```
   tail -c +9 /tmp/output.bin > /tmp/output-trimmed.bin
   ```
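
As a rough sketch of what those 8 bytes contain: assuming the prefix follows the Arrow IPC encapsulated message layout (a 4-byte `0xFFFFFFFF` continuation marker followed by a 4-byte little-endian metadata length), it can be inspected as below. `PREFIX_B64` is only the first 12 characters of the schema string quoted above, and `split_prefix` is a hypothetical helper name, not part of any Arrow library.

```python
import base64
import struct

# First 12 characters of the base64 schema string quoted above. They decode
# to 9 bytes: the 8-byte prefix plus the first byte of the flatbuffer.
PREFIX_B64 = "/////zABAAAQ"


def split_prefix(raw: bytes):
    """Split off the assumed 8-byte IPC prefix (hypothetical helper).

    Returns (continuation marker, metadata length, remaining bytes).
    """
    # "<Ii": little-endian unsigned 32-bit marker, then signed 32-bit length.
    continuation, metadata_len = struct.unpack_from("<Ii", raw, 0)
    return continuation, metadata_len, raw[8:]


raw = base64.b64decode(PREFIX_B64)
continuation, metadata_len, rest = split_prefix(raw)
print(hex(continuation), metadata_len, rest.hex())
```

Under that assumption the `tail -c +9` step does the same thing as `raw[8:]`: it drops the marker and length so that the remaining bytes start at the flatbuffer itself.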
   
   I then decoded the flatbuffer
   
   ```
   flatc --json --raw-binary format/Message.fbs -- /tmp/output-trimmed.bin
   ```
   
   And got the error
   
   ```
     Unable to generate text for output-trimmed
   ```
   
   This fits with the error that the Rust implementation is returning: the schema has non-UTF-8 data encoded in a string field, which is technically illegal.
   
   If I tell flatc to ignore this
   
   ```
   flatc --allow-non-utf8 --json --raw-binary format/Message.fbs -- /tmp/output-trimmed.bin
   ```
   
   I get the decoded data
   
   ```
   $ cat output-trimmed.json 
   {
     version: "V5",
     header_type: "Schema",
     header: {
       fields: [
         {
           name: "uuid",
           nullable: true,
           type_type: "FixedSizeBinary",
           type: {
             byteWidth: 16
           },
           children: [
   
           ],
           custom_metadata: [
             {
               key: "ARROW:extension:metadata",
               value: "\x80\u0004\x95\u001C\u0000\u0000\u0000\u0000\u0000\u0000\u0000\x8C\b__main__\x94\x8C\bUuidType\x94\x93\x94)R\x94."
             },
             {
               key: "ARROW:extension:name",
               value: "arrow.py_extension_type"
             }
           ]
         }
       ]
     }
   }
   ```
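
The `ARROW:extension:metadata` value in this output looks like a Python pickle rather than text. A minimal sketch for checking that, assuming the escaped value transcribes to the byte string `VALUE` below: it tests whether the bytes are valid UTF-8 and lists the pickle opcodes with the standard-library `pickletools`, which walks the byte stream without unpickling (so nothing from `__main__` is imported or executed). `is_valid_utf8` is a hypothetical helper name.

```python
import pickletools

# Bytes of the ARROW:extension:metadata value, transcribed from the
# flatc JSON output above (an assumption about how the escapes map to bytes).
VALUE = (
    b"\x80\x04\x95\x1c\x00\x00\x00\x00\x00\x00\x00"
    b"\x8c\x08__main__\x94\x8c\x08UuidType\x94\x93\x94)R\x94."
)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# genops only parses the opcode stream; it never runs the pickle.
opcode_names = [op.name for op, _arg, _pos in pickletools.genops(VALUE)]

print("valid UTF-8:", is_valid_utf8(VALUE))
print("pickle opcodes:", opcode_names)
```

The leading `0x80` byte is already enough to fail UTF-8 validation, since it can only appear as a continuation byte, never at the start of a character.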
   
   I'm not really sure what to make of this. KeyValue is defined as
   
   ```
   table KeyValue {
     key: string;
     value: string;
   }
   ```
   
   As per the [flatbuffer spec](https://google.github.io/flatbuffers/flatbuffers_guide_writing_schema.html), both of these string fields must be valid UTF-8. I will try to get some clarity on what is going on here. My understanding of the specification is that the Rust implementation is correct to refuse this schema...


