wgtmac commented on PR #35825:
URL: https://github.com/apache/arrow/pull/35825#issuecomment-1590317114

   > I have modified your test case to first allocate a raw C buffer, it does 
not crash on string creation anymore, but some other issues appear along the 
way. The first I found is that it tries to use the 32 bit StringBuilder, in 
which the length on `StringBuilder::Append` ends up overflowing.
   > 
   > The template type in `ArrayFromVector` can be modified to use 
`LargeStringType`, in this way it does not overflow. But then reaches "final 
error": `'writer->WriteTable(*table)' failed with Invalid: Parquet cannot store 
strings with size 2GB or more`
   
   Yes, that probably means we need to either modify the writer code so that it 
produces binary values Parquet accepts, or slightly change the data being written.
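
   For context, here is a minimal sketch of the offset arithmetic behind the 
overflow mentioned in the quoted comment. It assumes only that `StringType` 
addresses its data with signed 32-bit offsets and `LargeStringType` with signed 
64-bit offsets. The helper names are hypothetical and are not Arrow APIs, and the 
exact limit the builder enforces may differ slightly from `INT32_MAX`.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of Arrow): returns true if `total_bytes` of
// string data can be addressed with signed 32-bit offsets, the offset width
// assumed here for StringType / StringBuilder.
constexpr bool FitsInt32Offsets(int64_t total_bytes) {
  return total_bytes >= 0 &&
         total_bytes <= std::numeric_limits<int32_t>::max();
}

// Hypothetical helper (not part of Arrow): same check for signed 64-bit
// offsets, the offset width assumed here for LargeStringType. Any
// non-negative int64_t byte count is addressable.
constexpr bool FitsInt64Offsets(int64_t total_bytes) {
  return total_bytes >= 0;
}
```

   A single 2 GiB value (2147483648 bytes) fails the first check but passes the 
second, which matches the observation that switching to `LargeStringType` avoids 
the builder overflow while the Parquet writer still rejects the value.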
   
   Considering the effort needed to create a test case, I think it is reasonable 
to prepare a test file in parquet-testing (so that other implementations get the 
chance to verify that they can read it). But I think we should cover more 
types: at least `string`, `list<string>` and `map<string,int>`, each in both 
dictionary-encoded and plain-encoded form?

