wgtmac commented on PR #35825: URL: https://github.com/apache/arrow/pull/35825#issuecomment-1590317114
> I have modified your test case to first allocate a raw C buffer. It does not crash on string creation anymore, but some other issues appear along the way. The first I found is that it tries to use the 32-bit StringBuilder, in which the length on `StringBuilder::Append` ends up overflowing.
>
> The template type in `ArrayFromVector` can be modified to use `LargeStringType`; in this way it does not overflow. But then it reaches the "final error": `'writer->WriteTable(*table)' failed with Invalid: Parquet cannot store strings with size 2GB or more`

Yes, that probably means we need to either modify the writer code so that it writes acceptable binaries, or slightly change the data being written.

Considering the effort needed to create a test case, I think it is reasonable to prepare a test file in parquet-testing (so other implementations get the chance to verify that they can read it). But I think we should add more types: at least `string`, `list<string>` and `map<string,int>`, in both dictionary-encoded and plain-encoded form?
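
To illustrate the overflow described in the quoted text, here is a minimal standalone sketch of why a layout with 32-bit offsets cannot hold this data while a 64-bit one can. It is not Arrow code: `kMaxInt32Offset` and `FitsInInt32Offsets` are hypothetical names, and the only assumption is that a string array with signed 32-bit offsets can address at most 2^31 - 1 bytes of character data in total.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical constant (not an Arrow symbol): the largest cumulative byte
// offset that a signed 32-bit offset can represent.
constexpr int64_t kMaxInt32Offset = std::numeric_limits<int32_t>::max();

// Hypothetical helper: returns true if values with the given byte lengths
// could be stored back to back using signed 32-bit offsets. The running
// total is kept in 64 bits, so the check itself cannot overflow.
bool FitsInInt32Offsets(const std::vector<int64_t>& value_lengths) {
  int64_t total = 0;
  for (int64_t len : value_lengths) {
    // Reject negative lengths and any value that would push the cumulative
    // offset past the 32-bit limit.
    if (len < 0 || len > kMaxInt32Offset - total) {
      return false;
    }
    total += len;
  }
  return true;
}
```

When a check like this fails, a 64-bit offset type such as `LargeStringType` avoids the overflow on the Arrow side, but, as the error message above shows, the Parquet writer can still reject the data.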
