Hello,

Some Parquet C++ users have reported that they can no longer load certain Parquet files when Thrift 0.14 is used: https://issues.apache.org/jira/browse/ARROW-13655

In response to a vulnerability, the Thrift 0.14 libraries introduced a new setting for the maximum message size that can be read (see https://issues.apache.org/jira/browse/THRIFT-5237). Its default value is probably large enough for most uses of Thrift, but too small for some Parquet files that exist in the wild.

So far I haven't found any details about the vulnerability. The official CVE entry is unfortunately terse:

"""
In Apache Thrift 0.9.3 to 0.13.0, malicious RPC clients could send short messages which would result in a large memory allocation, potentially leading to denial of service.
"""
https://nvd.nist.gov/vuln/detail/CVE-2020-13949

My initial fix is to simply remove the limitation: https://github.com/apache/arrow/pull/11123

That is based on the interpretation that the "message size" is simply the encoded size of a Thrift payload. Since we load the Thrift message entirely into memory from the Parquet file, based on what the Parquet metadata says, the fact that another size is recorded in the Thrift message shouldn't ideally be a problem. Of course, that feels a bit unsatisfactory: I cannot say for sure whether a problem exists or not.

Which approach did other implementations take? Did you expose the maximum message size as (yet another) setting that the user can change depending on their files?

Note that the possibility of producing a denial of service using Parquet files isn't new. Since Parquet supports compression algorithms (such as ZLIB, etc.), it should be easy to produce a decompression bomb where a small compressed payload expands to a very large memory area.

Regards,

Antoine.
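P.S. As an illustration of the decompression-bomb point above, here is a minimal sketch (plain zlib, not Parquet-specific, and not the actual Arrow code) showing how a tiny compressed payload can expand to a very large memory area, and how a defensive reader can cap the expansion instead of trusting the stated size:

```python
import zlib

# A "decompression bomb": a small compressed payload that expands to a
# much larger buffer. 100 MB of zeros compresses extremely well.
original_size = 100 * 1024 * 1024
payload = zlib.compress(b"\x00" * original_size, level=9)

print(f"compressed payload: {len(payload)} bytes")
print(f"expands to:         {original_size} bytes")
print(f"expansion ratio:    {original_size // len(payload)}:1")

# A reader that naively decompresses this payload allocates the full
# 100 MB. A defensive decoder can bound the output instead, using the
# max_length argument of a decompression object:
d = zlib.decompressobj()
capped = d.decompress(payload, 1024 * 1024)  # emit at most 1 MiB
print(f"capped output:      {len(capped)} bytes")
```

The same principle applies to a Thrift message size limit: the limit is only useful if it reflects what the reader is actually willing to allocate, which is why a hard-coded default tends to be either too small for real files or too large to protect anything.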
