Hello,

Some Parquet C++ users have reported that they cannot load some Parquet
files anymore when Thrift 0.14 is being used:
https://issues.apache.org/jira/browse/ARROW-13655

Due to a vulnerability, the Thrift 0.14 libraries introduced a new
setting that limits the maximum message size that can be read, with a
default value that's probably large enough for most uses of Thrift,
but is too small for some Parquet files that exist in the wild.
(see https://issues.apache.org/jira/browse/THRIFT-5237)

So far I haven't found any details about the vulnerability itself.
The official CVE entry is unfortunately terse:
"""
In Apache Thrift 0.9.3 to 0.13.0, malicious RPC clients could send
short messages which would result in a large memory allocation,
potentially leading to denial of service.
"""
https://nvd.nist.gov/vuln/detail/CVE-2020-13949
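For what it's worth, here is my reading of what the CVE describes, as a
small Python sketch (the framing, names and limit below are illustrative,
not Thrift's actual code): a tiny message that merely *declares* a huge
length can force a huge allocation, and a size cap rejects it up front.

```python
import struct

# Hypothetical cap, similar in spirit to the limit Thrift 0.14 introduced.
MAX_MESSAGE_SIZE = 100 * 1024 * 1024

def read_framed_message(buf, max_size=MAX_MESSAGE_SIZE):
    # A framed transport prefixes each message with a 4-byte big-endian length.
    declared = struct.unpack(">I", buf[:4])[0]
    # Without this check, a 4-byte message declaring a ~4 GiB length would
    # trigger an equally large allocation below -- the DoS the CVE hints at.
    if declared > max_size:
        raise ValueError("declared size %d exceeds limit %d" % (declared, max_size))
    payload = bytearray(declared)          # allocate what the message claims
    payload[:len(buf) - 4] = buf[4:]       # copy whatever bytes actually arrived
    return payload

# A legitimate small message is read normally:
ok = struct.pack(">I", 5) + b"hello"
# A 4-byte "bomb" claiming a ~4 GiB payload is rejected by the cap:
bomb = struct.pack(">I", 0xFFFFFFFF)
```

Again, this is only my interpretation of the (terse) CVE text, not a
description of Thrift's internals.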


My initial fix is simply to remove the limitation.  It is based on
the interpretation that the "message size" is just the encoded size
of a Thrift payload.  Since we load the Thrift message entirely into
memory from the Parquet file, based on what the Parquet metadata says,
the size recorded inside the Thrift message itself shouldn't
be a problem.  But of course, that feels a bit unsatisfactory
(I cannot say for sure whether a problem exists or not).
https://github.com/apache/arrow/pull/11123

Which approach did other implementations take? Did you expose the
maximum message size as (yet another) setting that the user can change
depending on their files?


Note that the possibility of producing a denial of service using
Parquet files isn't new.  Since Parquet supports compression algorithms
(such as ZLIB, etc.), it should be easy to produce a decompression bomb
where a small compressed payload would expand to a very large memory
area.
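
The decompression-bomb point is easy to demonstrate with a few lines of
Python (the sizes below are arbitrary, just to show the ratio; a careful
reader can also cap the expansion using zlib's incremental API):

```python
import zlib

# 16 MiB of zeros compresses down to a tiny payload: the classic
# decompression-bomb shape (sizes chosen only for illustration).
raw = b"\x00" * (16 * 1024 * 1024)
compressed = zlib.compress(raw, 9)
ratio = len(raw) / len(compressed)   # typically several hundred to one

# A defensive decoder can bound memory use with the incremental API:
d = zlib.decompressobj()
chunk = d.decompress(compressed, 1024 * 1024)  # never hand back more than 1 MiB
```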

Regards

Antoine.

