Hi Antoine,

I do not have much to add; I just hate seeing no replies on the dev list.

Parquet-mr doesn't have a release with Thrift 0.14+ yet. (The latest release, 1.12.1, went out with 0.13.0.)

I don't know how common a >100MB file footer is. Since we read the whole footer into memory at once and then pass it to Thrift for parsing, this does not seem to be a memory allocation issue but a CPU processing one. If the footer is very large, parsing might take significant effort. But if it is a valid Parquet file we have to do that anyway, and if it is not, we cannot know that without parsing the footer. Also, I am not a Thrift expert, but I think that if the purpose is to forge a Thrift structure to overload the CPU, a simple size limit would not be enough.

So, I would say we should extend this limit to the maximum (~2GB in Java) if we think 100MB is not enough.
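To make the sizes concrete, here is a rough sketch (not parquet-mr code) of how the footer length can be read from the tail of a file and compared against the two limits mentioned above. It relies only on the Parquet file layout, where the file ends with a 4-byte little-endian footer length followed by the "PAR1" magic. The class name, method names, and constants are hypothetical and chosen for illustration only.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; this is NOT part of parquet-mr or Thrift.
final class FooterLengthSketch {
    // The 100MB limit discussed in this thread (assumed here as a plain constant).
    static final long DEFAULT_LIMIT = 100L * 1024 * 1024;
    // The largest length a Java int (and hence a byte[]) can express, roughly 2GB.
    static final long JAVA_MAX = Integer.MAX_VALUE;

    // Decodes the last 8 bytes of a file: 4-byte little-endian footer length + "PAR1".
    static long footerLengthFromTail(byte[] tail) {
        if (tail.length != 8) {
            throw new IllegalArgumentException("expected exactly the last 8 bytes of the file");
        }
        String magic = new String(tail, 4, 4, StandardCharsets.US_ASCII);
        if (!"PAR1".equals(magic)) {
            throw new IllegalArgumentException("unexpected magic: " + magic);
        }
        int raw = ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        // Treat the length as unsigned so values above 2GB are reported rather than negative.
        return Integer.toUnsignedLong(raw);
    }

    // Reads the tail of the file at the given path and returns the declared footer length.
    static long footerLengthOfFile(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            // Leading magic (4) + footer length (4) + trailing magic (4) is the minimum size.
            if (file.length() < 12) {
                throw new IOException("file too small to be a Parquet file");
            }
            byte[] tail = new byte[8];
            file.seek(file.length() - 8);
            file.readFully(tail);
            return footerLengthFromTail(tail);
        }
    }

    // True if a footer of this length would be accepted under the given size limit.
    static boolean fitsLimit(long footerLength, long limit) {
        return footerLength <= limit;
    }
}
```

With this, a 200MB footer would be rejected under `DEFAULT_LIMIT` but accepted under `JAVA_MAX`, which is the difference between keeping the 100MB default and raising the limit to the maximum. Note that this only checks the declared size; it says nothing about how expensive the Thrift parsing itself is.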

Cheers,
Gabor

On Mon, Sep 27, 2021 at 11:43 AM Antoine Pitrou <[email protected]> wrote:
>
> Ping. Does nobody really have any experience with or opinion on this?
>
> I'm going to assume it's ok to disable this security check.
>
> Regards
>
> Antoine.
>
>
> On Thu, 16 Sep 2021 17:54:23 +0200
> Antoine Pitrou <[email protected]> wrote:
> > Hello,
> >
> > On Mon, 13 Sep 2021 16:08:19 +0200
> > Antoine Pitrou <[email protected]> wrote:
> > >
> > > My initial fix is to simply remove the limitation. That is based on
> > > the interpretation that the "message size" is simply the encoded size
> > > of a Thrift payload. Since we load the Thrift message entirely in
> > > memory from the Parquet file, based on what the Parquet metadata says,
> > > the fact that another size is recorded in the Thrift message shouldn't
> > > ideally be a problem. But of course, that feels a bit unsatisfactory
> > > (I cannot say for sure whether a problem exists or not).
> > > https://github.com/apache/arrow/pull/11123
> >
> > I'm following up now that I've read through the relevant Thrift C++
> > transport implementations. I'm reasonably convinced that my analysis
> > is correct, as the max message size applies to encoded Thrift bytes,
> > and we already know the encoded. I still hope to receive an answer
> > from the Thrift community on
> > https://issues.apache.org/jira/browse/THRIFT-5237.
> >
> > Did nobody experience this issue with other Parquet implementations?
> >
> > Regards
> >
> > Antoine.
