Hi Antoine,

I don't have much to add; I just hate getting no replies on the dev list.
Parquet-mr doesn't have a release with Thrift 0.14+ yet. (The latest
release, 1.12.1, went out with 0.13.0.) I don't know how common a >100MB file
footer is. Since we read the whole footer into memory at once and pass it to
Thrift to parse, this does not seem to be a memory allocation issue but a CPU
processing issue. If the footer is too large, parsing it might require
significant effort. But if it is a valid Parquet file, we have to do it
anyway; and if it is not, we don't know anything without parsing the footer.
Also, I am not a Thrift expert, but I think that if the purpose is to forge a
Thrift structure to overload the CPU, a simple size limit would not be enough.
So I would say we should raise this limit to the maximum (~2GB in Java)
if we think 100MB is not enough.
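
To illustrate why the footer size is already known before Thrift sees any
bytes, here is a minimal sketch in Python. It is not parquet-mr or Arrow code.
It assumes the standard unencrypted layout, where the file ends with a 4-byte
little-endian footer length followed by the "PAR1" magic. The function name
footer_length and the limit parameter are hypothetical, made up for this
example.

```python
import struct

MAGIC = b"PAR1"
INT32_MAX = 2**31 - 1  # roughly the 2GB ceiling mentioned above for Java


def footer_length(file_bytes: bytes, limit: int = INT32_MAX) -> int:
    """Return the Thrift-encoded footer length declared in the file tail.

    Hypothetical helper: validates the declared length against the file
    size and a configurable cap before any Thrift parsing would happen.
    """
    # Smallest possible file: leading magic + footer length + trailing magic.
    if len(file_bytes) < 12:
        raise ValueError("too small to be a Parquet file")
    if file_bytes[-4:] != MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    # The 4 bytes before the trailing magic hold the footer length.
    (length,) = struct.unpack("<I", file_bytes[-8:-4])
    if length > len(file_bytes) - 12:
        raise ValueError("declared footer length exceeds file size")
    if length > limit:
        raise ValueError("footer larger than configured limit")
    return length
```

A check like this bounds how many bytes are read into memory, but it says
nothing about how expensive those bytes are to decode, which is why a size
limit alone probably does not protect against a forged structure.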

Cheers,
Gabor

On Mon, Sep 27, 2021 at 11:43 AM Antoine Pitrou <[email protected]> wrote:

>
> Ping.  Does nobody really have any experience with or opinion on this?
>
> I'm going to assume it's ok to disable this security check.
>
> Regards
>
> Antoine.
>
>
>
> On Thu, 16 Sep 2021 17:54:23 +0200
> Antoine Pitrou <[email protected]> wrote:
> > Hello,
> >
> > On Mon, 13 Sep 2021 16:08:19 +0200
> > Antoine Pitrou <[email protected]> wrote:
> > >
> > > My initial fix is to simply remove the limitation.  That is based on
> > > the interpretation that the "message size" is simply the encoded size
> > > of a Thrift payload.  Since we load the Thrift message entirely in
> > > memory from the Parquet file, based on what the Parquet metadata says,
> > > the fact that another size is recorded in the Thrift message shouldn't
> > > ideally be a problem.  But of course, that feels a bit unsatisfactory
> > > (I cannot say for sure whether a problem exists or not).
> > > https://github.com/apache/arrow/pull/11123
> >
> > I'm following up now that I've read through the relevant Thrift C++
> > transport implementations.  I'm reasonably convinced that my analysis
> > is correct, as the max message size applies to encoded Thrift bytes,
> > and we already know the encoded size.  I still hope to receive an answer
> > from the Thrift community on
> > https://issues.apache.org/jira/browse/THRIFT-5237.
> >
> > Did nobody experience this issue with other Parquet implementations?
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>
>
>
>
