Is there a JIRA for this? It would be useful to capture the comments in the
JIRA. Note that the document itself is not commentable, as it is shared
with view-only permissions.

Some thoughts, in no particular order:
1) The page-based statistical approach is likely to run into trouble with
the encodings used for Parquet fields, especially RLE, which drastically
changes the on-disk size of a field. So pageSize/numValues is going to be
wildly inaccurate for RLE-encoded pages.
2) I'm not sure where you were going with the predicate pushdown section or
how it pertains to your proposed batch sizing.
3) Assuming you go with the average-batch-size calculation approach, are
you proposing a Parquet-scan-specific overflow implementation, or are you
planning to leverage the ResultSet loader mechanism? If the latter, it will
need to be enhanced to handle a bulk chunk as opposed to a single value at
a time. If you are not using the ResultSet loader mechanism, why not? (You
would be reinventing the wheel.)
4) Parquet page-level stats are probably not reliable. You can assume the
page size (compressed/uncompressed) and value count are accurate, but
nothing else.
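To make point 1 concrete, here is a toy sketch (plain Python, not Drill or
Parquet code; the run-length encoding and 8-bytes-per-run cost are invented
for illustration) of how an RLE-heavy page skews a pageSize/numValues
estimate of per-value width:

```python
# Hypothetical illustration: pageSize/numValues vs. actual decoded width
# when a page is dominated by one repeated value (ideal for RLE).

def rle_encode(values):
    """Toy run-length encoding: a list of [run_length, value] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1
        else:
            runs.append([1, v])
    return runs

# A page of 10,000 4-byte ints, almost all the same value.
values = [7] * 9_990 + list(range(10))
encoded = rle_encode(values)

# Assume each run costs 8 bytes on disk (toy cost model).
page_size = len(encoded) * 8
num_values = len(values)

avg_estimate = page_size / num_values  # what pageSize/numValues yields
actual_width = 4                       # in-memory width after decoding

print(avg_estimate)  # ~0.0088 bytes/value
print(actual_width)  # 4 bytes/value
```

The estimate is off by more than two orders of magnitude here, so any batch
sizing derived from it would badly overestimate how many values fit in a
memory budget.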

Also note that memory allocations by Netty greater than the 16MB chunk size
are returned to the OS when the memory is freed. Both this document and
the original document on memory fragmentation incorrectly state that such
memory is not released back to the OS. A quick thought experiment: where
would this memory go if it were not released back to the OS?

On Fri, Feb 9, 2018 at 7:12 AM, salim achouche <sachouc...@gmail.com> wrote:

> The following document
> <https://docs.google.com/document/d/1A6zFkjxnC_-9RwG4h0sI81KI5ZEvJ7HzgClCUFpB5WE/edit?ts=5a793606#>
> describes
> a proposal for enforcing batch sizing constraints (count and memory) within
> the Parquet Reader (Flat Schema). Please feel free to take a look and
> provide feedback.
>
> Thanks!
>
> Regards,
> Salim
>
