Hi Gang,
thanks for your reply.
On 01.03.23 03:09, Gang Wu wrote:
> If at least one record in the beginning 20000 rows is not null, then the
> encoded size will be much better.
That is the workaround I have been using for the past few weeks, although
my tests show that at least two non-null values are required.
> 3. If dictionary encoding is in effect, the first page must be a dictionary
> page followed by a set of data pages that are only indices of the dictionary.
> [...]
> 5. By default, the parquet-mr implementation has to decide the encoding of a
> page when it reaches 20000 records.
I agree that this is at the core of the problem; the question is, can
this be changed to allow for better encoding decisions in the scenario I
described? An all-null page contains just definition and (possibly)
repetition levels, no value entries, so there is no need to choose their
encoding yet. What are the reasons for forcing the dictionary to be the
first page?
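
To make the question concrete, here is a toy sketch of the two
strategies. It is not parquet-mr code: every name in it is hypothetical,
the byte sizes are made up, and the rule "keep the dictionary only if
dictionary plus indices is strictly smaller than the plain encoding" is
an assumption about the shape of the check, not a statement of what the
writer actually does.

```python
from typing import List, Optional

Page = List[Optional[str]]

# Hypothetical byte costs, chosen only for illustration.
VALUE_SIZE = 8   # bytes per plainly encoded value
INDEX_SIZE = 1   # bytes per dictionary index


def dictionary_pays_off(page: Page) -> bool:
    """Assumed rule: keep the dictionary only if it is strictly smaller."""
    values = [v for v in page if v is not None]  # nulls carry no value entry
    plain_size = len(values) * VALUE_SIZE
    dict_size = len(set(values)) * VALUE_SIZE
    index_size = len(values) * INDEX_SIZE
    return dict_size + index_size < plain_size


def decide_on_first_page(pages: List[Page]) -> str:
    """Eager: the encoding is fixed as soon as the first page is finished."""
    return "dictionary" if dictionary_pays_off(pages[0]) else "plain"


def decide_on_first_values(pages: List[Page]) -> str:
    """Deferred: skip all-null pages, decide on the first page with values."""
    for page in pages:
        if any(v is not None for v in page):
            return "dictionary" if dictionary_pays_off(page) else "plain"
    return "undecided"  # the column holds only nulls


# One all-null page followed by one page of a single repeated value.
pages = [[None] * 20000, ["x"] * 20000]
print(decide_on_first_page(pages))    # plain
print(decide_on_first_values(pages))  # dictionary
```

With these made-up sizes, a page with a single non-null value does not
pay off (8 + 1 is not less than 8), while two equal values do (8 + 2 is
less than 16). That would be consistent with my observation above if the
real check has a similar shape, but I have not verified that.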
Kind Regards
Patrick