Hi Gang,

thanks for your reply.

On 01.03.23 03:09, Gang Wu wrote:
> If at least one record in the beginning 20000 rows is not null, then the
> encoded size will be much better.

That is the workaround I have been using for the past few weeks, although my tests show that at least two non-null values are required, not just one.
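
To make the condition concrete, here is a minimal sketch of a pre-write check on a column held in memory. The helper name is hypothetical, the 20000-row window comes from the quoted reply, and the threshold of two is only what the tests mentioned above suggested; none of this is parquet-mr API.

```python
from typing import Optional, Sequence

PAGE_DECISION_ROWS = 20000  # row count at which the encoding is decided (from the quoted reply)
MIN_NON_NULL = 2            # assumption: threshold suggested by the tests mentioned above


def has_enough_leading_values(column: Sequence[Optional[object]],
                              rows: int = PAGE_DECISION_ROWS,
                              min_non_null: int = MIN_NON_NULL) -> bool:
    """Return True if the first `rows` entries hold at least `min_non_null` non-null values."""
    seen = 0
    for value in column[:rows]:
        if value is not None:
            seen += 1
            if seen >= min_non_null:
                return True
    return False
```

A column that fails this check is one where the workaround would be needed before writing.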

> 3. If dictionary encoding is in effect, the first page must be a dictionary
> page followed by a set of data pages that are only indices of the dictionary.
[...]
> 5. By default, the parquet-mr implementation has to decide the encoding of a
> page when it reaches 20000 records.

I agree that this is at the core of the problem. The question is whether this can be changed to allow for better encoding decisions in the scenario I described. An all-null page contains just definition and (possibly) repetition levels and no value entries, so there is no need to choose a value encoding for it yet. What are the reasons for forcing the dictionary page to be the first page?
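
To illustrate what I mean, here is a toy model of the deferred decision. It is purely hypothetical: the function name and page labels are made up for illustration, and it does not reflect how parquet-mr works today or what the format permits. It only shows the idea that leading all-null pages could be written without committing to a value encoding.

```python
from typing import List, Optional

PAGE_ROWS = 20000  # page boundary from point 5 of the quoted reply


def plan_pages(column: List[Optional[str]], page_rows: int = PAGE_ROWS) -> List[str]:
    """Label each page, deferring the value-encoding choice past leading all-null pages."""
    plan: List[str] = []
    encoding_chosen = False
    for start in range(0, len(column), page_rows):
        page = column[start:start + page_rows]
        if not encoding_chosen and all(v is None for v in page):
            # Only definition/repetition levels are present, so nothing forces a choice yet.
            plan.append("levels-only page (encoding undecided)")
        elif not encoding_chosen:
            # First page with actual values: the encoding is chosen with real data in hand.
            plan.append("encoding decided here")
            encoding_chosen = True
        else:
            plan.append("data page")
    return plan
```

With a page size of 5, a column of five nulls followed by two values and three nulls yields one levels-only page and then the page where the encoding is decided.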

Kind Regards

Patrick
