Hi,

We have a Parquet file with more than 1000 columns of nested types, and the
columns are sparse: most column values in each row are null.
When writing the Parquet file, performance is very slow on the CPU. A profiler
shows that MessageColumnIORecordConsumer.writeNull is called recursively, and
the number of invocations grows by roughly 35x at each level of the recursion.

The following code in MessageColumnIO.java shows where the problem could be:


    private void writeNull(ColumnIO undefinedField, int r, int d) {
      if (undefinedField.getType().isPrimitive()) {
        columnWriter[((PrimitiveColumnIO) undefinedField).getId()].writeNull(r, d);
      } else {
        GroupColumnIO groupColumnIO = (GroupColumnIO) undefinedField;
        int childrenCount = groupColumnIO.getChildrenCount();
        for (int i = 0; i < childrenCount; i++) {
          writeNull(groupColumnIO.getChild(i), r, d);
        }
      }
    }


The recursive call inside the loop appears to be what causes the explosion in
the number of invocations: every missing group fans out into one writeNull call
per child, all the way down to the primitive leaves.
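
To make the magnitude concrete, here is a small stand-alone model (not Parquet
code; the uniform branching factor and depth are assumptions purely for
illustration) that counts how many writeNull invocations the recursion above
produces for a single missing field:

    // Rough stand-alone model of the fan-out (not Parquet code): if every
    // missing group has `branching` children and the nesting is `depth`
    // levels deep, one missing field produces on the order of
    // branching^depth leaf-level writeNull calls.
    public class NullFanOut {

      static long countInvocations(int branching, int depth) {
        if (depth == 0) {
          return 1;               // a primitive leaf: one ColumnWriter.writeNull
        }
        long calls = 1;           // the call on the group itself
        for (int i = 0; i < branching; i++) {
          calls += countInvocations(branching, depth - 1);
        }
        return calls;
      }

      public static void main(String[] args) {
        // ~35 children per group, matching the 35x growth the profiler shows.
        System.out.println(countInvocations(35, 1)); // 36
        System.out.println(countInvocations(35, 2)); // 1261
        System.out.println(countInvocations(35, 3)); // 44136
      }
    }

With ~35 children per group and two or three levels of nesting, a single
missing top-level field already triggers thousands to tens of thousands of
leaf-level null writes per row.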

My question is: since writeNull is only called for a field that is missing at a
given level, and all of its descendants are therefore known to be missing, with
their count known from the schema, could there be a more efficient way to store
this information than writing a missing indicator for every one of the
descendants?
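
To sketch the kind of shortcut I mean on the writer side (hypothetical and
untested, built only from the fields and methods already visible in the quoted
code; leafIdsByGroup is a made-up cache name, and it would need the usual
java.util imports), the primitive leaf ids under each group could be
precomputed once from the schema so that the recursive tree walk becomes a
flat loop:

    // Hypothetical sketch only (untested): cache the primitive leaf ids under
    // each group so a missing group writes its nulls in one flat loop instead
    // of a recursive tree walk. leafIdsByGroup is a made-up field name.
    private final Map<GroupColumnIO, int[]> leafIdsByGroup = new HashMap<>();

    private int[] leafIdsOf(GroupColumnIO group) {
      return leafIdsByGroup.computeIfAbsent(group, g -> {
        List<Integer> ids = new ArrayList<>();
        collectLeafIds(g, ids);
        int[] result = new int[ids.size()];
        for (int i = 0; i < result.length; i++) {
          result[i] = ids.get(i);
        }
        return result;
      });
    }

    private void collectLeafIds(ColumnIO field, List<Integer> ids) {
      if (field.getType().isPrimitive()) {
        ids.add(((PrimitiveColumnIO) field).getId());
      } else {
        GroupColumnIO group = (GroupColumnIO) field;
        for (int i = 0; i < group.getChildrenCount(); i++) {
          collectLeafIds(group.getChild(i), ids);
        }
      }
    }

    private void writeNull(ColumnIO undefinedField, int r, int d) {
      if (undefinedField.getType().isPrimitive()) {
        columnWriter[((PrimitiveColumnIO) undefinedField).getId()].writeNull(r, d);
      } else {
        // The same r and d apply to every leaf under the missing group, so
        // the cached leaf ids can be replayed directly.
        for (int id : leafIdsOf((GroupColumnIO) undefinedField)) {
          columnWriter[id].writeNull(r, d);
        }
      }
    }

That would only cut the tree-walk overhead, though; each leaf's null indicator
would still be written, which is really what I am asking about.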

Or is there a workaround to avoid this "trap" for now?


Thanks for any help!
