It is certainly possible to avoid the recursion and improve this. As you mentioned, the schema is known in advance. Pull requests are welcome if you want to take a stab at it.
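For what it's worth, here is a rough sketch of one way the recursion could be avoided. It is not the actual parquet-mr code: `Node`, `indexLeaves` and the `nullCounts` array are hypothetical stand-ins for `ColumnIO`, a one-time schema walk, and the `columnWriter[id].writeNull(r, d)` calls. The idea is to walk the schema once, cache the ids of all primitive descendants on each node, and then have `writeNull` iterate over that flat array.

```java
import java.util.ArrayList;
import java.util.List;

public class FlatNullWriter {

    /** Hypothetical stand-in for a schema node (ColumnIO in the quoted code). */
    static final class Node {
        final int leafId;                          // column id of a primitive leaf, -1 for a group
        final List<Node> children = new ArrayList<>();
        int[] leafIds;                             // cached ids of all primitive descendants

        Node(int leafId) { this.leafId = leafId; }

        static Node leaf(int id) { return new Node(id); }

        static Node group(Node... kids) {
            Node n = new Node(-1);
            for (Node k : kids) n.children.add(k);
            return n;
        }

        boolean isPrimitive() { return leafId >= 0; }
    }

    /** One-time schema walk: caches, on every node, the leaf ids beneath it. */
    static int[] indexLeaves(Node node) {
        if (node.isPrimitive()) {
            node.leafIds = new int[] { node.leafId };
        } else {
            List<Integer> ids = new ArrayList<>();
            for (Node child : node.children) {
                for (int id : indexLeaves(child)) ids.add(id);
            }
            node.leafIds = ids.stream().mapToInt(Integer::intValue).toArray();
        }
        return node.leafIds;
    }

    /**
     * Per-record path: a flat loop with no recursion.
     * nullCounts[id]++ stands in for columnWriter[id].writeNull(r, d).
     */
    static void writeNull(Node undefinedField, int[] nullCounts) {
        for (int id : undefinedField.leafIds) {
            nullCounts[id]++;
        }
    }

    public static void main(String[] args) {
        Node root = Node.group(Node.leaf(0), Node.group(Node.leaf(1), Node.leaf(2)));
        indexLeaves(root);                         // done once, when the schema is known
        int[] nullCounts = new int[3];
        writeNull(root, nullCounts);               // done per missing field, per record
        System.out.println(java.util.Arrays.toString(nullCounts));
    }
}
```

Note that this only removes the recursive traversal and the per-call type checks and casts; it still performs one null write per descendant leaf, so it does not change how the missing indicators are stored.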

On Tue, Oct 21, 2014 at 9:43 AM, Yan Zhou.sc <[email protected]> wrote:
> Hi,
>
> We have a Parquet file with more than 1000 columns of nested types, and
> the columns are sparse, namely most columns per row are nulls.
> When writing the Parquet, the performance is very slow on CPU. Profiler
> shows that MessageColumnIORecordConsumer.writeNull is called
> recursively and each recursion gets ever larger number of invocations by
> approximately 35X.
>
> The following code in MessageColumnIO.java shows where the problem could
> be:
>
>     private void writeNull(ColumnIO undefinedField, int r, int d) {
>       if (undefinedField.getType().isPrimitive()) {
>         columnWriter[((PrimitiveColumnIO)undefinedField).getId()].writeNull(r, d);
>       } else {
>         GroupColumnIO groupColumnIO = (GroupColumnIO)undefinedField;
>         int childrenCount = groupColumnIO.getChildrenCount();
>         for (int i = 0; i < childrenCount; i++) {
>           writeNull(groupColumnIO.getChild(i), r, d);
>         }
>       }
>     }
>
> As red marked, the recursion occurring in the loop seems to cause the
> explosion of the number of invocation calls.
>
> My question is: Since this writeNull is only called for a missing field at
> a level, and all its descendents are known to be missing and their count
> are known from schema, will there be possibly a more efficient way to store
> the information than the current store of all of the descendants' missing
> indicator?
>
> Or is there a workaround to avoid this "trap" for now?
>
> Thanks for help!
>
