I’m using ParquetFileReader/ParquetPageReader to scan Parquet files and apply a
projection. This is working well for primitive column types, but I’m running
into an issue when trying to add support for arrays and could use some help.
I’m retrieving the schema like this:
val r = new ParquetFileReader(file, options)
val schema: MessageType = r.getFileMetaData.getSchema
I’m then filtering the schema on column name to get the column descriptors.
Let’s say the field I am looking for is “foo”. In the case of an array, I get
a descriptor with the path { “foo” / “list” / “element” }.
I’m building a projection like this:
val projectionBuilder = Types.buildMessage()
for (col <- projectedColumnDefs) {
  projectionBuilder.addField(col.getPrimitiveType)
}
val projectionType = projectionBuilder.named("projection")
The problem is that this projection then ends up containing a top-level field
named “element” instead of “foo”, and I end up getting null values for this
column (and valid values for the primitive columns).
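To make the mismatch concrete, here is a minimal sketch. It assumes Scala 2.13
and the parquet-column library on the classpath. The schema string, the
“wanted” set and the object name are hypothetical, made up for illustration.
It builds the projection two ways: from the leaf primitive types, as above, and
from the matching top-level fields of the file schema, which keeps the
“foo” / “list” / “element” nesting. The second variant is only a possible
direction, not a confirmed fix for the null values.

```scala
import org.apache.parquet.schema.{MessageType, MessageTypeParser, Types}
import scala.jdk.CollectionConverters._

object ProjectionSketch extends App {
  // Hypothetical file schema: one primitive column and one LIST column.
  val schema: MessageType = MessageTypeParser.parseMessageType(
    """message example {
      |  required int32 id;
      |  optional group foo (LIST) {
      |    repeated group list {
      |      optional int32 element;
      |    }
      |  }
      |}""".stripMargin)

  // Hypothetical set of top-level field names to project.
  val wanted = Set("id", "foo")

  // Column descriptors whose path starts with a wanted field name.
  val cols = schema.getColumns.asScala.filter(c => wanted.contains(c.getPath.head))

  // Variant 1: add only the leaf primitive types (the approach in this message).
  // The enclosing groups are dropped, so the list leaf becomes a top-level
  // field called "element".
  val flatBuilder = Types.buildMessage()
  cols.foreach(c => flatBuilder.addField(c.getPrimitiveType))
  println(flatBuilder.named("projection"))

  // Variant 2 (assumption): add the whole top-level field instead, so the
  // group structure under "foo" is preserved in the projection.
  val nestedBuilder = Types.buildMessage()
  schema.getFields.asScala
    .filter(f => wanted.contains(f.getName))
    .foreach(f => nestedBuilder.addField(f))
  println(nestedBuilder.named("projection"))
}
```

Printing both message types side by side shows whether the list column is
still nested under “foo” or has been flattened to “element”.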
This is how I’m applying the projection to the ParquetFileReader “r”:
r.setRequestedSchema(projectionType)
I’d appreciate some pointers on the general approach here.
Thanks,
Andy.