Thanks. I tried this.

    val projection: Seq[column.ColumnDescriptor] = .... // filter the columns I want from the schema
    val projectionBuilder = Types.buildMessage()
    for (col <- projection) {
      projectionBuilder.addField(Types.buildMessage().named(col.getPath.head))
    }
    r.setRequestedSchema(projectionBuilder.named("tbd"))

This fails when reading the file with "[some_col_name] optional int64
some_col_name is not in the store" where "some_col_name" is not part of my
projection.

Any idea what I need to do next?

Thanks,

Andy.

On 4/13/18, 12:08 PM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:

    I'd suggest using the Types builders to create your projection schema
    (MessageType), then passing that schema to the
    ParquetFileReader.setRequestedSchema method you found.

    On Fri, Apr 13, 2018 at 10:40 AM, Andy Grove <andy.gr...@rms.com> wrote:

    > Hi Ryan,
    >
    > I'm writing some low-level performance tests to try and find a bottleneck
    > on our platform and have intentionally excluded Spark/Thrift/Presto etc and
    > want to test Parquet directly both with local files and against our HDFS
    > cluster to get performance metrics. Our parquet files were created by Spark
    > and contain schema meta-data.
    >
    > Here is my code for opening the file:
    >
    >     val footer = ParquetFileReader.open(file, options)
    >     val schema = footer.getFileMetaData.getSchema
    >     val r = new ParquetFileReader(file, options)
    >
    > I can call schema.getColumns and see all of the column definitions.
    >
    > I have my query working fine but it is reading all the columns and I want
    > to push down the projection so it only reads the 5 columns I need.
    >
    > I see that there are some versions of the ParquetFileReader constructors
    > that accept a List[ColumnDescriptor] and I did try that but ran into errors.
    >
    > What would you suggest?
    >
    > Thanks,
    >
    > Andy.
    >
    > On 4/13/18, 11:34 AM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:
    >
    >     Andy, what object model are you using to read? Usually you don't have a
    >     list of column descriptors, you have an Avro read schema or a Thrift class
    >     or something.
    >
    >     On Fri, Apr 13, 2018 at 10:31 AM, Andy Grove <andy.gr...@rms.com> wrote:
    >
    >     > Hi,
    >     >
    >     > I’m trying to read a parquet file with a projection from Scala and I can’t
    >     > find docs or examples for the correct way to do this.
    >     >
    >     > I have the file schema and have filtered for the list of columns I need,
    >     > so I have a List of ColumnDescriptors.
    >     >
    >     > It looks like I should call ParquetFileReader.setRequestedSchema() but I
    >     > can’t find an example of constructing the required MessageType parameter.
    >     >
    >     > I’d appreciate any pointers on what to do next.
    >     >
    >     > Thanks,
    >     >
    >     > Andy.
    >
    >     --
    >     Ryan Blue
    >     Software Engineer
    >     Netflix

    --
    Ryan Blue
    Software Engineer
    Netflix
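
For reference, here is a minimal sketch of one way to construct the MessageType asked about in this thread. It copies the wanted top-level fields out of the file schema, so each keeps its original type and repetition, rather than creating new types by name. The object name ProjectionSketch, the helper project, and the example schema and column names are hypothetical and only for illustration. It assumes parquet-column is on the classpath and that the projection is over top-level columns. It is not a confirmed fix for the "is not in the store" error above, which may also depend on which schema the record-assembly code downstream is given.

```scala
import org.apache.parquet.schema.{MessageType, MessageTypeParser}
import scala.collection.JavaConverters._

// Hypothetical helper object, not part of Parquet.
object ProjectionSketch {

  // Keep only the top-level fields whose names are in `wanted`,
  // reusing the Type objects from the file schema unchanged.
  def project(fileSchema: MessageType, wanted: Set[String]): MessageType = {
    val kept = fileSchema.getFields.asScala.filter(f => wanted.contains(f.getName))
    new MessageType(fileSchema.getName, kept.asJava)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for footer.getFileMetaData.getSchema; column names are made up.
    val fileSchema = MessageTypeParser.parseMessageType(
      """message spark_schema {
        |  optional int64 some_col_name;
        |  optional binary name (UTF8);
        |  optional double amount;
        |}""".stripMargin)

    val projection = project(fileSchema, Set("name", "amount"))
    println(projection)

    // With a real reader `r`, the projection would then be passed as:
    //   r.setRequestedSchema(projection)
  }
}
```

Reusing the existing field objects avoids rebuilding primitive types, logical-type annotations, and repetition by hand, which is where a hand-built schema can drift from what the file actually contains.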