Thanks. I tried this.

    val projection: Seq[column.ColumnDescriptor] = .... // filter the columns I want from the schema

    val projectionBuilder = Types.buildMessage()
    for (col <- projection) {
      projectionBuilder.addField(Types.buildMessage().named(col.getPath.head))
    }
    r.setRequestedSchema(projectionBuilder.named("tbd"))

This fails when reading the file, with the error "[some_col_name] optional
int64 some_col_name is not in the store", where "some_col_name" is not part
of my projection.

Any idea what I need to do next?

Thanks,

Andy.
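
For reference, here is a minimal sketch of one way the projection could be built with the Types builders. It assumes the wanted columns are top-level fields of the file schema and reuses their existing definitions instead of creating new, empty message types. The object name ProjectionSketch, the helper project, and the example columns (id, name, amount) are hypothetical and used only for illustration; the schema string stands in for the schema read from the file footer.

```scala
import org.apache.parquet.schema.{MessageType, MessageTypeParser, Types}
import scala.collection.JavaConverters._

object ProjectionSketch {

  // Builds a projection holding only the named top-level fields, reusing the
  // field definitions (type, repetition, nested children) from the file schema.
  def project(fileSchema: MessageType, wanted: Set[String]): MessageType = {
    val builder = Types.buildMessage()
    for (field <- fileSchema.getFields.asScala if wanted.contains(field.getName)) {
      builder.addField(field)
    }
    builder.named(fileSchema.getName)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for footer.getFileMetaData.getSchema.
    val fileSchema = MessageTypeParser.parseMessageType(
      """message spark_schema {
        |  required int64 id;
        |  optional binary name;
        |  optional double amount;
        |}""".stripMargin)

    val projection = project(fileSchema, Set("id", "amount"))
    println(projection)
    // With a real reader, the next step would be: r.setRequestedSchema(projection)
  }
}
```

This is only a guess at the cause, but if the code that assembles records is still driven by the full file schema, it may ask for columns the reader never loaded, so it is probably worth driving that code from the same projected schema.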

On 4/13/18, 12:08 PM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:

    I'd suggest using the Types builders to create your projection schema
    (MessageType), then passing that schema to the
    ParquetFileReader.setRequestedSchema method you found.
    
    On Fri, Apr 13, 2018 at 10:40 AM, Andy Grove <andy.gr...@rms.com> wrote:
    
    > Hi Ryan,
    >
    > I'm writing some low-level performance tests to try to find a bottleneck
    > on our platform. I have intentionally excluded Spark/Thrift/Presto etc.
    > and want to test Parquet directly, both with local files and against our
    > HDFS cluster, to get performance metrics. Our Parquet files were created
    > by Spark and contain schema metadata.
    >
    > Here is my code for opening the file:
    >
    >     val footer = ParquetFileReader.open(file, options)
    >     val schema = footer.getFileMetaData.getSchema
    >     val r = new ParquetFileReader(file, options)
    >
    > I can call schema.getColumns and see all of the column definitions.
    >
    > I have my query working fine but it is reading all the columns and I want
    > to push down the projection so it only reads the 5 columns I need.
    >
    > I see that there are some versions of the ParquetFileReader constructors
    > that accept a List[ColumnDescriptor] and I did try that but ran into
    > errors.
    >
    > What would you suggest?
    >
    > Thanks,
    >
    > Andy.
    >
    >
    > On 4/13/18, 11:34 AM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:
    >
    >     Andy, what object model are you using to read? Usually you don't have
    >     a list of column descriptors, you have an Avro read schema or a Thrift
    >     class or something.
    >
    >     On Fri, Apr 13, 2018 at 10:31 AM, Andy Grove <andy.gr...@rms.com> wrote:
    >
    >     > Hi,
    >     >
    >     > I’m trying to read a parquet file with a projection from Scala and I
    >     > can’t find docs or examples for the correct way to do this.
    >     >
    >     > I have the file schema and have filtered for the list of columns I
    >     > need, so I have a List of ColumnDescriptors.
    >     >
    >     > It looks like I should call ParquetFileReader.setRequestedSchema()
    >     > but I can’t find an example of constructing the required MessageType
    >     > parameter.
    >     >
    >     > I’d appreciate any pointers on what to do next.
    >     >
    >     > Thanks,
    >     >
    >     > Andy.
    >     >
    >     >
    >     >
    >
    >
    >     --
    >     Ryan Blue
    >     Software Engineer
    >     Netflix
    >
    >
    >
    
    
    -- 
    Ryan Blue
    Software Engineer
    Netflix
    
