Re: Parquet read options

Sergio Pena Wed, 05 Aug 2015 09:36:14 -0700

Hi Thanh,

I've used the "parquet.read.schema" property to select the columns I want
Parquet to read from my file. This property must be set on the
Configuration object that you pass to the ParquetReader. For instance:


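Configuration configuration = new Configuration();
// Request only column "a"; columns not listed here are not read.
configuration.set("parquet.read.schema",
    "message request_schema { required int32 a; }");
// parquetFile is the org.apache.hadoop.fs.Path of the file to read.
ParquetReader<Group> reader =
    ParquetReader.builder(new GroupReadSupport(), parquetFile)
        .withConf(configuration)
        .build();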
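
If the columns of interest are only known at run time, the schema string above can be assembled rather than hard-coded. The following is a minimal sketch in plain Java. ProjectionSchema and build are hypothetical names of my own, not part of the Parquet API, and the sketch assumes each field declaration you pass in matches the corresponding field in the file's schema.

```java
import java.util.List;

// Hypothetical helper, not part of Parquet: assembles the projection string
// that is passed as the "parquet.read.schema" value.
public class ProjectionSchema {

    // Each entry in fields is a full field declaration, e.g. "required int32 a".
    // Assumption: the declarations match the fields in the file's own schema.
    public static String build(String messageName, List<String> fields) {
        StringBuilder sb = new StringBuilder("message ").append(messageName).append(" {");
        for (String field : fields) {
            sb.append(' ').append(field.trim()).append(';');
        }
        return sb.append(" }").toString();
    }

    public static void main(String[] args) {
        // Rebuilds the schema string used in the example above.
        System.out.println(build("request_schema", List.of("required int32 a")));
    }
}
```

The result would then be passed in place of the literal, for example configuration.set("parquet.read.schema", ProjectionSchema.build("request_schema", fields)).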

However, if you have your own read support class that inherits from
ReadSupport, then you should pass the requested schema to the ReadContext
object that you return from the overridden init() method. This is how
Hive works:

public class DataWritableReadSupport extends ReadSupport<ArrayWritable> {
...
  @Override
  public org.apache.parquet.hadoop.api.ReadSupport.ReadContext init(InitContext context) {
    // Parse the projection: only the columns named here are requested.
    MessageType requestedSchemaByUser = MessageTypeParser.parseMessageType(
        "message request_schema { required int32 a; }");
    ...
    return new ReadContext(requestedSchemaByUser, contextMetadata);
  }
...
}

- Sergio

On Wed, Aug 5, 2015 at 10:41 AM, Thanh Do <[email protected]> wrote:

> Hi all,
>
> I am new to Parquet and looking for a way to iterate through rows
> with selected columns. In particular, I am looking for APIs that
> allow users to set some reading options (such as columns of interest)
> so that ParquetReader's read() would return a record containing only the
> selected columns.
>
> I know that Hive ORC provides such APIs
> (as in here:
>
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-exec/1.2.0/org/apache/hadoop/hive/ql/io/orc/Reader.java#Reader.Options.include%28boolean%5B%5D%29
> )
> and just wonder if Parquet provides a similar way to do that.
>
> Best,
> Thanh
>
