Hi Thanh,

This varies a bit by object model, so the mechanism fits whatever makes sense for the one you're using. The Avro support, for example, allows you to pass an Avro schema containing the subset of columns you're interested in:

  AvroReadSupport.setRequestedProjection(conf, avroSchema);
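Putting it together, a projected Avro read might look roughly like the sketch below. The file path, record name, and the id/email field names are placeholders for illustration (the field names are borrowed from the Thrift example further down); the reader returns records that contain only the projected fields.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class AvroProjectionExample {
  public static void main(String[] args) throws Exception {
    // Build an Avro schema containing only the columns we want back.
    Schema projection = SchemaBuilder.record("Person").fields()
        .requiredLong("id")
        .requiredString("email")
        .endRecord();

    // Register the projection on the Configuration that the reader will use.
    Configuration conf = new Configuration();
    AvroReadSupport.setRequestedProjection(conf, projection);

    try (ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(new Path("persons.parquet"))
        .withConf(conf)
        .build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        // Only the projected columns were materialized from the file.
        System.out.println(record.get("id") + "\t" + record.get("email"));
      }
    }
  }
}
```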

Thrift, on the other hand, doesn't have a schema object you can modify the way Avro does, so instead it allows you to pass a projection filter that will be applied to the columns:

  ThriftReadSupport.setProjectionPushdown(conf,
      "persons/id;persons/email");

In the end, these will set the property that Sergio pointed you to, but I recommend using the right one for the object model you're working with.

rb

On 08/05/2015 10:04 AM, Thanh Do wrote:
Thank you for your prompt response, Sergio.

I've looked at parquet-tools and found some sample code
in DumpCommand that can be leveraged for this purpose.

Specifically, we could pass a List&lt;ColumnDescriptor&gt; of the columns of interest
to the ParquetFileReader constructor. For instance:

  freader = new ParquetFileReader(
      conf, meta.getFileMetaData(), inpath, blocks, columns);
  PageReadStore store = freader.readNextRowGroup();
  long page = 1;
  long total = blocks.size();
  long offset = 1;
  while (store != null) {
    ColumnReadStoreImpl crstore = new ColumnReadStoreImpl(
        store, new DumpGroupConverter(), schema,
        meta.getFileMetaData().getCreatedBy());
    for (ColumnDescriptor column : columns) {
      dump(out, crstore, column, page, total, offset);
    }
    offset += store.getRowCount();
    store = freader.readNextRowGroup();
    page++;
  }

My question is: which approach is recommended? Going through the ParquetReader
interface, or through ParquetFileReader and ColumnReadStoreImpl?

Thanks,
Thanh

On Wed, Aug 5, 2015 at 11:34 AM, Sergio Pena <[email protected]>
wrote:

Hi Thanh,

I've used the "parquet.read.schema" variable to select the columns I want
Parquet to read from my file. This variable must be set on the
Configuration object you pass to the ParquetReader. For instance:

  Configuration configuration = new Configuration();
  configuration.set("parquet.read.schema",
      "message request_schema { required int32 a; }");
  ParquetReader<Group> reader = ParquetReader
      .builder(new GroupReadSupport(), parquetFile)
      .withConf(configuration)
      .build();

However, if you have your own read support class that inherits from
ReadSupport, then you should pass the requested schema to the ReadContext
object returned from the overridden init() method. This is how
Hive works:

  public class DataWritableReadSupport extends ReadSupport<ArrayWritable> {
  ...
    @Override
    public org.apache.parquet.hadoop.api.ReadSupport.ReadContext
        init(InitContext context) {
      MessageType requestedSchemaByUser = MessageTypeParser.parseMessageType(
          "message request_schema { required int32 a; }");
      ...
      return new ReadContext(requestedSchemaByUser, contextMetadata);
    }
  ...
  }

- Sergio

On Wed, Aug 5, 2015 at 10:41 AM, Thanh Do <[email protected]> wrote:

Hi all,

I am new to Parquet and looking for a way to iterate through rows
with selected columns. In particular, I am looking for APIs that
allow users to set reading options (such as the columns of interest)
so that ParquetReader's read() would return records containing only
the selected columns.

I know that Hive ORC provides such APIs
(as in here:

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-exec/1.2.0/org/apache/hadoop/hive/ql/io/orc/Reader.java#Reader.Options.include%28boolean%5B%5D%29
)
and I'm just wondering if Parquet provides a similar way to do that.

Best,
Thanh





--
Ryan Blue
Software Engineer
Cloudera, Inc.
