Thank you for your prompt response, Sergio.
I've looked at parquet-tools and found some sample code
in DumpCommand that can be leveraged for this purpose.
Specifically, we could pass a List<ColumnDescriptor> of the columns of
interest to the ParquetFileReader constructor. For instance:
ParquetFileReader freader = new ParquetFileReader(
    conf, meta.getFileMetaData(), inpath, blocks, columns);
PageReadStore store = freader.readNextRowGroup();
long page = 1;
long total = blocks.size();
long offset = 1;
while (store != null) {
    // Wrap the row group's pages so values can be read column by column
    ColumnReadStoreImpl crstore = new ColumnReadStoreImpl(
        store, new DumpGroupConverter(), schema,
        meta.getFileMetaData().getCreatedBy());
    for (ColumnDescriptor column : columns) {
        dump(out, crstore, column, page, total, offset);
    }
    offset += store.getRowCount();
    store = freader.readNextRowGroup();
    page++;
}
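
For reference, the supporting variables above could be prepared along these
lines (a sketch based on DumpCommand; conf and inpath are assumed from the
surrounding code, and filtering on a field named "a" is only an illustration):

ParquetMetadata meta = ParquetFileReader.readFooter(conf, inpath);
MessageType schema = meta.getFileMetaData().getSchema();
List<BlockMetaData> blocks = meta.getBlocks();
List<ColumnDescriptor> columns = new ArrayList<ColumnDescriptor>();
for (ColumnDescriptor d : schema.getColumns()) {
    if ("a".equals(d.getPath()[0])) {  // keep only the columns of interest
        columns.add(d);
    }
}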
My question is: which approach is recommended? Going through the
ParquetReader interface, or through ParquetFileReader and
ColumnReadStoreImpl?
Thanks,
Thanh
On Wed, Aug 5, 2015 at 11:34 AM, Sergio Pena <[email protected]>
wrote:
> Hi Thanh,
>
> I've used the "parquet.read.schema" variable to select the columns I want
> Parquet to read from my file. This variable must be set on the
> Configuration object you pass to the ParquetReader. For instance:
>
> Configuration configuration = new Configuration();
> configuration.set("parquet.read.schema",
>     "message request_schema { required int32 a; }");
> ParquetReader<Group> reader = ParquetReader
>     .builder(new GroupReadSupport(), parquetFile)
>     .withConf(configuration)
>     .build();
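>
> Each call to read() will then return a record containing only the requested
> column. A rough sketch of iterating over the file (assuming the schema
> above, where "a" is an int32):
>
> Group record = reader.read();
> while (record != null) {
>     System.out.println(record.getInteger("a", 0));  // only "a" was read
>     record = reader.read();
> }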
>
> However, if you have your own read support class that inherits from
> ReadSupport, then you should pass the requested schema to the ReadContext
> object returned from the overridden init() method. This is how Hive works:
>
> public class DataWritableReadSupport extends ReadSupport<ArrayWritable> {
>     ...
>     @Override
>     public org.apache.parquet.hadoop.api.ReadSupport.ReadContext init(
>             InitContext context) {
>         MessageType requestedSchemaByUser = MessageTypeParser.parseMessageType(
>             "message request_schema { required int32 a; }");
>         ...
>         return new ReadContext(requestedSchemaByUser, contextMetadata);
>     }
>     ...
> }
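>
> A reader built on top of such a read support class then returns only the
> requested column; roughly (a sketch, not Hive's actual wiring):
>
> ParquetReader<ArrayWritable> reader = ParquetReader
>     .builder(new DataWritableReadSupport(), parquetFile)
>     .build();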
>
> - Sergio
>
> On Wed, Aug 5, 2015 at 10:41 AM, Thanh Do <[email protected]> wrote:
>
> > Hi all,
> >
> > I am new to Parquet and looking for a way to iterate through rows
> > with selected columns. In particular, I am looking for APIs that
> > allow users to set some reading options (such as columns of interest)
> > so that the Parquet reader's read() would return records containing
> > only the selected columns.
> >
> > I know that Hive ORC provides such APIs (as in
> > http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-exec/1.2.0/org/apache/hadoop/hive/ql/io/orc/Reader.java#Reader.Options.include%28boolean%5B%5D%29)
> > and just wonder if Parquet provides a similar way to do that.
> >
> > Best,
> > Thanh
> >
>