Thank you for your prompt response, Sergio.
I've looked at parquet-tools and found some sample code
in DumpCommand that can be leveraged for this purpose.
Specifically, we could pass a List<ColumnDescriptor> of the columns of
interest to the ParquetFileReader constructor. For instance:
ParquetFileReader freader = new ParquetFileReader(
    conf, meta.getFileMetaData(), inpath, blocks, columns);
PageReadStore store = freader.readNextRowGroup();
long page = 1;
long total = blocks.size();
long offset = 1;
while (store != null) {
    // Wrap the row group's pages so values can be read column by column
    ColumnReadStoreImpl crstore = new ColumnReadStoreImpl(
        store, new DumpGroupConverter(), schema,
        meta.getFileMetaData().getCreatedBy());
    for (ColumnDescriptor column : columns) {
        dump(out, crstore, column, page, total, offset);
    }
    offset += store.getRowCount();
    store = freader.readNextRowGroup();
    page++;
}
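
For reference, the supporting variables above could be prepared along these
lines (a sketch based on DumpCommand; conf and inpath are assumed from the
surrounding code, and filtering on a field named "a" is only an illustration):

ParquetMetadata meta = ParquetFileReader.readFooter(conf, inpath);
MessageType schema = meta.getFileMetaData().getSchema();
List<BlockMetaData> blocks = meta.getBlocks();
List<ColumnDescriptor> columns = new ArrayList<ColumnDescriptor>();
for (ColumnDescriptor d : schema.getColumns()) {
    if ("a".equals(d.getPath()[0])) {  // keep only the columns of interest
        columns.add(d);
    }
}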
My question is: which approach is recommended? Going through the
ParquetReader interface, or through ParquetFileReader and
ColumnReadStoreImpl?
Thanks,
Thanh
On Wed, Aug 5, 2015 at 11:34 AM, Sergio Pena <[email protected]>
wrote:
> Hi Thanh,
>
> I've used the "parquet.read.schema" variable to select the columns I want
> Parquet to read from my file. This variable must be set on the
> Configuration object you pass to the ParquetReader. For instance:
>
> Configuration configuration = new Configuration();
> configuration.set("parquet.read.schema",
>     "message request_schema { required int32 a; }");
> ParquetReader<Group> reader = ParquetReader
>     .builder(new GroupReadSupport(), parquetFile)
>     .withConf(configuration)
>     .build();
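>
> Each call to read() will then return a record containing only the requested
> column. A rough sketch of iterating over the file (assuming the schema
> above, where "a" is an int32):
>
> Group record = reader.read();
> while (record != null) {
>     System.out.println(record.getInteger("a", 0));  // only "a" was read
>     record = reader.read();
> }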
>
> However, if you have your own read support class that inherits from
> ReadSupport, then you should pass the requested schema to the ReadContext
> object returned from the overridden init() method. This is how Hive works:
>
> public class DataWritableReadSupport extends ReadSupport<ArrayWritable> {
>     ...
>     @Override
>     public org.apache.parquet.hadoop.api.ReadSupport.ReadContext init(
>             InitContext context) {
>         MessageType requestedSchemaByUser = MessageTypeParser.parseMessageType(
>             "message request_schema { required int32 a; }");
>         ...
>         return new ReadContext(requestedSchemaByUser, contextMetadata);
>     }
>     ...
> }
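>
> A reader built on top of such a read support class then returns only the
> requested column; roughly (a sketch, not Hive's actual wiring):
>
> ParquetReader<ArrayWritable> reader = ParquetReader
>     .builder(new DataWritableReadSupport(), parquetFile)
>     .build();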
>
> - Sergio
>
> On Wed, Aug 5, 2015 at 10:41 AM, Thanh Do <[email protected]> wrote:
>
> > Hi all,
> >
> > I am new to Parquet and looking for a way to iterate through rows
> > with selected columns. In particular, I am looking for APIs that
> > allow users to set some reading options (such as columns of interest)
> > so that the Parquet reader's read() would return records containing
> > only the selected columns.
> >
> > I know that Hive ORC provides such APIs (as in
> > http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-exec/1.2.0/org/apache/hadoop/hive/ql/io/orc/Reader.java#Reader.Options.include%28boolean%5B%5D%29)
> > and just wonder if Parquet provides a similar way to do that.
> >
> > Best,
> > Thanh
> >
>