Hi Ryan,

Thanks for following up. This all makes sense now.

Thanh

On Wed, Aug 5, 2015 at 12:33 PM, Ryan Blue <[email protected]> wrote:

> Hi Thanh,
>
> This varies a bit by object model to fit whatever makes sense for what
> you're using. The Avro support, for example, allows you to pass an Avro
> schema with the subset of columns you're interested in:
>
>     AvroReadSupport.setRequestedProjection(conf, avroSchema);
>
> Thrift, on the other hand, doesn't have a schema object you can change
> like Avro, so it allows you to pass a filter that will be applied to
> columns:
>
>     ThriftReadSupport.setProjectionPushdown(conf, "persons/id;persons/email");
>
> In the end, these will set the property that Sergio pointed you to, but I
> recommend using the right one for the object model you're working with.
>
> rb
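As a concrete illustration of the Avro route Ryan describes, here is a
minimal sketch. It assumes the file was written with parquet-avro and that
its schema is a record named "Person" with at least "id" and "email"
fields; the record name, field names, and path are all illustrative and
must be adapted to your file (the projection's record name should match
the writer schema's name for Avro schema resolution to work).

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.avro.AvroReadSupport;
    import org.apache.parquet.hadoop.ParquetReader;

    public class AvroProjectedRead {
      public static void main(String[] args) throws Exception {
        // Projection schema: only the columns we want to read back.
        // Assumed to match a subset of the (hypothetical) writer schema.
        Schema projection = SchemaBuilder.record("Person").fields()
            .requiredLong("id")
            .requiredString("email")
            .endRecord();

        Configuration conf = new Configuration();
        AvroReadSupport.setRequestedProjection(conf, projection);

        ParquetReader<GenericRecord> reader = AvroParquetReader
            .<GenericRecord>builder(new Path("/tmp/persons.parquet"))
            .withConf(conf)
            .build();
        try {
          GenericRecord record;
          // Records come back with only the projected fields populated.
          while ((record = reader.read()) != null) {
            System.out.println(record.get("id") + "\t" + record.get("email"));
          }
        } finally {
          reader.close();
        }
      }
    }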
> On 08/05/2015 10:04 AM, Thanh Do wrote:
>
>> Thank you for your prompt response, Sergio.
>>
>> I've looked at parquet-tools and found some sample code in DumpCommand
>> that can be leveraged for this purpose.
>>
>> Specifically, we could pass a List<ColumnDescriptor> of the columns of
>> interest to the ParquetFileReader constructor. For instance:
>>
>>     freader = new ParquetFileReader(
>>         conf, meta.getFileMetaData(), inpath, blocks, columns);
>>     PageReadStore store = freader.readNextRowGroup();
>>     long page = 1;
>>     long total = blocks.size();
>>     long offset = 1;
>>     while (store != null) {
>>       ColumnReadStoreImpl crstore = new ColumnReadStoreImpl(
>>           store, new DumpGroupConverter(), schema,
>>           meta.getFileMetaData().getCreatedBy());
>>       for (ColumnDescriptor column : columns) {
>>         dump(out, crstore, column, page, total, offset);
>>       }
>>       offset += store.getRowCount();
>>       store = freader.readNextRowGroup();
>>       page++;
>>     }
>>
>> My question is: which one is recommended? The ParquetReader interface,
>> or ParquetFileReader with ColumnReadStoreImpl?
>>
>> Thanks,
>> Thanh
>>
>> On Wed, Aug 5, 2015 at 11:34 AM, Sergio Pena <[email protected]> wrote:
>>
>>> Hi Thanh,
>>>
>>> I've used the "parquet.read.schema" variable to select the columns I
>>> want Parquet to read from my file. This variable must be set on the
>>> Configuration object you pass to the ParquetReader. For instance:
>>>
>>>     Configuration configuration = new Configuration();
>>>     configuration.set("parquet.read.schema",
>>>         "message request_schema { required int32 a; }");
>>>     ParquetReader<Group> reader = ParquetReader.builder(
>>>         new GroupReadSupport(), parquetFile).withConf(configuration).build();
>>>
>>> However, if you have your own read support class that inherits from
>>> ReadSupport, then you should pass the requested schema to the
>>> ReadContext object you return from the overridden init() method. This
>>> is how Hive works:
>>>
>>>     public class DataWritableReadSupport extends ReadSupport<ArrayWritable> {
>>>       ...
>>>       @Override
>>>       public org.apache.parquet.hadoop.api.ReadSupport.ReadContext
>>>           init(InitContext context) {
>>>         MessageType requestedSchemaByUser = MessageTypeParser.parseMessageType(
>>>             "message request_schema { required int32 a; }");
>>>         ...
>>>         return new ReadContext(requestedSchemaByUser, contextMetadata);
>>>       }
>>>       ...
>>>     }
>>>
>>> - Sergio
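To round out Sergio's first snippet, a self-contained sketch of reading
the projected records with the example Group object model might look like
the following. The file path is illustrative, and "parquet.read.schema"
is assumed to be a subset of the file's actual schema.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.example.GroupReadSupport;

    public class ProjectedRead {
      public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Only column "a" is materialized; the reader skips other columns.
        configuration.set("parquet.read.schema",
            "message request_schema { required int32 a; }");

        ParquetReader<Group> reader = ParquetReader
            .builder(new GroupReadSupport(), new Path("/tmp/data.parquet"))
            .withConf(configuration)
            .build();
        try {
          Group group;
          while ((group = reader.read()) != null) {
            System.out.println(group.getInteger("a", 0));
          }
        } finally {
          reader.close();
        }
      }
    }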
>>> On Wed, Aug 5, 2015 at 10:41 AM, Thanh Do <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am new to Parquet and looking for a way to iterate through rows with
>>>> selected columns. In particular, I am looking for APIs that allow
>>>> users to set some reading options (such as the columns of interest) so
>>>> that the Parquet reader's read() would return records containing only
>>>> the selected columns.
>>>>
>>>> I know that Hive ORC provides such APIs (as in here:
>>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-exec/1.2.0/org/apache/hadoop/hive/ql/io/orc/Reader.java#Reader.Options.include%28boolean%5B%5D%29
>>>> ) and wonder whether Parquet provides a similar way to do that.
>>>>
>>>> Best,
>>>> Thanh
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
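For comparison, here is the lower-level ParquetFileReader plus
ColumnReadStoreImpl path that Thanh sketches above, made self-contained.
This is a sketch only: it assumes a required int32 column named "id", and
the file path and column name are illustrative (the definition-level check
only matters for optional columns, where some values may be null).

    import java.util.Collections;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.column.ColumnDescriptor;
    import org.apache.parquet.column.ColumnReader;
    import org.apache.parquet.column.impl.ColumnReadStoreImpl;
    import org.apache.parquet.column.page.PageReadStore;
    import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
    import org.apache.parquet.format.converter.ParquetMetadataConverter;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;
    import org.apache.parquet.schema.MessageType;

    public class SingleColumnScan {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path inpath = new Path("/tmp/persons.parquet");

        ParquetMetadata meta = ParquetFileReader.readFooter(
            conf, inpath, ParquetMetadataConverter.NO_FILTER);
        MessageType schema = meta.getFileMetaData().getSchema();
        ColumnDescriptor idColumn =
            schema.getColumnDescription(new String[] { "id" });

        // Passing a single-column list means only that column's pages
        // are read from each row group.
        ParquetFileReader reader = new ParquetFileReader(
            conf, meta.getFileMetaData(), inpath, meta.getBlocks(),
            Collections.singletonList(idColumn));
        try {
          PageReadStore store;
          while ((store = reader.readNextRowGroup()) != null) {
            ColumnReadStoreImpl crStore = new ColumnReadStoreImpl(
                store, new GroupRecordConverter(schema).getRootConverter(),
                schema, meta.getFileMetaData().getCreatedBy());
            ColumnReader creader = crStore.getColumnReader(idColumn);
            for (long i = 0, n = creader.getTotalValueCount(); i < n; i++) {
              // A value is present only when the definition level is at
              // its maximum for this column.
              if (creader.getCurrentDefinitionLevel()
                  == idColumn.getMaxDefinitionLevel()) {
                System.out.println(creader.getInteger());
              }
              creader.consume();
            }
          }
        } finally {
          reader.close();
        }
      }
    }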
