Hi Ryan,

Thanks for following up. This all makes sense now.

Thanh

On Wed, Aug 5, 2015 at 12:33 PM, Ryan Blue <[email protected]> wrote:

> Hi Thanh,
>
> This varies a bit by object model to fit whatever makes sense for what
> you're using. The Avro support, for example, allows you to pass an Avro
> schema with the subset of columns you're interested in:
>
>     AvroReadSupport.setRequestedProjection(conf, avroSchema);
>
> Thrift, on the other hand, doesn't have a schema object you can change
> like Avro, so it allows you to pass a filter that will be applied to
> columns:
>
>     ThriftReadSupport.setProjectionPushdown(conf, "persons/id;persons/email");
>
> In the end, these will set the property that Sergio pointed you to, but I
> recommend using the right one for the object model you're working with.
>
> rb
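As a concrete illustration of the Avro route Ryan describes, here is a
minimal sketch. It assumes the file was written with parquet-avro and that
its schema is a record named "Person" with at least "id" and "email"
fields; the record name, field names, and path are all illustrative and
must be adapted to your file (the projection's record name should match
the writer schema's name for Avro schema resolution to work).

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.avro.AvroReadSupport;
    import org.apache.parquet.hadoop.ParquetReader;

    public class AvroProjectedRead {
      public static void main(String[] args) throws Exception {
        // Projection schema: only the columns we want to read back.
        // Assumed to match a subset of the (hypothetical) writer schema.
        Schema projection = SchemaBuilder.record("Person").fields()
            .requiredLong("id")
            .requiredString("email")
            .endRecord();

        Configuration conf = new Configuration();
        AvroReadSupport.setRequestedProjection(conf, projection);

        ParquetReader<GenericRecord> reader = AvroParquetReader
            .<GenericRecord>builder(new Path("/tmp/persons.parquet"))
            .withConf(conf)
            .build();
        try {
          GenericRecord record;
          // Records come back with only the projected fields populated.
          while ((record = reader.read()) != null) {
            System.out.println(record.get("id") + "\t" + record.get("email"));
          }
        } finally {
          reader.close();
        }
      }
    }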
> On 08/05/2015 10:04 AM, Thanh Do wrote:
>
>> Thank you for your prompt response, Sergio.
>>
>> I've looked at parquet-tools and found some sample code in DumpCommand
>> that can be leveraged for this purpose.
>>
>> Specifically, we could pass a List<ColumnDescriptor> of the columns of
>> interest to the ParquetFileReader constructor. For instance:
>>
>>     freader = new ParquetFileReader(
>>         conf, meta.getFileMetaData(), inpath, blocks, columns);
>>     PageReadStore store = freader.readNextRowGroup();
>>     long page = 1;
>>     long total = blocks.size();
>>     long offset = 1;
>>     while (store != null) {
>>       ColumnReadStoreImpl crstore = new ColumnReadStoreImpl(
>>           store, new DumpGroupConverter(), schema,
>>           meta.getFileMetaData().getCreatedBy());
>>       for (ColumnDescriptor column : columns) {
>>         dump(out, crstore, column, page, total, offset);
>>       }
>>       offset += store.getRowCount();
>>       store = freader.readNextRowGroup();
>>       page++;
>>     }
>>
>> My question is: which one is recommended? The ParquetReader interface,
>> or ParquetFileReader with ColumnReadStoreImpl?
>>
>> Thanks,
>> Thanh
>>
>> On Wed, Aug 5, 2015 at 11:34 AM, Sergio Pena <[email protected]> wrote:
>>
>>> Hi Thanh,
>>>
>>> I've used the "parquet.read.schema" variable to select the columns I
>>> want Parquet to read from my file. This variable must be set on the
>>> Configuration object you pass to the ParquetReader. For instance:
>>>
>>>     Configuration configuration = new Configuration();
>>>     configuration.set("parquet.read.schema",
>>>         "message request_schema { required int32 a; }");
>>>     ParquetReader<Group> reader = ParquetReader.builder(
>>>         new GroupReadSupport(), parquetFile).withConf(configuration).build();
>>>
>>> However, if you have your own read support class that inherits from
>>> ReadSupport, then you should pass the requested schema to the
>>> ReadContext object you return from the overridden init() method. This
>>> is how Hive works:
>>>
>>>     public class DataWritableReadSupport extends ReadSupport<ArrayWritable> {
>>>       ...
>>>       @Override
>>>       public org.apache.parquet.hadoop.api.ReadSupport.ReadContext
>>>           init(InitContext context) {
>>>         MessageType requestedSchemaByUser = MessageTypeParser.parseMessageType(
>>>             "message request_schema { required int32 a; }");
>>>         ...
>>>         return new ReadContext(requestedSchemaByUser, contextMetadata);
>>>       }
>>>       ...
>>>     }
>>>
>>> - Sergio
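To round out Sergio's first snippet, a self-contained sketch of reading
the projected records with the example Group object model might look like
the following. The file path is illustrative, and "parquet.read.schema"
is assumed to be a subset of the file's actual schema.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.example.GroupReadSupport;

    public class ProjectedRead {
      public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Only column "a" is materialized; the reader skips other columns.
        configuration.set("parquet.read.schema",
            "message request_schema { required int32 a; }");

        ParquetReader<Group> reader = ParquetReader
            .builder(new GroupReadSupport(), new Path("/tmp/data.parquet"))
            .withConf(configuration)
            .build();
        try {
          Group group;
          while ((group = reader.read()) != null) {
            System.out.println(group.getInteger("a", 0));
          }
        } finally {
          reader.close();
        }
      }
    }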
>>> On Wed, Aug 5, 2015 at 10:41 AM, Thanh Do <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am new to Parquet and looking for a way to iterate through rows with
>>>> selected columns. In particular, I am looking for APIs that allow
>>>> users to set some reading options (such as the columns of interest) so
>>>> that the Parquet reader's read() would return records containing only
>>>> the selected columns.
>>>>
>>>> I know that Hive ORC provides such APIs (as in here:
>>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-exec/1.2.0/org/apache/hadoop/hive/ql/io/orc/Reader.java#Reader.Options.include%28boolean%5B%5D%29
>>>> ) and wonder whether Parquet provides a similar way to do that.
>>>>
>>>> Best,
>>>> Thanh
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
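For comparison, here is the lower-level ParquetFileReader plus
ColumnReadStoreImpl path that Thanh sketches above, made self-contained.
This is a sketch only: it assumes a required int32 column named "id", and
the file path and column name are illustrative (the definition-level check
only matters for optional columns, where some values may be null).

    import java.util.Collections;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.column.ColumnDescriptor;
    import org.apache.parquet.column.ColumnReader;
    import org.apache.parquet.column.impl.ColumnReadStoreImpl;
    import org.apache.parquet.column.page.PageReadStore;
    import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
    import org.apache.parquet.format.converter.ParquetMetadataConverter;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;
    import org.apache.parquet.schema.MessageType;

    public class SingleColumnScan {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path inpath = new Path("/tmp/persons.parquet");

        ParquetMetadata meta = ParquetFileReader.readFooter(
            conf, inpath, ParquetMetadataConverter.NO_FILTER);
        MessageType schema = meta.getFileMetaData().getSchema();
        ColumnDescriptor idColumn =
            schema.getColumnDescription(new String[] { "id" });

        // Passing a single-column list means only that column's pages
        // are read from each row group.
        ParquetFileReader reader = new ParquetFileReader(
            conf, meta.getFileMetaData(), inpath, meta.getBlocks(),
            Collections.singletonList(idColumn));
        try {
          PageReadStore store;
          while ((store = reader.readNextRowGroup()) != null) {
            ColumnReadStoreImpl crStore = new ColumnReadStoreImpl(
                store, new GroupRecordConverter(schema).getRootConverter(),
                schema, meta.getFileMetaData().getCreatedBy());
            ColumnReader creader = crStore.getColumnReader(idColumn);
            for (long i = 0, n = creader.getTotalValueCount(); i < n; i++) {
              // A value is present only when the definition level is at
              // its maximum for this column.
              if (creader.getCurrentDefinitionLevel()
                  == idColumn.getMaxDefinitionLevel()) {
                System.out.println(creader.getInteger());
              }
              creader.consume();
            }
          }
        } finally {
          reader.close();
        }
      }
    }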
