On 03/26/2015 12:36 PM, Daniel St. John wrote:
Hello,
I updated to the latest versions of everything in the Parquet ecosystem, and the
annotations in the message type are now coming through when reading the Parquet
file, so please excuse the last communication.
I'm glad you fixed it. Sorry that we seem to have missed that e-mail;
we're normally pretty good about replying to questions like that.
Question: Can I open a Parquet file with an instance of FSDataInputStream
instead of Path?
It doesn't look like that's possible in today's API. If you're
interested in a stream-based reader, we can help you add one to the
project. There's no reason why we shouldn't be able to build one as long
as the stream is Seekable.
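To make that concrete, the entry point would probably look something like the
sketch below. To be clear, nothing like this exists in parquet-hadoop today; the
method and its signature are hypothetical, but they show that a seekable stream
plus the file length is all the footer read really needs.

// Hypothetical sketch only -- NOT part of today's parquet-hadoop API.
// FSDataInputStream already implements Seekable, and the length is needed
// because the footer is located relative to the end of the file.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import parquet.hadoop.metadata.ParquetMetadata;

public interface StreamBasedParquetReader {
  ParquetMetadata readFooter(Configuration conf, FSDataInputStream in, long length)
      throws IOException;
}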
What I have done was inspired by the CSV-to-Parquet example on GitHub. We are
using Parquet as storage for our proprietary record format, and we also read
existing Parquet files and translate them into that format.
In short, I open a Parquet file with a File or Path, query the footer for the
message type, derive our Record schema from it using the extra info annotation,
and then read the data from Parquet into our record format one record at a time.
It is working well: records go in and come out, and the data checks out.
Here is a summary of what I am doing:
1> The message schema:
Path parquetFilePath = ..
ParquetMetadata readFooter = ParquetFileReader.readFooter(configuration, parquetFilePath);
MessageType schema = readFooter.getFileMetaData().getSchema();
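Since you derive your Record schema from the extra info annotation: that extra
info is stored in the footer's key-value metadata, which you can get from the
same FileMetaData object. The key name below is just a placeholder for whatever
your writer actually stored:

Map<String, String> keyValueMetaData = readFooter.getFileMetaData().getKeyValueMetaData();
String extraInfo = keyValueMetaData.get("my.extra.info"); // placeholder key name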
2> Then a reader:
Path path = ..
GroupReadSupport readSupport = new GroupReadSupport();
readSupport.init(configuration, null, schema);
ParquetReader<Group> reader;
try {
    reader = new ParquetReader<Group>(path, readSupport);
} catch (IOException e) {
    LOG.error("We can not create Parquet Reader " + e);
    e.printStackTrace();
    throw new ReadParquestFileException(e);
}
I don't recommend using the example object model, and I think it should be
removed from the public API because it is misleading. It's just an example of
how one would write an object model and is not intended for real use.
Anyway, what you have so far looks okay otherwise. But I seriously
recommend either writing your own object model (to go directly to/from
the objects returned by your RedPoint format) or reusing another one
that is intended for real-world use, like parquet-avro.
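For example, the read side with parquet-avro is only a few lines. This is a
minimal sketch, assuming parquet-avro is on your classpath and that your version
still uses the parquet.* package prefix:

import java.io.IOException;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import parquet.avro.AvroParquetReader;

// Read every record in the file as an Avro GenericRecord and hand it off
// to whatever converts it into your own Record type.
public static void readWithAvro(Path path) throws IOException {
  AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(path);
  try {
    GenericRecord record;
    while ((record = reader.read()) != null) {
      // convert record into your proprietary format here
    }
  } finally {
    reader.close();
  }
}

That way your translation layer only has to deal with one object model that is
actually intended for production use.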
3> Get Data sequentially:
Group group;
// my record
Record dmRecord =..
<SNIP>
return dmRecord;
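On the sequential read itself: the part you snipped usually ends up looking
roughly like the loop below. This is just a sketch; the field names are made up,
and the conversion into your Record type is whatever your code does with the
values.

// Uses the reader from step 2; read() returns null when the file is exhausted.
Group group;
while ((group = reader.read()) != null) {
  // pull typed values out by field name and repetition index (names are illustrative)
  String name = group.getString("name", 0);
  int id = group.getInteger("id", 0);
  // populate dmRecord from these values here
}
reader.close();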
Question: How do I do this with an FSDataInputStream instead of a Path? It seems
like Path is baked in. I have a requirement to work with FSDataInputStream
rather than Path or File.
You would have to build a stream reader in parquet-hadoop,
unfortunately. As I said, we can help you do that and get it into the
project.
rb
--
Ryan Blue
Software Engineer
Cloudera, Inc.