On 03/26/2015 12:36 PM, Daniel St. John wrote:
Hello,

I updated to the latest versions of everything in the Parquet ecosystem and the 
annotations in the message are now coming out when reading the Parquet file, so 
please excuse the last communication.

I'm glad you fixed it. Sorry that we seem to have missed that e-mail; we're normally pretty good about replying to questions like that.

Question: Can I open a Parquet file with an instance of FSDataInputStream 
instead of Path?

It doesn't look like that's possible in today's API. If you're interested in a stream-based reader, we can help you add one to the project. There's no reason why we shouldn't be able to build one as long as the stream is Seekable.
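
For what it's worth, FSDataInputStream already gives us the random access a footer reader would need: it implements Hadoop's Seekable (and PositionedReadable) interfaces. A rough illustration, reusing the parquetFilePath and configuration from your step 1 below:

  // FSDataInputStream implements org.apache.hadoop.fs.Seekable, so a reader
  // can jump straight to the footer at the end of the file.
  FileSystem fs = parquetFilePath.getFileSystem(configuration);
  long fileLength = fs.getFileStatus(parquetFilePath).getLen();
  FSDataInputStream in = fs.open(parquetFilePath);
  in.seek(fileLength - 8);    // last 8 bytes: footer length + "PAR1" magic
  long footerInfoStart = in.getPos();
  in.close();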

What I have done was inspired by the CSV-to-Parquet example on GitHub. We are 
using Parquet as storage for our proprietary record format, and we are also 
reading existing Parquet files and translating them into that format. In short, 
when I open a Parquet file with File or Path, I query the footer for the 
message type and, using the extra info annotation, derive our Record schema; 
then I read the data from Parquet into our record format one record at a time. 
It is working well: records go in and come out, and the data checks out. Here 
is a summary of what I am doing:

1> The message schema:

  Path parquetFilePath = ..

  // Read the footer and get the Parquet message type from the file metadata.
  ParquetMetadata readFooter = ParquetFileReader.readFooter(configuration, parquetFilePath);
  MessageType schema = readFooter.getFileMetaData().getSchema();

2> Then a reader:

  Path path = ..

  // Initialize the example GroupReadSupport with the schema from step 1.
  GroupReadSupport readSupport = new GroupReadSupport();
  readSupport.init(configuration, null, schema);

  ParquetReader<Group> reader;
  try {
    reader = new ParquetReader<Group>(path, readSupport);
  } catch (IOException e) {
    LOG.error("Cannot create Parquet reader: " + e);
    throw new ReadParquestFileException(e);
  }

I don't recommend using the example object model, and I think it should be removed from the public API because it is misleading. It's just an example of how one would write an object model and is not intended for real use.

Anyway, what you have so far looks okay otherwise. But I seriously recommend either writing your own object model (to go directly to/from the objects used by your RedPoint format) or reusing one that is intended for real-world use, like parquet-avro.
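
For example, the parquet-avro path looks roughly like this. The file path, the Record type, and the toRecord() conversion are placeholders for your own code; AvroParquetReader comes from the parquet-avro module (package parquet.avro in the current releases):

  // Read each row as an Avro GenericRecord; parquet-avro derives the Avro
  // schema from the file schema, so you only have to translate GenericRecord
  // into your own record type.
  Configuration conf = new Configuration();
  Path path = new Path("/path/to/data.parquet");    // placeholder
  AvroParquetReader<GenericRecord> reader =
      new AvroParquetReader<GenericRecord>(conf, path);
  try {
    GenericRecord avroRecord;
    while ((avroRecord = reader.read()) != null) {
      Record dmRecord = toRecord(avroRecord);        // your RedPoint conversion
      // hand dmRecord off to the rest of your pipeline
    }
  } finally {
    reader.close();
  }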

3> Get data sequentially:

  Group group;

  // my record
  Record dmRecord = ..
  <SNIP>
  return dmRecord;
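
For step 3, with the caveat above about the example model, the sequential loop usually looks something like this; the field names and the conversion into your Record are placeholders:

  // reader.read() returns the next Group, or null at the end of the input.
  Group group;
  while ((group = reader.read()) != null) {
    // Pull values out by field name and repetition index, then build your
    // own record from them (the field names here are made up).
    String name = group.getString("name", 0);
    int id = group.getInteger("id", 0);
    // ... populate dmRecord and hand it off
  }
  reader.close();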

Question: How do I do this with an FSDataInputStream instead of Path? It seems 
like Path is baked in. I have a requirement to work with FSDataInputStream 
rather than Path or File.

You would have to build a stream reader in parquet-hadoop, unfortunately. As I said, we can help you do that and get it into the project.
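
To make that concrete, the first thing such a reader has to do is locate the footer from the tail of the stream. Nothing like the helper below exists in parquet-hadoop today; it is only a sketch of that first step:

  // Hypothetical helper, not part of the current parquet-hadoop API.
  // A Parquet file ends with a 4-byte little-endian footer length followed
  // by the 4-byte "PAR1" magic, so the footer can be located from the file
  // length alone, which is why a stream-based reader also needs the length.
  static long findFooterStart(FSDataInputStream in, long fileLength) throws IOException {
    in.seek(fileLength - 8);                 // footer length + magic
    byte[] tail = new byte[8];
    in.readFully(tail);
    int footerLength = (tail[0] & 0xFF)
        | (tail[1] & 0xFF) << 8
        | (tail[2] & 0xFF) << 16
        | (tail[3] & 0xFF) << 24;            // little-endian int
    return fileLength - 8 - footerLength;    // offset where the footer begins
  }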

rb


--
Ryan Blue
Software Engineer
Cloudera, Inc.
