On 03/26/2015 12:36 PM, Daniel St. John wrote:
Hello,
I updated to the latest versions of everything in the Parquet ecosystem, and the
annotations in the message type are now coming through when reading the Parquet
file, so please excuse the last communication.
I'm glad you fixed it. Sorry that we seem to have missed that e-mail;
we're normally pretty good about replying to questions like that.
Question: Can I open a Parquet file with an instance of FSDataInputStream
instead of Path?
It doesn't look like that's possible in today's API. If you're
interested in a stream-based reader, we can help you add one to the
project. There's no reason why we shouldn't be able to build one as long
as the stream is Seekable.
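To make that concrete, the entry point would probably look something like the
sketch below. To be clear, nothing like this exists in parquet-hadoop today; the
method and its signature are hypothetical, but they show that a seekable stream
plus the file length is all the footer read really needs.

// Hypothetical sketch only -- NOT part of today's parquet-hadoop API.
// FSDataInputStream already implements Seekable, and the length is needed
// because the footer is located relative to the end of the file.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import parquet.hadoop.metadata.ParquetMetadata;

public interface StreamBasedParquetReader {
  ParquetMetadata readFooter(Configuration conf, FSDataInputStream in, long length)
      throws IOException;
}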
What I have done was inspired by the CSV-to-Parquet example on GitHub. We are
using Parquet as storage for our proprietary record format, and we also read
existing Parquet files and translate them into that format.
In short, I open a Parquet file with a File or Path, query the footer for the
message type, derive our Record schema from it using the extra info annotation,
and then read the data from Parquet into our record format one record at a time.
It is working well: records go in and come out, and the data checks out.
Here is a summary of what I am doing:
1> The message schema:
Path parquetFilePath = ..
ParquetMetadata readFooter = ParquetFileReader.readFooter(configuration, parquetFilePath);
MessageType schema = readFooter.getFileMetaData().getSchema();
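Since you derive your Record schema from the extra info annotation: that extra
info is stored in the footer's key-value metadata, which you can get from the
same FileMetaData object. The key name below is just a placeholder for whatever
your writer actually stored:

Map<String, String> keyValueMetaData = readFooter.getFileMetaData().getKeyValueMetaData();
String extraInfo = keyValueMetaData.get("my.extra.info"); // placeholder key name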
2> Then a reader:
Path path = ..
GroupReadSupport readSupport = new GroupReadSupport();
readSupport.init(configuration, null, schema);
ParquetReader<Group> reader;
try {
    reader = new ParquetReader<Group>(path, readSupport);
} catch (IOException e) {
    LOG.error("We can not create Parquet Reader " + e);
    e.printStackTrace();
    throw new ReadParquestFileException(e);
}
I don't recommend using the example object model, and I think it should be
removed from the public API because it is misleading. It's just an example of
how one would write an object model and is not intended for real use.
Anyway, what you have so far looks okay otherwise. But I seriously
recommend either writing your own object model (to go directly to/from
the objects returned by your RedPoint format) or reusing another one
that is intended for real-world use, like parquet-avro.
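For example, the read side with parquet-avro is only a few lines. This is a
minimal sketch, assuming parquet-avro is on your classpath and that your version
still uses the parquet.* package prefix:

import java.io.IOException;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import parquet.avro.AvroParquetReader;

// Read every record in the file as an Avro GenericRecord and hand it off
// to whatever converts it into your own Record type.
public static void readWithAvro(Path path) throws IOException {
  AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(path);
  try {
    GenericRecord record;
    while ((record = reader.read()) != null) {
      // convert record into your proprietary format here
    }
  } finally {
    reader.close();
  }
}

That way your translation layer only has to deal with one object model that is
actually intended for production use.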
3> Get Data sequentially:
Group group;
// my record
Record dmRecord =..
<SNIP>
return dmRecord;
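On the sequential read itself: the part you snipped usually ends up looking
roughly like the loop below. This is just a sketch; the field names are made up,
and the conversion into your Record type is whatever your code does with the
values.

// Uses the reader from step 2; read() returns null when the file is exhausted.
Group group;
while ((group = reader.read()) != null) {
  // pull typed values out by field name and repetition index (names are illustrative)
  String name = group.getString("name", 0);
  int id = group.getInteger("id", 0);
  // populate dmRecord from these values here
}
reader.close();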
Question: How do I do this with an FSDataInputStream instead of a Path? It seems
like Path is baked in. I have a requirement to work with FSDataInputStream
rather than Path or File.
You would have to build a stream reader in parquet-hadoop,
unfortunately. As I said, we can help you do that and get it into the
project.
rb
--
Ryan Blue
Software Engineer
Cloudera, Inc.