Parquet is columnar storage, but many execution engines actually
operate on rows. It's common to select columns and push down filters, but
then to want the rows reconstructed, because working directly with
columnar data is difficult. Spark operates on rows, and when Parquet stands
in for another format, like Avro, the expectation is that the reader
returns rows. You can use the parquet-arrow project if you want columnar access:
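(As an aside, the low-level parquet-column API can also walk a single column chunk directly, without assembling records. The sketch below is not parquet-arrow; it assumes parquet-mr 1.9.x APIs, a hypothetical HDFS path, and that the first column is a required INT32. It is an illustration, not a recommended pattern.)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.ColumnReader;
import org.apache.parquet.column.impl.ColumnReadStoreImpl;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.DummyRecordConverter;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.schema.MessageType;

public class ColumnScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("hdfs:///tmp/data.parquet"); // hypothetical path

    ParquetMetadata footer = ParquetFileReader.readFooter(
        conf, path, ParquetMetadataConverter.NO_FILTER);
    MessageType schema = footer.getFileMetaData().getSchema();
    ColumnDescriptor col = schema.getColumns().get(0); // assumed INT32 column

    try (ParquetFileReader reader = new ParquetFileReader(
        conf, footer.getFileMetaData(), path,
        footer.getBlocks(), schema.getColumns())) {
      PageReadStore rowGroup;
      while ((rowGroup = reader.readNextRowGroup()) != null) {
        // A converter is still required even for raw column access
        ColumnReadStoreImpl store = new ColumnReadStoreImpl(
            rowGroup, new DummyRecordConverter(schema).getRootConverter(),
            schema, footer.getFileMetaData().getCreatedBy());
        ColumnReader values = store.getColumnReader(col);
        for (long i = 0, n = values.getTotalValueCount(); i < n; i++) {
          // Skip nulls: only defined values carry data
          if (values.getCurrentDefinitionLevel() == col.getMaxDefinitionLevel()) {
            System.out.println(values.getInteger());
          }
          values.consume();
        }
      }
    }
  }
}
```

Note that even here a converter has to be supplied, which hints at why the public reader/writer surface is record-oriented.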

For your second question about accessing file metadata, you can read the
entire footer using the methods in ParquetFileReader:
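For example (a sketch, assuming parquet-hadoop 1.9.x; the file path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class FooterDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("hdfs:///tmp/data.parquet"); // hypothetical path

    // Read the whole footer: schema, key/value metadata, row groups
    ParquetMetadata footer = ParquetFileReader.readFooter(
        conf, path, ParquetMetadataConverter.NO_FILTER);

    System.out.println(footer.getFileMetaData().getSchema());
    System.out.println(footer.getFileMetaData().getKeyValueMetaData());
    for (BlockMetaData block : footer.getBlocks()) {
      System.out.println("row group: " + block.getRowCount() + " rows, "
          + block.getTotalByteSize() + " bytes");
    }
  }
}
```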


On Mon, Aug 7, 2017 at 11:35 PM, Jörg Anders <> wrote:

> Hi all!
> I program in Java and I use Parquet with Hadoop because I need to
> write/read to/from HDFS. I'm a bit confused because of the contradiction
> between the columnar nature of Parquet and the ParquetReader/Writer in
> version 1.9.0 of parquet-hadoop from org.apache.parquet and version 1.6.0
> of parquet-hadoop from com.twitter.
> They require writing row by row even if I had the columns at hand:
> Iterator<Valuet> itr = theValues.iterator();
> while (itr.hasNext()) {
>     writer.write(groupFromValue(itr.next()));
> }
> writer.close();
> Did I fail to notice a package or function? Is there a way to write
> columns directly?
> If not: Could anybody please explain the contradiction between the
> columnar nature of Parquet and the row-by-row read/write strategy?
> Is it for technical reasons, perhaps because of some requirement of the
> record shredding and assembly algorithm?
> A URL would suffice.
> Thank you in advance
> Joerg

Ryan Blue
Software Engineer
