Many thanks for this!
Ryan Blue <[email protected]> wrote on Tuesday, 8 August 2017 at 17:44:
Joerg,
Parquet is columnar storage, but a lot of execution engines actually
operate on rows. It's common to select columns and push down filters, but
then to want the rows reconstructed, because it is difficult to work with
columnar data directly. Spark operates on rows, and when Parquet stands in
for another format, like Avro, the expectation is to get rows from the reader.
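A minimal sketch of that row-based access path, using the generic Group API
from parquet-hadoop (the file path and class name below are just placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class RowReadExample {
  public static void main(String[] args) throws Exception {
    // Placeholder path; GroupReadSupport reassembles each row from the
    // column chunks that survive projection and filter pushdown.
    Path file = new Path("hdfs:///tmp/example.parquet");
    try (ParquetReader<Group> reader =
             ParquetReader.builder(new GroupReadSupport(), file).build()) {
      Group row;
      while ((row = reader.read()) != null) {
        System.out.println(row);
      }
    }
  }
}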
You can use the parquet-arrow project if you want columnar access:
https://github.com/apache/parquet-mr/tree/master/parquet-arrow
For your second question about accessing file metadata, you can read the
entire footer using the methods in ParquetFileReader:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
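A minimal sketch, assuming the 1.9.0-era readFooter(Configuration, Path)
helper and a placeholder file path; the footer gives you the schema, the
key/value metadata, and per-row-group block metadata:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class FooterExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path("hdfs:///tmp/example.parquet");
    // Read the entire footer in one call.
    ParquetMetadata footer = ParquetFileReader.readFooter(conf, file);
    System.out.println(footer.getFileMetaData().getSchema());
    System.out.println(footer.getFileMetaData().getKeyValueMetaData());
    for (BlockMetaData block : footer.getBlocks()) {
      System.out.println("row group rows: " + block.getRowCount());
    }
  }
}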
rb
On Mon, Aug 7, 2017 at 11:35 PM, Jörg Anders <[email protected]>
wrote:
> Hi all!
> I program in Java and I use PARQUET with HADOOP because I need to
> write/read to/from HDFS. I'm a bit confused about the contradiction
> between the columnar nature of PARQUET and the ParquetReader/Writer in
> version 1.9.0 of parquet-hadoop from org.apache.parquet and version 1.6.0
> of parquet-hadoop from com.twitter.
> They require writing row by row even if I have the columns at hand:
> Iterator<Valuet> itr = theValues.iterator();
> while (itr.hasNext()) {
>     writer.write(groupFromValue(itr.next()));
> }
> writer.close();
> Did I fail to notice a package or function? Is there a way to write
> columns directly?
> If not: could anybody please explain the contradiction between the
> columnar nature of PARQUET and the row-by-row read/write strategy?
>
> Is it for technical reasons, perhaps because of some requirements of the
> record shredding and assembly algorithm?
> A URL would suffice.
> Thank you in advance
> Joerg
--
Ryan Blue
Software Engineer
Netflix