Parquet is columnar storage, but many execution engines actually
operate on rows. It's common to select columns and push down filters, but
then to want the rows reconstructed, because working directly with
columnar data is difficult. Spark operates on rows, and when Parquet stands
in for another format, like Avro, the expectation is that the reader
returns rows. You can use the parquet-arrow project if you want columnar access:
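(As an aside, the low-level parquet-column API can also walk a single column chunk directly, without assembling records. The sketch below is not parquet-arrow; it assumes parquet-mr 1.9.x APIs, a hypothetical HDFS path, and that the first column is a required INT32. It is an illustration, not a recommended pattern.)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.ColumnReader;
import org.apache.parquet.column.impl.ColumnReadStoreImpl;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.DummyRecordConverter;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.schema.MessageType;

public class ColumnScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("hdfs:///tmp/data.parquet"); // hypothetical path

    ParquetMetadata footer = ParquetFileReader.readFooter(
        conf, path, ParquetMetadataConverter.NO_FILTER);
    MessageType schema = footer.getFileMetaData().getSchema();
    ColumnDescriptor col = schema.getColumns().get(0); // assumed INT32 column

    try (ParquetFileReader reader = new ParquetFileReader(
        conf, footer.getFileMetaData(), path,
        footer.getBlocks(), schema.getColumns())) {
      PageReadStore rowGroup;
      while ((rowGroup = reader.readNextRowGroup()) != null) {
        // A converter is still required even for raw column access
        ColumnReadStoreImpl store = new ColumnReadStoreImpl(
            rowGroup, new DummyRecordConverter(schema).getRootConverter(),
            schema, footer.getFileMetaData().getCreatedBy());
        ColumnReader values = store.getColumnReader(col);
        for (long i = 0, n = values.getTotalValueCount(); i < n; i++) {
          // Skip nulls: only defined values carry data
          if (values.getCurrentDefinitionLevel() == col.getMaxDefinitionLevel()) {
            System.out.println(values.getInteger());
          }
          values.consume();
        }
      }
    }
  }
}
```

Note that even here a converter has to be supplied, which hints at why the public reader/writer surface is record-oriented.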

For your second question about accessing file metadata, you can read the
entire footer using the methods in ParquetFileReader:
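For example (a sketch, assuming parquet-hadoop 1.9.x; the file path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class FooterDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("hdfs:///tmp/data.parquet"); // hypothetical path

    // Read the whole footer: schema, key/value metadata, row groups
    ParquetMetadata footer = ParquetFileReader.readFooter(
        conf, path, ParquetMetadataConverter.NO_FILTER);

    System.out.println(footer.getFileMetaData().getSchema());
    System.out.println(footer.getFileMetaData().getKeyValueMetaData());
    for (BlockMetaData block : footer.getBlocks()) {
      System.out.println("row group: " + block.getRowCount() + " rows, "
          + block.getTotalByteSize() + " bytes");
    }
  }
}
```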


On Mon, Aug 7, 2017 at 11:35 PM, Jörg Anders <> wrote:

> Hi all!
> I program in Java and I use Parquet with Hadoop because I need to
> write/read to/from HDFS. I'm a bit confused because of the contradiction
> between the columnar nature of Parquet and the ParquetReader/Writer in
> version 1.9.0 of parquet-hadoop from org.apache.parquet and version 1.6.0
> of parquet-hadoop from com.twitter.
> They require writing row by row even if I had the columns at hand:
> Iterator<Valuet> itr = theValues.iterator();
> while (itr.hasNext()) {
>     writer.write(groupFromValue(itr.next()));
> }
> writer.close();
> Did I fail to notice a package or function? Is there a way to write
> columns directly?
> If not: Could anybody please explain the contradiction between the
> columnar nature of Parquet and the row-by-row read/write strategy?
> Is it for technical reasons, perhaps because of some requirement of the
> record shredding and assembly algorithm?
> A URL would suffice.
> Thank you in advance
> Joerg

Ryan Blue
Software Engineer
