Hi Joerg,

It sounds like you are referring to the record-based writer API found in parquet-mr, which was originally designed for use in Hadoop MapReduce (if I understand correctly).
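For concreteness, this is roughly what that record-based path looks like with the example Group API in parquet-mr. This is only a minimal sketch: the schema, field name, and output path are made up for illustration.

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.example.data.simple.SimpleGroupFactory;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.ExampleParquetWriter;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class RowByRowWriteSketch {
      public static void main(String[] args) throws Exception {
        // Hypothetical one-column schema, just for illustration
        MessageType schema = MessageTypeParser.parseMessageType(
            "message example { required int32 value; }");
        SimpleGroupFactory groups = new SimpleGroupFactory(schema);

        // The writer accepts one assembled record (Group) at a time;
        // splitting each record into column chunks happens inside the writer.
        try (ParquetWriter<Group> writer = ExampleParquetWriter
            .builder(new Path("rows.parquet"))
            .withType(schema)
            .build()) {
          for (int i = 0; i < 5; i++) {
            writer.write(groups.newGroup().append("value", i));
          }
        }
      }
    }

That record-at-a-time loop is the pattern you quoted below, but it is a property of this particular API, not of the file format itself.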
There is no requirement to write Parquet files in this fashion. The Parquet C++ writer and reader API (https://github.com/apache/parquet-cpp) is vectorized / column-based, and some systems (Spark, Dremio, and Drill, I believe) have vectorized Java implementations. There is also interest in creating an Arrow-based columnar reader and writer API for Java within parquet-mr; that would be a promising approach.

- Wes

On Mon, Aug 7, 2017 at 9:38 AM, Jörg Anders <[email protected]> wrote:
> Hi all!
> I have a general question concerning Parquet.
> Parquet is a columnar store, but the typical Apache Parquet writer/reader
> loops use a row-by-row strategy:
>
>     Iterator<Value> itr = theValues.iterator();
>     while (itr.hasNext()) {
>         writer.write(groupFromValue(itr.next()));
>     }
>     writer.close();
>
> Assume I had the columns at hand. This procedure requires converting them
> into rows. Is there a way to write columns directly? If not, could someone
> please explain the contradiction between the columnar nature of Parquet
> and the row-by-row read/write strategy?
>
> Is it for technical reasons, perhaps because of some requirements of the
> record shredding and assembly algorithm?
> A URL would suffice.
> Thank you in advance
> Joerg
>
