Hi Joerg,

It sounds like you are referring to the record-based writer API found in parquet-mr, which was originally designed for use in Hadoop MapReduce (if I understand correctly).
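For concreteness, this is roughly what that record-based path looks like with the example Group API in parquet-mr. This is only a minimal sketch: the schema, field name, and output path are made up for illustration.

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.example.data.simple.SimpleGroupFactory;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.ExampleParquetWriter;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class RowByRowWriteSketch {
      public static void main(String[] args) throws Exception {
        // Hypothetical one-column schema, just for illustration
        MessageType schema = MessageTypeParser.parseMessageType(
            "message example { required int32 value; }");
        SimpleGroupFactory groups = new SimpleGroupFactory(schema);

        // The writer accepts one assembled record (Group) at a time;
        // splitting each record into column chunks happens inside the writer.
        try (ParquetWriter<Group> writer = ExampleParquetWriter
            .builder(new Path("rows.parquet"))
            .withType(schema)
            .build()) {
          for (int i = 0; i < 5; i++) {
            writer.write(groups.newGroup().append("value", i));
          }
        }
      }
    }

That record-at-a-time loop is the pattern you quoted below, but it is a property of this particular API, not of the file format itself.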
There is no requirement to write Parquet files in this fashion. The Parquet C++ writer and reader API (https://github.com/apache/parquet-cpp) is vectorized / column-based, and some systems (Spark, Dremio, and Drill, I believe) have vectorized Java implementations. There is also interest in creating an Arrow-based columnar reader and writer API for Java within parquet-mr; that would be a promising approach.

- Wes

On Mon, Aug 7, 2017 at 9:38 AM, Jörg Anders <[email protected]> wrote:
> Hi all!
> I have a general question concerning Parquet.
> Parquet is a columnar store, but the typical Apache Parquet writer/reader
> loops use a row-by-row strategy:
>
>     Iterator<Value> itr = theValues.iterator();
>     while (itr.hasNext()) {
>         writer.write(groupFromValue(itr.next()));
>     }
>     writer.close();
>
> Assume I had the columns at hand. This procedure requires converting them
> into rows. Is there a way to write columns directly? If not, could someone
> please explain the contradiction between the columnar nature of Parquet
> and the row-by-row read/write strategy?
>
> Is it for technical reasons, perhaps because of some requirements of the
> record shredding and assembly algorithm?
> A URL would suffice.
> Thank you in advance
> Joerg
>
