Re: Interpreting ORC Java Reference

Owen O'Malley Wed, 22 Jul 2020 09:29:34 -0700

On Mon, Jul 20, 2020 at 7:01 PM Ryan Schachte <[email protected]>
wrote:


> Hi team,
> apologies for the last email, believe I sent too early. I'm interested in
> better understanding the ORC reference guide in the docs and wanted to
> clarify some things to see if I'm understanding correctly.
>
> I realize for the *VectorizedRowBatch* approach, we write in chunks of 1024
> rows and the *ColumnVectors* encapsulate this data for each respective
> column.
>
> I have a couple questions on this:
>
> *1)* When I'm looking at the file composition of an ORC file, I see the
> stripes are roughly 250 MB. Are there *N* number of *VectorizedRowBatch(es)*
> per stripe in the output ORC file?
>

VectorizedRowBatch exists to lower the overhead of reading and writing, so
there is no such requirement. The VectorizedRowBatch edges aren't recorded in
the file, so you won't get the same sized VectorizedRowBatches when you read
it back. In terms of implementation, though, the writer checks for a stripe
boundary at the end of a batch, so a stripe will mostly hold a whole number
of the batches you wrote.


>
> *2)* With respect to adding row batches to the writer (i.e.,
> *orcWriter.addRowBatch(batch)*), do I have multiple batches in a single
> file? I assume because 1024 rows is still a small file size, I would write
> N number of row batches (i.e., N calls of addRowBatch on the OrcWriter) until
> some parent criteria is satisfied.
>

Yes, you can and should write many batches to the same file. Batches can be
anything from 1 row up to their maximum, which defaults to 1024. You'll get
the same logical file if you write the file with 1 million batches of 1 row
each or 1000 batches of 1000 rows each. (The stripe boundaries may end up
different though.)
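
To make the two points above concrete, here is a minimal toy model in plain
Java. It is not ORC's writer: BatchStripeSketch, ToyWriter and the helper
write() are hypothetical names, and the stripe threshold is counted in rows
rather than bytes purely to keep the sketch short. It only assumes what is
described above: the boundary check happens at the end of a batch, and batch
edges are not part of the stored data.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only. NOT ORC's implementation; all names here are hypothetical. */
final class BatchStripeSketch {

    /** Stand-in for a writer that groups incoming rows into stripes. */
    static final class ToyWriter {
        private final int stripeRowThreshold;                         // assumption: rows, not bytes
        private final List<Long> rows = new ArrayList<>();            // logical file content
        private final List<Integer> stripeSizes = new ArrayList<>();  // rows per finished stripe
        private int rowsInCurrentStripe = 0;

        ToyWriter(int stripeRowThreshold) {
            this.stripeRowThreshold = stripeRowThreshold;
        }

        /** Appends the first {@code size} values, then checks for a stripe boundary. */
        void addRowBatch(long[] values, int size) {
            for (int i = 0; i < size; i++) {
                rows.add(values[i]);
            }
            rowsInCurrentStripe += size;
            // The boundary check runs once per batch, after the whole batch is added.
            if (rowsInCurrentStripe >= stripeRowThreshold) {
                flushStripe();
            }
        }

        /** Flushes whatever is left as a final, possibly short, stripe. */
        void close() {
            if (rowsInCurrentStripe > 0) {
                flushStripe();
            }
        }

        private void flushStripe() {
            stripeSizes.add(rowsInCurrentStripe);
            rowsInCurrentStripe = 0;
        }

        List<Long> rows() { return rows; }
        List<Integer> stripeSizes() { return stripeSizes; }
    }

    /** Writes the values 0..totalRows-1 using batches of at most batchSize rows. */
    static ToyWriter write(int totalRows, int batchSize, int stripeRowThreshold) {
        ToyWriter writer = new ToyWriter(stripeRowThreshold);
        long[] batch = new long[batchSize];
        int size = 0;
        for (int r = 0; r < totalRows; r++) {
            batch[size++] = r;
            if (size == batchSize) {        // batch is full: hand it over and reuse it
                writer.addRowBatch(batch, size);
                size = 0;
            }
        }
        if (size != 0) {                    // hand over the final partial batch
            writer.addRowBatch(batch, size);
        }
        writer.close();
        return writer;
    }

    public static void main(String[] args) {
        ToyWriter oneRowBatches = write(10_000, 1, 2_500);
        ToyWriter bigBatches = write(10_000, 1_000, 2_500);
        System.out.println(oneRowBatches.rows().equals(bigBatches.rows())); // true
        System.out.println(oneRowBatches.stripeSizes()); // [2500, 2500, 2500, 2500]
        System.out.println(bigBatches.stripeSizes());    // [3000, 3000, 3000, 1000]
    }
}
```

Both runs store the same rows in the same order, but the 1000-row batches
overshoot the threshold before the check runs, so the stripes come out at
different sizes. The real writer decides on stripe boundaries from its own
size accounting, so the actual numbers will differ from this sketch.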

.. Owen


> Thanks!
> Ryan
>
