On Mon, Jul 20, 2020 at 7:01 PM Ryan Schachte <[email protected]> wrote:
> Hi team,
> apologies for the last email, I believe I sent it too early. I'm interested in
> better understanding the ORC reference guide in the docs and wanted to
> clarify some things to see if I'm understanding correctly.
>
> I realize for the *VectorizedRowBatch* approach, we write in chunks of 1024
> rows and the *ColumnVectors* encapsulate this data for each respective
> column.
>
> I have a couple of questions on this:
>
> *1)* When I'm looking at the file composition of an ORC file, I see the
> stripes are roughly 250 MB. Are there *N* number of *VectorizedRowBatch(es)*
> per stripe in the output ORC file?
>

VectorizedRowBatch exists to lower the overhead of reading and writing, so
there isn't such a requirement. The batch boundaries aren't recorded in the
file, so you won't necessarily get the same-sized VectorizedRowBatches when
you read it back. That said, in terms of implementation, the writer checks
for a stripe boundary at the end of each batch, so a stripe will mostly hold
a whole number of batches.

>
> *2)* With respect to adding row batches to the writer (i.e.
> *orcWriter.addRowBatch(batch)*), do I have multiple batches in a single
> file? I assume because 1024 rows is still a small file size, I would write
> N number of row batches (i.e., N calls of addRowBatch on the OrcWriter) until
> some parent criteria is satisfied.
>

Yes, you can and should write many batches to the same file. A batch can hold
anything from 1 row up to its maximum size, which defaults to 1024. You'll get
the same logical file whether you write 1 million batches of 1 row each or
1000 batches of 1000 rows each. (The stripe boundaries may end up different,
though.)

.. Owen

> Thanks!
> Ryan
>
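To make the point about batch sizes and stripe boundaries concrete, here is a toy model in plain Java. It is not ORC code and does not use the ORC API: the class name, the method name, and the stripe threshold counted in rows (rather than bytes) are all assumptions made for illustration. The only behavior it borrows from the reply above is that the stripe check runs at the end of each batch.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only (hypothetical, not ORC's implementation). */
final class StripeSketch {

    /**
     * Returns the number of rows that end up in each stripe when totalRows
     * rows are written in batches of batchSize rows.
     * stripeThresholdRows is a made-up stand-in for the real size-based limit.
     */
    static List<Integer> stripeRowCounts(int totalRows, int batchSize,
                                         int stripeThresholdRows) {
        List<Integer> stripes = new ArrayList<>();
        int buffered = 0; // rows held in the current, unflushed stripe
        int written = 0;  // rows handed to the "writer" so far

        while (written < totalRows) {
            // The last batch may be smaller than batchSize.
            int rowsInBatch = Math.min(batchSize, totalRows - written);
            buffered += rowsInBatch;
            written += rowsInBatch;

            // The stripe check happens only here, after a whole batch.
            if (buffered >= stripeThresholdRows) {
                stripes.add(buffered);
                buffered = 0;
            }
        }
        // Closing the file flushes whatever is still buffered.
        if (buffered > 0) {
            stripes.add(buffered);
        }
        return stripes;
    }

    public static void main(String[] args) {
        // Same 10,000 logical rows, two different batch sizes.
        System.out.println(stripeRowCounts(10_000, 1, 2_500));
        System.out.println(stripeRowCounts(10_000, 1_024, 2_500));
    }
}
```

In this model both runs store the same 10,000 rows, but one-row batches cut stripes at exactly 2,500 rows, while 1,024-row batches overshoot the threshold and cut at 3,072 rows, leaving a shorter final stripe.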
