jorisvandenbossche edited a comment on pull request #9702:
URL: https://github.com/apache/arrow/pull/9702#issuecomment-984540887


   Reading a bit more about it, I now realize that `stripe_size` is a size in
bytes, while I had assumed it was a number of rows. That is an additional
reason why my example above didn't work.
   So I think passing the stripe size as the batch size, as you did in the
last commit, is incorrect.
   
   I think this has the following consequences:
   
   - The file is written in batches (our `ORCFileWriter::Write` calls
`orc::Writer::add` multiple times following `kOrcWriterBatchSize`, which is
expressed in number of rows), and it seems that the check for whether to write
a stripe happens only once per added batch, at the end of it. So that means
that you can't create multiple stripes from a single batch? You can only add
batches until you reach the minimum stripe size, at which point a stripe is
created, and the next batches being added will form the next stripe. So this
seems to generally assume that batches are smaller than stripes?
   - For this reason, would it make sense to be able to specify `batch_size` as
well? If you want a smaller stripe size, you might need a smaller batch size
too. Although it is currently set to `128 * 1024` rows, which seems small
enough for practical use?
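
To make the reasoning above concrete, here is a rough sketch of the behaviour as I understand it, not the actual ORC writer logic. The function name `estimate_stripes`, the fixed 8 bytes per row (uncompressed float64), and the 64 MiB default stripe size are all assumptions for illustration; the real writer estimates the size of its encoded streams instead.

```python
def estimate_stripes(n_rows, stripe_size=64 * 1024 * 1024,
                     batch_size=128 * 1024, bytes_per_row=8):
    """Count stripes under a simplified model of the writer.

    Assumptions (hypothetical, for illustration only):
    - rows are added in batches of at most `batch_size` rows,
    - the stripe-size check happens once, after each added batch,
    - each row contributes a fixed `bytes_per_row` bytes,
    - whatever is still buffered at close is flushed as a final stripe.
    """
    stripes = 0
    buffered_bytes = 0
    for start in range(0, n_rows, batch_size):
        rows = min(batch_size, n_rows - start)
        buffered_bytes += rows * bytes_per_row
        # The check runs per batch, so one batch never yields two stripes.
        if buffered_bytes >= stripe_size:
            stripes += 1
            buffered_bytes = 0
    if buffered_bytes > 0:
        stripes += 1  # remaining rows flushed when the file is closed
    return stripes


one_batch_plus_one = 128 * 1024 + 1
print(estimate_stripes(one_batch_plus_one))                  # default stripe size
print(estimate_stripes(one_batch_plus_one, stripe_size=10))  # tiny stripe size
```

Under this model a table has to span more than one batch before a small `stripe_size` can have any visible effect, and the number of stripes can never exceed the number of batches.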
   
   In practice this also means that you need to create a test dataset that is
larger than this default batch size to see the effect of `stripe_size` (using
your branch without the last commit):
   
   ```python
   >>> import numpy as np
   >>> import pyarrow as pa
   >>> from pyarrow import orc
   # table which will be written as two batches
   >>> table = pa.table({'a': np.random.randn((128 * 1024) + 1)})
   # with default stripe size, you still get a single stripe
   >>> orc.write_table(table, "test_orc_size.orc", compression="zlib")
   >>> orc.ORCFile("test_orc_size.orc").nstripes
   1
   # but with small stripe size, you actually get two stripes as expected
   # (setting it to an arbitrarily low 10 bytes for testing)
   >>> orc.write_table(table, "test_orc_size.orc", stripe_size=10, compression="zlib")
   >>> orc.ORCFile("test_orc_size.orc").nstripes
   2
   # and so further increasing the size of the table works as expected
   >>> table = pa.table({'a': np.random.randn((128 * 1024) * 2 + 1)})
   >>> orc.write_table(table, "test_orc_size.orc", stripe_size=10, compression="zlib")
   >>> orc.ORCFile("test_orc_size.orc").nstripes
   3
   ```


