Hi all, I am using the ORC Java API to write ORC files from a custom Java process to S3. Can I cap the size of the compressed output files at a certain limit? For example, let's say I don't want any files to be larger than 50MB.
The flow in my code is (a simplified sketch is pasted below my signature):

1. Create an ORC writer: OrcFile.createWriter(...)
2. Add rows 1000 at a time using VectorizedRowBatch: writer.addRowBatch(batch)
3. Close the writer when no more data is available to be added: writer.close()

This creates one file on S3 whose size is determined by however much data was written plus compression. The API doesn't seem to offer any way to pre-determine how much data will be written before the writer is closed, so I can't use that information to decide when to close the writer. If I write 100MB but would like 50MB files, is there a way to configure the OrcWriter to split the output into two files based on a size limit?

Followup question: when writing to S3, is each row batch flushed out of memory when it is added, or is it buffered in memory until the writer is closed? My reason for wanting a limit (e.g. 50MB) is to ensure that Java memory is not being buffered beyond that point, as my Java process runs multiple OrcWriters concurrently, one per input stream. If this buffering in memory is not an issue, maybe I don't need to worry about limiting the file size?

Thanks for your help,
Eric
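
P.S. For reference, here is a simplified sketch of the flow described above. The S3 path and the single-column schema are just placeholders for illustration; my real schema and paths come from the input streams.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcS3WriterSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder schema and S3 path -- the real ones come from my input stream.
    TypeDescription schema = TypeDescription.fromString("struct<x:bigint>");
    Configuration conf = new Configuration();

    // 1. Create the ORC writer (writing through the s3a filesystem).
    Writer writer = OrcFile.createWriter(
        new Path("s3a://my-bucket/output/part-0.orc"),
        OrcFile.writerOptions(conf).setSchema(schema));

    // 2. Add rows 1000 at a time via VectorizedRowBatch.
    VectorizedRowBatch batch = schema.createRowBatch(1000);
    LongColumnVector x = (LongColumnVector) batch.cols[0];
    for (long value = 0; value < 10_000; value++) {
      int row = batch.size++;
      x.vector[row] = value;
      if (batch.size == batch.getMaxSize()) {
        writer.addRowBatch(batch);
        batch.reset();
      }
    }
    if (batch.size != 0) {
      writer.addRowBatch(batch);
    }

    // 3. Close when no more data is available; only now is the single
    //    output file finalized, however large it has become.
    writer.close();
  }
}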
