Hi, I added my opinions.

> I would like to understand if there is already a way to combine two small
> files

The `orc-tools` convert command already supports simple merging like the
following.

    orc-tools convert 1.orc 2.orc -o merged.orc

> without having to read and write both the files into a single file.

At a minimum, you need to open the source files and read their bytes at the
stripe level. You probably left a specific requirement out of this question.

> This will save a lot of time when we have too many small files to combine
> into one single file.

Let me assume that you aim for `Stripe`-level concatenation instead of
reading columns or records.

In the case of many small files, the above claim is the usual first approach
and is true in some cases. However, in my experience, it can be badly wrong
when there are many tiny ORC files, because stripes are the unit of
compression and processing. Concatenating tiny files keeps their tiny
stripes, so the simply-merged single gigantic ORC file may not be the best
result you can get.

I'd recommend trying both the compaction style (read/sort/write) and the
simple stripe-level concatenation first. After a real experiment, you can
choose based on data about

- Your ORC data content characteristics
- The file size reduction ratio after processing (not only from the columnar
  encoding benefit; you may also want to switch the compression codec)
- Your downstream consumers' usage pattern (access frequency and input split
  handling)

BTW, these days, you had better consider Apache Iceberg, which provides more
features on top of the ORC file-level features.

Dongjoon.

On Fri, Jul 8, 2022 at 7:46 PM satyajit vegesna <satyajit.apas...@gmail.com> wrote:

> Hi Community,
>
> I would like to understand if there is already a way to combine two small
> files, without having to read and write both the files into a single file.
>
> This will save a lot of time when we have too many small files to combine
> into one single file.
>
> Is it because of the internal metadata and structure that holds to combine
> files? or any other reason.
>
> Regards.
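
As a minimal sketch of the experiment recommended in the reply, the file size
reduction ratio of each approach can be measured with a few lines of Python.
The function names and the file names in the usage note are hypothetical, and
the sketch only compares byte sizes on disk; it does not read any ORC
metadata.

```python
import os
from typing import Iterable


def total_size(paths: Iterable[str]) -> int:
    """Sum of the on-disk sizes, in bytes, of the given files."""
    return sum(os.path.getsize(p) for p in paths)


def reduction_ratio(input_paths: Iterable[str], output_path: str) -> float:
    """Fraction of bytes saved by the output file relative to the inputs.

    0.0 means no saving, 0.25 means the output is 25% smaller than the
    inputs combined, and a negative value means the output grew.
    """
    before = total_size(input_paths)
    if before == 0:
        raise ValueError("input files are empty")
    after = os.path.getsize(output_path)
    return 1.0 - after / before
```

For example, `reduction_ratio(["1.orc", "2.orc"], "merged.orc")` could be
compared with the same call on the output of a read/sort/write compaction
job (say, a hypothetical `compacted.orc`). Size is only one of the three
criteria listed in the reply, so the result should be read together with the
data characteristics and the downstream access pattern.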