To achieve your goal, you must implement your own OutputStream for S3. You can see an example implementation, InMemoryOutputStream, in the files below:
https://github.com/apache/parquet-cpp/blob/6ab16f3ae8e4a76ea28a704d88267bb342ba407b/src/parquet/util/memory.h
https://github.com/apache/parquet-cpp/blob/6ab16f3ae8e4a76ea28a704d88267bb342ba407b/src/parquet/util/memory.cc

With the S3OutputStream implementation, you can then create a ParquetFileWriter using the Open() API in the file below:
https://github.com/apache/parquet-cpp/blob/6ab16f3ae8e4a76ea28a704d88267bb342ba407b/src/parquet/file_writer.cc

A rough sketch of both pieces follows at the end of this message, below the quoted question.

On Sat, Jul 21, 2018 at 12:25 AM [email protected] <[email protected]> wrote:
> Hi,
>
> I want to convert a huge dataset (e.g. 1 TB) from a database to a Parquet
> file. Because of file system size and memory limitations, it is not possible
> to create one single Parquet file and store it in the file system. Instead, I
> plan to read data from the DB as small chunks (e.g. 100 or 1000 rows) at a
> time, create a row group for each chunk, and as soon as the binary (Parquet)
> form of that chunk (a single row group) is ready, upload it to S3 rather than
> wait for the whole Parquet file to be finished.
>
> I am using the parquet-cpp library for this project, and I can see that the
> library supports only very limited functionality (take the whole table and
> store it as one single Parquet file in the file system), which is not
> possible in my case.
>
> Is it possible to use the parquet-cpp library in the following way?
> Instead of providing a file name to the library, I provide a named pipe
> (FIFO); then, whenever the library writes content into the FIFO, another
> process uploads that content directly to S3 in the background. That way we
> can create one big Parquet file without storing the whole file in the file
> system or in memory.
> - To achieve that, I tried passing a FIFO name instead of an actual file
> name to the library, but I got:
>   "Parquet write error: Arrow error: IOError: lseek failed"
> Is this because the parquet-cpp library does not support a FIFO as the file
> name? If yes, is there another way I can create the Parquet file?
> - I can create one Parquet file for each chunk (100 or 1000 rows), but this
> would create a huge number of Parquet files. Instead, I want to create one
> Parquet file from hundreds or thousands of chunks (creating a partial
> Parquet file for each chunk and uploading it immediately to S3), even though
> I cannot store all of these chunks together in memory or on the file system.
>
> Hope my question is clear :) Thanks in advance!

--
regards,
Deepak Majeti
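
To make the suggestion concrete, here is a minimal sketch of an S3-backed OutputStream, modelled on InMemoryOutputStream. It assumes the parquet::OutputStream interface at the commit linked above (Write/Tell/Close); S3MultipartUploader is a hypothetical helper, not part of parquet-cpp, standing in for whatever wrapper you write around the AWS SDK's multipart-upload calls.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

#include "parquet/util/memory.h"  // parquet::OutputStream (abstract sink)

// Hypothetical helper that would wrap the AWS SDK's multipart upload
// (CreateMultipartUpload / UploadPart / CompleteMultipartUpload).
// Its implementation is out of scope for this sketch.
class S3MultipartUploader {
 public:
  S3MultipartUploader(const std::string& bucket, const std::string& key);
  void UploadPart(const uint8_t* data, int64_t length);  // sends one part
  void Complete();                                       // finalizes the object
};

// Sketch of an OutputStream that buffers bytes and ships them to S3 once
// enough data has accumulated to form a multipart-upload part (>= 5 MiB).
class S3OutputStream : public parquet::OutputStream {
 public:
  explicit S3OutputStream(std::shared_ptr<S3MultipartUploader> uploader)
      : uploader_(std::move(uploader)) {}

  void Write(const uint8_t* data, int64_t length) override {
    buffer_.insert(buffer_.end(), data, data + length);
    position_ += length;
    if (buffer_.size() >= kPartSize) {
      Flush();
    }
  }

  // Parquet only appends to the sink, so Tell() is the total bytes written.
  int64_t Tell() override { return position_; }

  void Close() override {
    if (closed_) return;    // idempotent: the file writer may also close the sink
    Flush();                // push any remaining bytes as the final part
    uploader_->Complete();  // finish the multipart upload
    closed_ = true;
  }

 private:
  void Flush() {
    if (!buffer_.empty()) {
      uploader_->UploadPart(buffer_.data(), static_cast<int64_t>(buffer_.size()));
      buffer_.clear();
    }
  }

  static constexpr size_t kPartSize = 5 * 1024 * 1024;  // S3 minimum part size
  std::shared_ptr<S3MultipartUploader> uploader_;
  std::vector<uint8_t> buffer_;
  int64_t position_ = 0;
  bool closed_ = false;
};
```

Because the stream buffers at most one part at a time, memory use stays bounded no matter how large the final Parquet file grows.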

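And here is a hedged sketch of the writer side using that stream, writing one row group per chunk fetched from the database so that each chunk can leave memory as soon as its row group is encoded. The schema (a single REQUIRED INT64 column named "id"), FetchNextChunk(), and MakeUploader() are placeholders for your own code, and the exact row-group API (AppendRowGroup and the typed column writers) should be double-checked against file_writer.h at the commit above.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

#include <parquet/api/writer.h>  // ParquetFileWriter, WriterProperties, schema

// Placeholder: fetch the next chunk of rows from the database; returns an
// empty vector when the dataset is exhausted.
std::vector<int64_t> FetchNextChunk();

// Placeholder: construct the hypothetical uploader from the previous sketch.
std::shared_ptr<S3MultipartUploader> MakeUploader();

int main() {
  // Example schema: a single required INT64 column called "id".
  parquet::schema::NodeVector fields;
  fields.push_back(parquet::schema::PrimitiveNode::Make(
      "id", parquet::Repetition::REQUIRED, parquet::Type::INT64));
  auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
      parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED,
                                       fields));

  // The custom sink from the previous sketch replaces a local file.
  auto sink = std::make_shared<S3OutputStream>(MakeUploader());
  auto properties = parquet::default_writer_properties();
  auto file_writer = parquet::ParquetFileWriter::Open(sink, schema, properties);

  // One row group per chunk: each chunk is encoded and handed to the sink
  // before the next chunk is read, so only one chunk is in memory at a time.
  for (auto chunk = FetchNextChunk(); !chunk.empty(); chunk = FetchNextChunk()) {
    parquet::RowGroupWriter* rg_writer =
        file_writer->AppendRowGroup(static_cast<int64_t>(chunk.size()));
    auto* id_writer =
        static_cast<parquet::Int64Writer*>(rg_writer->NextColumn());
    id_writer->WriteBatch(static_cast<int64_t>(chunk.size()),
                          /*def_levels=*/nullptr, /*rep_levels=*/nullptr,
                          chunk.data());
    rg_writer->Close();
  }

  // Writes the file footer; closing the sink completes the multipart upload
  // (the sink's Close() is idempotent in case the writer closed it already).
  file_writer->Close();
  sink->Close();
  return 0;
}
```

The upshot is the behavior asked about in the question: row groups stream to S3 as they are produced, and only the footer is written at the very end, so neither the file system nor memory ever has to hold the whole file.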