Hi,

I want to convert a huge dataset (e.g. 1 TB) from a database to a Parquet file. 
Because of file system size and memory limitations, it is not possible to build 
one single Parquet file and store it on the file system. Instead, I plan to read 
data from the DB in small chunks (e.g. 100 or 1000 rows at a time), create a row 
group for each chunk, and as soon as the binary (Parquet) data for that chunk (a 
single row group) is ready, upload it to S3 rather than waiting for the whole 
Parquet file to be finished.
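
For clarity, here is a minimal sketch of the chunked write loop I have in mind. 
It assumes the Status-based parquet-cpp / Arrow C++ API (exact signatures differ 
between releases) and a hypothetical FetchChunkAsArrowTable() helper that wraps 
my DB cursor; paths and names are placeholders:

#include <memory>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>

// Hypothetical helper wrapping my DB cursor: returns up to `max_rows` rows as
// an arrow::Table, or nullptr once the result set is exhausted.
std::shared_ptr<arrow::Table> FetchChunkAsArrowTable(int64_t max_rows);

arrow::Status WriteChunkedParquet(const std::shared_ptr<arrow::Schema>& schema,
                                  int64_t rows_per_chunk) {
  std::shared_ptr<arrow::io::FileOutputStream> sink;
  ARROW_RETURN_NOT_OK(arrow::io::FileOutputStream::Open("/tmp/out.parquet", &sink));

  std::unique_ptr<parquet::arrow::FileWriter> writer;
  ARROW_RETURN_NOT_OK(parquet::arrow::FileWriter::Open(
      *schema, arrow::default_memory_pool(), sink,
      parquet::default_writer_properties(), &writer));

  // Each WriteTable() call appends (at least) one new row group to the same
  // output, so only one chunk needs to be in memory at a time.
  while (std::shared_ptr<arrow::Table> chunk = FetchChunkAsArrowTable(rows_per_chunk)) {
    ARROW_RETURN_NOT_OK(writer->WriteTable(*chunk, rows_per_chunk));
  }

  ARROW_RETURN_NOT_OK(writer->Close());  // footer/metadata is written at the end
  return sink->Close();
}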

I am using the parquet-cpp library for this project, and as far as I can see it 
supports only a very limited workflow: take the whole table and store it as one 
single Parquet file on the file system, which is not possible in my case.

Is it possible to use the parquet-cpp library in the following way? 
Instead of providing a regular file name to the library, I would provide a named 
pipe (FIFO); whenever the library writes content into the FIFO, a background 
process uploads that content directly to S3. That way we could create one big 
Parquet file without ever storing the whole file on the file system or in memory.
- To try this out, I passed the FIFO's path to the library instead of an actual 
file name, but I got the error 
                    "Parquet write error: Arrow error: IOError: lseek failed" 
(a rough sketch of this attempt follows after these two points). Is this because 
the parquet-cpp library does not support a FIFO as the output file? If so, is 
there another way I can create the Parquet file? 
- I could create one Parquet file per chunk (100 or 1000 rows), but that would 
produce a huge number of Parquet files. Instead I want one Parquet file covering 
hundreds or thousands of chunks (producing the partial Parquet data for each 
chunk and uploading it to S3 immediately), even though I can never hold all of 
those chunks together in memory or on the file system.
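
For reference, this is roughly what the FIFO attempt from the first point looks 
like (again only a sketch; the FIFO path is a placeholder and the uploader 
process that reads the FIFO and streams to S3 is assumed to be started 
separately):

#include <sys/stat.h>   // mkfifo

#include <memory>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <arrow/status.h>

// Same write loop as the earlier sketch, just pointed at a FIFO instead of a
// regular file. A separate background process opens the FIFO for reading and
// uploads whatever it receives to S3.
arrow::Status WriteThroughFifo(const std::shared_ptr<arrow::Schema>& schema,
                               int64_t rows_per_chunk) {
  if (mkfifo("/tmp/out.pipe", 0600) != 0) {
    return arrow::Status::IOError("mkfifo failed");
  }

  // Opening a FIFO for writing blocks until the uploader has it open for reading.
  std::shared_ptr<arrow::io::FileOutputStream> sink;
  ARROW_RETURN_NOT_OK(arrow::io::FileOutputStream::Open("/tmp/out.pipe", &sink));

  // ... then the same FileWriter::Open / WriteTable loop as above; with the FIFO
  // as the sink this is where I get:
  //     "Parquet write error: Arrow error: IOError: lseek failed"
  return arrow::Status::OK();
}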

Hope my question is clear :) Thanks in advance!
