To achieve your goal, you must implement your own OutputStream for S3. You can see an example implementation, InMemoryOutputStream, in the files below:
https://github.com/apache/parquet-cpp/blob/6ab16f3ae8e4a76ea28a704d88267bb342ba407b/src/parquet/util/memory.h
https://github.com/apache/parquet-cpp/blob/6ab16f3ae8e4a76ea28a704d88267bb342ba407b/src/parquet/util/memory.cc

With the S3OutputStream implementation, you can then create a ParquetFileWriter using the Open() API in the file below:
https://github.com/apache/parquet-cpp/blob/6ab16f3ae8e4a76ea28a704d88267bb342ba407b/src/parquet/file_writer.cc

A rough sketch of both pieces follows at the end of this message, below the quoted question.

On Sat, Jul 21, 2018 at 12:25 AM [email protected] <[email protected]> wrote:
> Hi,
>
> I want to convert a huge dataset (e.g. 1 TB) from a database to a Parquet
> file. Because of file system size and memory limitations, it is not possible
> to create one single Parquet file and store it in the file system. Instead, I
> plan to read data from the DB as small chunks (e.g. 100 or 1000 rows) at a
> time, create a row group for each chunk, and as soon as the binary (Parquet)
> form of that chunk (a single row group) is ready, upload it to S3 rather than
> wait for the whole Parquet file to be finished.
>
> I am using the parquet-cpp library for this project, and I can see that the
> library supports only very limited functionality (take the whole table and
> store it as one single Parquet file in the file system), which is not
> possible in my case.
>
> Is it possible to use the parquet-cpp library in the following way?
> Instead of providing a file name to the library, I provide a named pipe
> (FIFO); then, whenever the library writes content into the FIFO, another
> process uploads that content directly to S3 in the background. That way we
> can create one big Parquet file without storing the whole file in the file
> system or in memory.
> - To achieve that, I tried passing a FIFO name instead of an actual file
> name to the library, but I got:
>   "Parquet write error: Arrow error: IOError: lseek failed"
> Is this because the parquet-cpp library does not support a FIFO as the file
> name? If yes, is there another way I can create the Parquet file?
> - I can create one Parquet file for each chunk (100 or 1000 rows), but this
> would create a huge number of Parquet files. Instead, I want to create one
> Parquet file from hundreds or thousands of chunks (creating a partial
> Parquet file for each chunk and uploading it immediately to S3), even though
> I cannot store all of these chunks together in memory or on the file system.
>
> Hope my question is clear :) Thanks in advance!

--
regards,
Deepak Majeti
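
To make the suggestion concrete, here is a minimal sketch of an S3-backed OutputStream, modelled on InMemoryOutputStream. It assumes the parquet::OutputStream interface at the commit linked above (Write/Tell/Close); S3MultipartUploader is a hypothetical helper, not part of parquet-cpp, standing in for whatever wrapper you write around the AWS SDK's multipart-upload calls.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

#include "parquet/util/memory.h"  // parquet::OutputStream (abstract sink)

// Hypothetical helper that would wrap the AWS SDK's multipart upload
// (CreateMultipartUpload / UploadPart / CompleteMultipartUpload).
// Its implementation is out of scope for this sketch.
class S3MultipartUploader {
 public:
  S3MultipartUploader(const std::string& bucket, const std::string& key);
  void UploadPart(const uint8_t* data, int64_t length);  // sends one part
  void Complete();                                       // finalizes the object
};

// Sketch of an OutputStream that buffers bytes and ships them to S3 once
// enough data has accumulated to form a multipart-upload part (>= 5 MiB).
class S3OutputStream : public parquet::OutputStream {
 public:
  explicit S3OutputStream(std::shared_ptr<S3MultipartUploader> uploader)
      : uploader_(std::move(uploader)) {}

  void Write(const uint8_t* data, int64_t length) override {
    buffer_.insert(buffer_.end(), data, data + length);
    position_ += length;
    if (buffer_.size() >= kPartSize) {
      Flush();
    }
  }

  // Parquet only appends to the sink, so Tell() is the total bytes written.
  int64_t Tell() override { return position_; }

  void Close() override {
    if (closed_) return;    // idempotent: the file writer may also close the sink
    Flush();                // push any remaining bytes as the final part
    uploader_->Complete();  // finish the multipart upload
    closed_ = true;
  }

 private:
  void Flush() {
    if (!buffer_.empty()) {
      uploader_->UploadPart(buffer_.data(), static_cast<int64_t>(buffer_.size()));
      buffer_.clear();
    }
  }

  static constexpr size_t kPartSize = 5 * 1024 * 1024;  // S3 minimum part size
  std::shared_ptr<S3MultipartUploader> uploader_;
  std::vector<uint8_t> buffer_;
  int64_t position_ = 0;
  bool closed_ = false;
};
```

Because the stream buffers at most one part at a time, memory use stays bounded no matter how large the final Parquet file grows.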

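And here is a hedged sketch of the writer side using that stream, writing one row group per chunk fetched from the database so that each chunk can leave memory as soon as its row group is encoded. The schema (a single REQUIRED INT64 column named "id"), FetchNextChunk(), and MakeUploader() are placeholders for your own code, and the exact row-group API (AppendRowGroup and the typed column writers) should be double-checked against file_writer.h at the commit above.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

#include <parquet/api/writer.h>  // ParquetFileWriter, WriterProperties, schema

// Placeholder: fetch the next chunk of rows from the database; returns an
// empty vector when the dataset is exhausted.
std::vector<int64_t> FetchNextChunk();

// Placeholder: construct the hypothetical uploader from the previous sketch.
std::shared_ptr<S3MultipartUploader> MakeUploader();

int main() {
  // Example schema: a single required INT64 column called "id".
  parquet::schema::NodeVector fields;
  fields.push_back(parquet::schema::PrimitiveNode::Make(
      "id", parquet::Repetition::REQUIRED, parquet::Type::INT64));
  auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
      parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED,
                                       fields));

  // The custom sink from the previous sketch replaces a local file.
  auto sink = std::make_shared<S3OutputStream>(MakeUploader());
  auto properties = parquet::default_writer_properties();
  auto file_writer = parquet::ParquetFileWriter::Open(sink, schema, properties);

  // One row group per chunk: each chunk is encoded and handed to the sink
  // before the next chunk is read, so only one chunk is in memory at a time.
  for (auto chunk = FetchNextChunk(); !chunk.empty(); chunk = FetchNextChunk()) {
    parquet::RowGroupWriter* rg_writer =
        file_writer->AppendRowGroup(static_cast<int64_t>(chunk.size()));
    auto* id_writer =
        static_cast<parquet::Int64Writer*>(rg_writer->NextColumn());
    id_writer->WriteBatch(static_cast<int64_t>(chunk.size()),
                          /*def_levels=*/nullptr, /*rep_levels=*/nullptr,
                          chunk.data());
    rg_writer->Close();
  }

  // Writes the file footer; closing the sink completes the multipart upload
  // (the sink's Close() is idempotent in case the writer closed it already).
  file_writer->Close();
  sink->Close();
  return 0;
}
```

The upshot is the behavior asked about in the question: row groups stream to S3 as they are produced, and only the footer is written at the very end, so neither the file system nor memory ever has to hold the whole file.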