Rajmund Takacs created NIFI-12241:
-------------------------------------

             Summary: Efficient Parquet Splitting
                 Key: NIFI-12241
                 URL: https://issues.apache.org/jira/browse/NIFI-12241
             Project: Apache NiFi
          Issue Type: New Feature
          Components: Extensions
            Reporter: Rajmund Takacs
            Assignee: Rajmund Takacs


Add a SplitParquet processor that expects a FlowFile with Parquet content as 
input and takes the number of records per split as its configuration parameter.

The processor would generate X flow files with unmodified content, each 
carrying attributes with the offsets required to read its group of rows from 
the flow file's content.
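
A minimal sketch of how the split attributes could be computed, assuming the 
total record count is known from the Parquet footer. The attribute names 
split.record.offset and split.record.count and the function plan_splits are 
hypothetical, this issue does not define them.

```python
from typing import Dict, List


def plan_splits(total_records: int, records_per_split: int) -> List[Dict[str, str]]:
    """Return one attribute map per output flow file.

    The attribute names are hypothetical placeholders, not names
    defined by the issue or by NiFi.
    """
    if total_records < 0:
        raise ValueError("total_records must be non-negative")
    if records_per_split <= 0:
        raise ValueError("records_per_split must be positive")

    splits = []
    for offset in range(0, total_records, records_per_split):
        # The last split may hold fewer records than the configured size.
        count = min(records_per_split, total_records - offset)
        # Flow file attributes are strings, so the numbers are stringified.
        splits.append({
            "split.record.offset": str(offset),
            "split.record.count": str(count),
        })
    return splits


if __name__ == "__main__":
    for attrs in plan_splits(total_records=10, records_per_split=4):
        print(attrs)
```

With 10 records and 4 records per split, this yields three attribute maps with 
offsets 0, 4 and 8 and counts 4, 4 and 2. The content itself is never copied or 
rewritten, only the attributes differ between the generated flow files.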

Then the Parquet Reader would be improved to accept optional flow file 
attributes containing this information, so that the reader reads only the 
required part of the data.
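
A rough sketch of the reader side, assuming the reader knows the record count 
of each row group from the Parquet footer. It maps the optional attributes onto 
row groups so that row groups outside the requested range can be skipped 
entirely. The attribute names and the function rows_to_read are hypothetical, 
not part of any existing API.

```python
from typing import Dict, List, Tuple


def rows_to_read(
    row_group_sizes: List[int],
    attributes: Dict[str, str],
) -> List[Tuple[int, int, int]]:
    """Map the optional split attributes onto row groups.

    Returns (row_group_index, first_row_in_group, rows_to_take) tuples.
    When the (hypothetical) attributes are absent, every row group is
    read in full, which matches the behaviour without splitting.
    """
    total = sum(row_group_sizes)
    offset = int(attributes.get("split.record.offset", 0))
    if "split.record.count" in attributes:
        end = min(offset + int(attributes["split.record.count"]), total)
    else:
        end = total

    plan = []
    group_start = 0
    for index, size in enumerate(row_group_sizes):
        group_end = group_start + size
        # Intersect the requested range with this row group's range.
        lo = max(offset, group_start)
        hi = min(end, group_end)
        if lo < hi:
            plan.append((index, lo - group_start, hi - lo))
        group_start = group_end
    return plan


if __name__ == "__main__":
    attrs = {"split.record.offset": "4", "split.record.count": "4"}
    print(rows_to_read([5, 5, 5], attrs))
```

For three row groups of 5 records each and a split covering records 4 to 7, 
only the last record of the first row group and the first three records of the 
second row group are read. The third row group is not touched.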

Instead of having something like
{noformat}
X -> SplitRecord (Parquet / JSON) -> ...{noformat}
It'd be something like
{noformat}
X -> SplitParquet -> ConvertRecord (Parquet / JSON) -> ...{noformat}
The goal here is to increase the overall efficiency of this operation for 
extremely large Parquet files (hundreds of GBs). With the second approach, a 
single file could be processed by multiple threads, since each split can be 
read independently.

The SplitParquet processor should also have a property (true/false) to write 
zero-content flow files. The existing FetchParquet processor should be enhanced 
to accept the flow file attributes that carry the offsets. The flow would then 
look something like
{noformat}
X -> SplitParquet -> FetchParquet (JSON Writer) -> ...{noformat}
This way, a load-balanced connection could be used between SplitParquet and 
FetchParquet to distribute the work across the nodes of the cluster without 
transferring a lot of data between them.



