Rajmund Takacs created NIFI-12241:
-------------------------------------
Summary: Efficient Parquet Splitting
Key: NIFI-12241
URL: https://issues.apache.org/jira/browse/NIFI-12241
Project: Apache NiFi
Issue Type: New Feature
Components: Extensions
Reporter: Rajmund Takacs
Assignee: Rajmund Takacs
Add a SplitParquet processor that expects a FlowFile with Parquet content as
input and takes the number of records per split as its split configuration.
The processor would generate X FlowFiles with unmodified content and would add
attributes with the offsets required to read the corresponding group of rows
from the FlowFile's content.
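A minimal sketch of how the split attributes could be computed, assuming the total record count is already known (for example from the Parquet footer). The class name, the method name and the attribute names are hypothetical; the ticket does not define them, and no NiFi or Parquet API is used here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SplitPlanSketch {

    // Hypothetical attribute names; the ticket does not specify them.
    static final String ATTR_RECORD_OFFSET = "parquet.split.record.offset";
    static final String ATTR_RECORD_COUNT = "parquet.split.record.count";

    /**
     * Returns one attribute map per output FlowFile. Each map describes the
     * range of records that the FlowFile represents; the content itself is
     * not touched.
     */
    static List<Map<String, String>> planSplits(long totalRecords, long recordsPerSplit) {
        if (totalRecords < 0 || recordsPerSplit <= 0) {
            throw new IllegalArgumentException("totalRecords must be >= 0 and recordsPerSplit > 0");
        }
        List<Map<String, String>> splits = new ArrayList<>();
        for (long offset = 0; offset < totalRecords; offset += recordsPerSplit) {
            // The last split may hold fewer records than the configured size.
            long count = Math.min(recordsPerSplit, totalRecords - offset);
            Map<String, String> attributes = new LinkedHashMap<>();
            attributes.put(ATTR_RECORD_OFFSET, Long.toString(offset));
            attributes.put(ATTR_RECORD_COUNT, Long.toString(count));
            splits.add(attributes);
        }
        return splits;
    }
}
```

For instance, planSplits(250, 100) would describe three FlowFiles covering records 0-99, 100-199 and 200-249.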
Then the Parquet Reader would be improved to accept optional flow file
attributes containing this offset information, so that the reader reads only
the required part of the data.
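One way the reader could limit itself to the required part is to select only the row groups that overlap the requested record range. The sketch below assumes the per-row-group record counts are available from the file footer and that the range is given as a record offset and a record count; the class and method names are hypothetical, and the actual reader may work differently.

```java
import java.util.ArrayList;
import java.util.List;

class RowGroupSelectionSketch {

    /**
     * Returns the indices of the row groups that overlap the half-open record
     * range [recordOffset, recordOffset + recordCount). All other row groups
     * could be skipped without being read.
     */
    static List<Integer> overlappingRowGroups(long[] rowGroupRowCounts, long recordOffset, long recordCount) {
        List<Integer> selected = new ArrayList<>();
        if (recordCount <= 0) {
            return selected; // an empty range overlaps nothing
        }
        long rangeEnd = recordOffset + recordCount;
        long groupStart = 0;
        for (int i = 0; i < rowGroupRowCounts.length; i++) {
            long groupEnd = groupStart + rowGroupRowCounts[i];
            // Two half-open intervals overlap when each starts before the other ends.
            if (groupEnd > recordOffset && groupStart < rangeEnd) {
                selected.add(i);
            }
            groupStart = groupEnd;
        }
        return selected;
    }
}
```

With three row groups of 100 records each, a request for 100 records starting at offset 150 would select the second and third row groups only; the reader would then still have to skip the first 50 records of the second group.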
Instead of having something like
{noformat}
X -> SplitRecord (Parquet / JSON) -> ...{noformat}
the flow would be something like
{noformat}
X -> SplitParquet -> ConvertRecord (Parquet / JSON) -> ...{noformat}
The goal here is to increase the overall efficiency of this operation for
extremely large Parquet files (hundreds of GBs). The second approach could
leverage multi-threading to process a single file.
The SplitParquet processor should also have a boolean property (true/false) to
write zero-content flow files. The existing FetchParquet processor should be
enhanced to accept the flow file attributes that provide the offsets. This
would give something like
{noformat}
X -> SplitParquet -> FetchParquet (JSON Writer) -> ...{noformat}
This way, a load-balanced connection could be used between SplitParquet and
FetchParquet in order to distribute the work across the nodes (without
transferring a lot of data across the nodes of the cluster).