[ https://issues.apache.org/jira/browse/NIFI-15681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063852#comment-18063852 ]

Adam Turley commented on NIFI-15681:
------------------------------------

I've made a significant number of changes, but kept the original Single JSON 
per FlowFile behavior intact to preserve backward compatibility — existing 
flows should not require any reconfiguration.

In benchmarking, this typically runs 2–3x faster than PutElasticsearchRecord, 
largely because no schema definition is required. The processor accepts raw 
JSON directly, which also makes it much simpler to configure.
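As a rough illustration of why pass-through is cheap (this is a sketch, not the processor's actual code, and the function name is hypothetical): raw JSON documents are framed into an Elasticsearch `_bulk` body as-is, with only a lightweight validity check and no round-trip through an intermediate record model.

```python
import json

def build_bulk_body(raw_docs, index_name):
    """Frame raw JSON documents into an Elasticsearch _bulk payload.

    Each document is passed through unchanged; json.loads is used only to
    reject invalid JSON, and the original text is reused verbatim.
    """
    lines = []
    for doc in raw_docs:
        json.loads(doc)  # validate only; no record-model conversion
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(doc.strip())
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline

body = build_bulk_body(['{"id": 1}', '{"id": 2}'], "logs")
```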

One naming change worth calling out: the existing "Batch Size" property (which 
controls how many FlowFiles are grouped per Elasticsearch _bulk request) has 
been renamed to "Max FlowFiles Per Batch". This was done to clearly distinguish 
it from the new "Max Batch Size" property, which controls the maximum payload 
size in bytes per request. A property migration is included so existing flows 
upgrade automatically. (I'm also happy to change the name if something else 
makes more sense.)
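To show how the two limits interact (again a sketch under my own naming, not the actual processor code): a batch is flushed as soon as adding another document would exceed either the FlowFile-count cap ("Max FlowFiles Per Batch") or the byte-size cap ("Max Batch Size").

```python
def batch_documents(docs, max_flowfiles_per_batch, max_batch_size_bytes):
    """Group documents into batches, flushing when either cap would be hit.

    Mirrors the two properties described above: a count cap and a
    payload-size cap in bytes per _bulk request.
    """
    batches, current, current_bytes = [], [], 0
    for doc in docs:
        size = len(doc.encode("utf-8"))
        if current and (len(current) >= max_flowfiles_per_batch
                        or current_bytes + size > max_batch_size_bytes):
            batches.append(current)       # flush the current batch
            current, current_bytes = [], 0
        current.append(doc)
        current_bytes += size
    if current:
        batches.append(current)           # final partial batch
    return batches

# Six 4-byte docs with a count cap of 3 and a size cap of 10 bytes:
# here the size cap triggers before the count cap does.
groups = batch_documents(["aaaa"] * 6, 3, 10)
```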

Thanks in advance for your time!

> Enhance PutElasticsearchJson to support NDJSON, JSON Array, and Single JSON 
> input formats with size-based batching
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-15681
>                 URL: https://issues.apache.org/jira/browse/NIFI-15681
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>    Affects Versions: 2.8.0
>         Environment: Containerized NiFi 2.8.0 on Rhel 9
>            Reporter: Adam Turley
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The existing PutElasticsearchJson processor is limited to indexing one JSON 
> document per FlowFile. This creates significant overhead in high-volume 
> ingest scenarios, requiring upstream flow logic to reshape data before it can 
> be sent to Elasticsearch. Additionally, ingesting large datasets requires one 
> FlowFile per document, creating excessive NiFi session overhead and making it 
> impractical to send pre-aggregated NDJSON or JSON array payloads directly.
>
> This improvement enhances PutElasticsearchJson in-place while remaining fully 
> backwards compatible with existing flows. No schema, Record Reader, or schema 
> registry is required — JSON is passed through directly, making it suitable 
> for dynamic or schema-less documents.
>
> Why not PutElasticsearchRecord?
>
> PutElasticsearchRecord is the right choice when data arrives in a structured, 
> well-known format (Avro, CSV, Parquet, etc.) and field-level type mapping, 
> schema enforcement, or schema evolution is needed. However, it introduces 
> significant overhead that is unnecessary in many JSON ingest pipelines:
>  * Schema requirement — a Record Reader and schema (via schema registry, 
> inferred, or embedded) must be defined and maintained. For JSON data with 
> dynamic fields, deeply nested structures, or schema-less designs, this is a 
> configuration burden with no benefit.
>  * Deserialization cost — PutElasticsearchRecord fully deserializes the input 
> into NiFi's internal Record object model and then re-serializes it to JSON 
> for the _bulk request. This is a two-way type conversion for data that is 
> already valid JSON, adding CPU and memory overhead on every document.
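The three input formats named in the issue could be normalized into individual documents along these lines (an illustrative sketch only, not the actual implementation; the function name and format labels are my own):

```python
import json

def split_documents(payload, input_format):
    """Split a FlowFile payload into individual JSON document strings.

    Formats mirror the issue: "ndjson" (one document per line), "array"
    (a top-level JSON array), and "single" (one document per FlowFile,
    the original PutElasticsearchJson behavior).
    """
    if input_format == "ndjson":
        return [line for line in payload.splitlines() if line.strip()]
    if input_format == "array":
        return [json.dumps(doc) for doc in json.loads(payload)]
    if input_format == "single":
        return [payload]
    raise ValueError(f"unknown input format: {input_format}")

docs = split_documents('[{"a": 1}, {"a": 2}]', "array")
```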



--
This message was sent by Atlassian Jira
(v8.20.10#820010)