[
https://issues.apache.org/jira/browse/NIFI-15681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063852#comment-18063852
]
Adam Turley commented on NIFI-15681:
------------------------------------
I've made a significant number of changes, but kept the original Single JSON
per FlowFile behavior intact to preserve backward compatibility — existing
flows should not require any reconfiguration.
In benchmarking, this typically runs 2–3x faster than PutElasticsearchRecord,
largely because no schema definition is required. The processor accepts raw
JSON directly, which also makes it much simpler to configure.
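For context on why raw JSON can be passed through: the Elasticsearch _bulk API consumes newline-delimited JSON, so documents can be concatenated into the request body without an intermediate record model. A minimal illustrative payload (the `logs` index name and fields are placeholders, not from the processor):

```
{ "index": { "_index": "logs" } }
{ "message": "first event", "level": "INFO" }
{ "index": { "_index": "logs" } }
{ "message": "second event", "level": "WARN" }
```

Each document contributes one action-metadata line plus one source line, which is also why a byte-based batch limit maps naturally onto this format.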
One naming change worth calling out: the existing "Batch Size" property (which
controls how many FlowFiles are grouped per Elasticsearch _bulk request) has
been renamed to "Max FlowFiles Per Batch". This was done to clearly distinguish
it from the new "Max Batch Size" property, which controls the maximum payload
size in bytes per request. A property migration is included so existing flows
upgrade automatically. (I'm also fine with changing this to something else if a
different name makes more sense.)
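To make the two limits concrete, here is a minimal sketch of the batching decision: a batch closes when either the FlowFile count or the byte budget would be exceeded. This is an illustration of the semantics only, not the processor's actual implementation, and the function name is hypothetical:

```python
def group_into_batches(flowfile_sizes, max_flowfiles_per_batch, max_batch_bytes):
    """Group FlowFile payload sizes (bytes, in arrival order) into batches.

    A batch is closed when adding the next FlowFile would exceed either
    "Max FlowFiles Per Batch" (count) or "Max Batch Size" (bytes).
    Returns a list of batches, each a list of sizes.
    """
    batches, current, current_bytes = [], [], 0
    for size in flowfile_sizes:
        # Close the current batch if this FlowFile would breach either limit.
        if current and (len(current) >= max_flowfiles_per_batch
                        or current_bytes + size > max_batch_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Note that a single FlowFile larger than the byte limit still forms its own batch rather than being dropped, which matches the usual "at least one item per request" convention for size-capped batching.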
Thanks for your time in advance!
> Enhance PutElasticsearchJson to support NDJSON, JSON Array, and Single JSON
> input formats with size-based batching
> ------------------------------------------------------------------------------------------------------------------
>
> Key: NIFI-15681
> URL: https://issues.apache.org/jira/browse/NIFI-15681
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Affects Versions: 2.8.0
> Environment: Containerized NiFi 2.8.0 on RHEL 9
> Reporter: Adam Turley
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The existing PutElasticsearchJson processor is limited to indexing one JSON
> document per FlowFile. This creates significant overhead in high-volume
> ingest scenarios, requiring upstream flow logic to reshape data before it can
> be sent to Elasticsearch. Additionally, ingesting large datasets requires one
> FlowFile per document, creating excessive NiFi session overhead and making it
> impractical to send pre-aggregated NDJSON or JSON array payloads directly.
> This improvement enhances PutElasticsearchJson in-place while remaining fully
> backwards compatible with existing flows. No schema, Record Reader, or schema
> registry is required — JSON is passed through directly, making it suitable
> for dynamic or schema-less documents.
> Why not PutElasticsearchRecord?
> PutElasticsearchRecord is the right choice when data arrives in a structured,
> well-known format (Avro, CSV, Parquet, etc.) and field-level type mapping,
> schema enforcement, or schema evolution is needed. However, it introduces
> significant overhead that is unnecessary in many JSON ingest pipelines:
> * Schema requirement — a Record Reader and schema (via schema registry,
> inferred, or embedded) must be defined and maintained. For JSON data with
> dynamic fields, deeply nested structures, or schema-less designs, this is a
> configuration burden with no benefit.
> * Deserialization cost — PutElasticsearchRecord fully deserializes the input
> into NiFi's internal Record object model and then re-serializes it to JSON
> for the _bulk request. This is a two-way type conversion for data that is
> already valid JSON, adding CPU and memory overhead on every document.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)