Adam Turley created NIFI-15681:
----------------------------------

             Summary: Enhance PutElasticsearchJson to support NDJSON, JSON 
Array, and Single JSON input formats with size-based batching
                 Key: NIFI-15681
                 URL: https://issues.apache.org/jira/browse/NIFI-15681
             Project: Apache NiFi
          Issue Type: Improvement
          Components: Extensions
    Affects Versions: 2.8.0
         Environment: Containerized NiFi 2.8.0 on RHEL 9
            Reporter: Adam Turley


The existing PutElasticsearchJson processor is limited to indexing one JSON 
document per FlowFile. In high-volume ingest scenarios, upstream flow logic must 
therefore reshape the data into one FlowFile per document before it can be sent 
to Elasticsearch. For large datasets this creates excessive NiFi session 
overhead and makes it impractical to send pre-aggregated NDJSON or JSON array 
payloads directly.

This improvement enhances PutElasticsearchJson in place while remaining fully 
backwards compatible with existing flows. No schema, Record Reader, or schema 
registry is required: JSON is passed through directly, which makes the processor 
suitable for dynamic or schema-less documents.
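
To illustrate the three input formats and size-based batching named in the 
summary, here is a minimal Python sketch. It is not the processor's code, and 
this ticket does not describe the implementation; the function names, the 
format labels ("ndjson", "json_array", "single_json"), and the max_bytes 
parameter are all hypothetical.

```python
import json
from typing import Iterable, Iterator, List


def split_documents(payload: str, input_format: str) -> Iterator[str]:
    """Yield one JSON document string per document in the payload.

    The format labels are illustrative, not real property values.
    """
    if input_format == "ndjson":
        # One document per non-blank line. Split on "\n" only, so that other
        # line-separator characters inside JSON strings are left alone.
        for line in payload.split("\n"):
            line = line.strip()
            if line:
                yield line
    elif input_format == "json_array":
        # A top-level array; each element becomes one document. Parsing and
        # re-serializing here is a simplification made for this sketch.
        for element in json.loads(payload):
            yield json.dumps(element, separators=(",", ":"))
    elif input_format == "single_json":
        # The whole payload is a single document.
        yield payload.strip()
    else:
        raise ValueError(f"unknown input format: {input_format}")


def batch_by_size(docs: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Group documents into batches of at most max_bytes of UTF-8 text.

    A single document larger than max_bytes is emitted as its own batch.
    """
    batch: List[str] = []
    size = 0
    for doc in docs:
        doc_size = len(doc.encode("utf-8"))
        # Flush the current batch before it would exceed the limit.
        if batch and size + doc_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_size
    if batch:
        yield batch
```

Under these assumptions, each batch would correspond to one request to 
Elasticsearch, so the request count depends on payload size rather than on the 
number of FlowFiles.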



Why not PutElasticsearchRecord?

PutElasticsearchRecord is the right choice when data arrives in a structured, 
well-known format (Avro, CSV, Parquet, etc.) and field-level type mapping, 
schema enforcement, or schema evolution is needed. However, it introduces 
significant overhead that is unnecessary in many JSON ingest pipelines:
 * Schema requirement: a Record Reader and a schema (from a schema registry, 
inferred, or embedded) must be defined and maintained. For JSON data with 
dynamic fields, deeply nested structures, or schema-less designs, this is a 
configuration burden with no benefit.
 * Deserialization cost: PutElasticsearchRecord fully deserializes the input 
into NiFi's internal Record object model and then re-serializes it to JSON for 
the _bulk request. This is a round-trip conversion for data that is already 
valid JSON, adding CPU and memory overhead on every document.
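
The pass-through alternative to that round trip can be sketched as follows. 
This is an illustration only, not the processor's implementation: the 
build_bulk_body helper and the "logs" index name are hypothetical. It relies on 
the Elasticsearch _bulk body format, in which each document line is preceded 
by an action line and the body ends with a newline.

```python
import json
from typing import Iterable


def build_bulk_body(raw_docs: Iterable[str], index: str) -> str:
    """Build a _bulk request body from JSON documents that are already text.

    Each document must be a single line (no raw newlines), as in NDJSON.
    The document text is copied as-is and is never parsed.
    """
    # Only the small action line is serialized; it is the same for every doc.
    action = json.dumps({"index": {"_index": index}}, separators=(",", ":"))
    lines = []
    for doc in raw_docs:
        lines.append(action)
        lines.append(doc)
    if not lines:
        return ""
    # The _bulk body must be terminated by a final newline.
    return "\n".join(lines) + "\n"
```

Because the document text is never parsed, its original whitespace and field 
order reach Elasticsearch unchanged, and the per-document cost is a string 
copy rather than a parse and a re-serialization.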



--
This message was sent by Atlassian Jira
(v8.20.10#820010)