[jira] [Updated] (NIFI-11985) Implement a processor to consume documents from Elasticsearch indices

Chris Sampson (Jira) Wed, 23 Aug 2023 12:32:29 -0700


     [ 
https://issues.apache.org/jira/browse/NIFI-11985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Chris Sampson updated NIFI-11985:
---------------------------------
    Description: 
It is possible to use Elasticsearch to store series data, i.e. data is 
continually added to an Elasticsearch index over time, with a {{date}} or a 
1-up numeric {{long}} field.

This is more likely with the advent of [Data 
Streams](https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html)
 or the recent [Time Series Data 
Streams](https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html),
 both of which use a {{@timestamp}} field to indicate when a document was added 
to the stream.

There are use cases where NiFi users may want to consume new data from the 
Elasticsearch index/data stream after it's arrived, then pass it to another 
service.

NiFi would need to:
* know which field to use as the "series field" (e.g. {{@timestamp}})
* track the last read "series field" value via State so that the same documents 
are not retrieved from Elasticsearch multiple times
* allow for the optional specification of the "last read" field value, e.g. if 
a user wants to offset the start of the documents to be read (this value should 
only be used if no value already exists within the processor's State)
* allow for the fact that the "last read" value will be blank when the 
processor is first run (and the value is not otherwise specified), meaning we 
want to retrieve all existing data
* allow for users to specify an optional Query Filter to apply to the search 
within Elasticsearch when finding documents to retrieve
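
The requirements above could be sketched roughly as follows. This is an 
illustration only, not a proposed implementation: the class name, method names 
and the {{last_read_value}} State key are hypothetical, the State is modelled 
as a plain Map rather than NiFi's StateManager, and the JSON is assembled by 
hand without escaping to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class SeriesQuerySketch {

    // Hypothetical State key; the ticket does not name one.
    static final String STATE_KEY = "last_read_value";

    /**
     * Decide where to resume from. A value held in State always wins, the
     * user-configured initial value is only a fallback, and an empty result
     * means "first run with nothing configured, so read everything".
     */
    static Optional<String> resolveLastRead(Map<String, String> state, String configuredInitialValue) {
        String fromState = state.get(STATE_KEY);
        if (fromState != null && !fromState.isEmpty()) {
            return Optional.of(fromState);
        }
        if (configuredInitialValue != null && !configuredInitialValue.isEmpty()) {
            return Optional.of(configuredInitialValue);
        }
        return Optional.empty();
    }

    /**
     * Build a search body that returns documents whose series field is greater
     * than the last read value, narrowed by an optional user-supplied Query
     * Filter (a JSON query clause), sorted ascending on the series field.
     */
    static String buildQuery(String seriesField, Optional<String> lastRead, String queryFilterJson) {
        List<String> filters = new ArrayList<>();
        // Assumption: the value is sent as a JSON string and no escaping is applied.
        lastRead.ifPresent(v -> filters.add(
                "{\"range\":{\"" + seriesField + "\":{\"gt\":\"" + v + "\"}}}"));
        if (queryFilterJson != null && !queryFilterJson.trim().isEmpty()) {
            filters.add(queryFilterJson.trim());
        }
        String query = filters.isEmpty()
                ? "{\"match_all\":{}}"
                : "{\"bool\":{\"filter\":[" + String.join(",", filters) + "]}}";
        return "{\"query\":" + query + ",\"sort\":[{\"" + seriesField + "\":\"asc\"}]}";
    }
}
```

With an empty State and no configured value, {{buildQuery}} falls back to a 
{{match_all}} query, which corresponds to retrieving all existing data on the 
first run.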

Possible implementations should consider using the {{SearchElasticsearch}} 
processor as a basis, which already uses State tracking between processor 
executions and allows for the retrieval of Elasticsearch documents in a 
paginated manner (thus avoiding pulling too much data in a single request).
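
As a rough illustration of combining pagination with State tracking, the loop 
below pulls pages until an empty one is returned and reports the last series 
value seen, which is what would then be written to State. It does not reflect 
how {{SearchElasticsearch}} is actually written: {{Hit}}, {{consume}} and the 
{{fetchPage}} function are hypothetical stand-ins for a real Elasticsearch 
search sorted ascending on the series field.

```java
import java.util.List;
import java.util.function.Function;

final class PagedConsumeSketch {

    /** One retrieved document: its series field value and its raw JSON. */
    static final class Hit {
        final String seriesValue;
        final String json;

        Hit(String seriesValue, String json) {
            this.seriesValue = seriesValue;
            this.json = json;
        }
    }

    /**
     * Pull pages until an empty page comes back. fetchPage receives the last
     * series value seen so far (null when nothing has been read yet) and must
     * return the next page of documents after that value, in ascending order.
     * Returns the value to store in State; this is the starting value if no
     * new documents were found.
     */
    static String consume(String lastRead, Function<String, List<Hit>> fetchPage, List<String> out) {
        String cursor = lastRead;
        while (true) {
            List<Hit> page = fetchPage.apply(cursor);
            if (page.isEmpty()) {
                return cursor;
            }
            for (Hit hit : page) {
                out.add(hit.json);
            }
            // The page is sorted ascending, so its last hit carries the highest value.
            cursor = page.get(page.size() - 1).seriesValue;
        }
    }
}
```

One caveat with this simplified cursor: if several documents share the same 
series value across a page boundary, a strict "greater than" comparison could 
skip some of them, so a real implementation would likely need a tiebreaker.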

  was:
It is possible to use Elasticsearch to store series data, i.e. data is 
continually added to an Elasticsearch index over time, with a {{date}} or a 
1-up numeric {{long}} field.

This is more likely with the advent of [Data 
Streams](https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html)
 or the recent [Time Series Data 
Streams](https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html),
 both of which use a {{@timestamp}} field to indicate when a document was added 
to the stream.

There are use cases where NiFi users may want to consume new data from the 
Elasticsearch index/data stream after it's arrived, then pass it to another 
service.

NiFi would need to know which field to use as the "series field" (e.g. 
{{@timestamp}}) and track this via State so that the same documents are not 
retrieved from Elasticsearch multiple times. Possible implementations should 
consider using the {{SearchElasticsearch}} processor as a basis, which already 
uses State tracking between processor executions and allows for the retrieval 
of Elasticsearch documents in a paginated manner (thus avoiding pulling too 
much data in a single request).


> Implement a processor to consume documents from Elasticsearch indices
> ---------------------------------------------------------------------
>
>                 Key: NIFI-11985
>                 URL: https://issues.apache.org/jira/browse/NIFI-11985
>             Project: Apache NiFi
>          Issue Type: New Feature
>            Reporter: Chris Sampson
>            Priority: Minor
>
> It is possible to use Elasticsearch to store series data, i.e. data is 
> continually added to an Elasticsearch index over time, with a {{date}} or a 
> 1-up numeric {{long}} field.
> This is more likely with the advent of [Data 
> Streams](https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html)
>  or the recent [Time Series Data 
> Streams](https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html),
>  both of which use a {{@timestamp}} field to indicate when a document was 
> added to the stream.
> There are use cases where NiFi users may want to consume new data from the 
> Elasticsearch index/data stream after it's arrived, then pass it to another 
> service.
> NiFi would need to:
> * know which field to use as the "series field" (e.g. {{@timestamp}})
> * track the last read "series field" value via State so that the same 
> documents are not retrieved from Elasticsearch multiple times
> * allow for the optional specification of the "last read" field value, e.g. 
> if a user wants to offset the start of the documents to be read (this value 
> should only be used if no value already exists within the processor's 
> State)
> * allow for the fact that the "last read" value will be blank when the 
> processor is first run (and the value is not otherwise specified), meaning we 
> want to retrieve all existing data
> * allow for users to specify an optional Query Filter to apply to the search 
> within Elasticsearch when finding documents to retrieve
> Possible implementations should consider using the {{SearchElasticsearch}} 
> processor as a basis, which already uses State tracking between processor 
> executions and allows for the retrieval of Elasticsearch documents in a 
> paginated manner (thus avoiding pulling too much data in a single request).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
