[
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Koji Kawamura updated NIFI-3248:
--------------------------------
Description:
GetSolr holds the last query timestamp so that it only fetches documents that
have been added or updated since the last query.
However, GetSolr misses some of those updated documents, and once a document's
date field value becomes older than the last query timestamp, GetSolr can no
longer retrieve that document.
This JIRA tracks the investigation of this behavior and the related
discussion.
Here are the possible causes of this behavior:
|#|Short description|Should we address it?|
|1|Timestamp range filter, curly or square bracket?|No|
|2|Timezone difference between update and query|Additional docs might be helpful|
|3|Lag comes from CommitWithin|Should be documented at least|
h2. 1. Timestamp range filter, curly or square bracket?
At first glance, mixing curly and square brackets looks strange
([source code|https://github.com/apache/nifi/blob/support/nifi-0.5.x/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L202]).
But the difference is meaningful.
A square bracket in a range query is inclusive and a curly bracket is
exclusive. If both sides were inclusive and a document had a timestamp
exactly on the boundary, it could be returned in two consecutive
executions, and we only want it in one.
This is intentional and should stay as it is.
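The boundary behavior can be illustrated with a minimal Java sketch. It is not GetSolr's actual code: the class, the helper names and the field name "timestamp" are hypothetical, and it assumes the lower bound is the exclusive side and the upper bound the inclusive side.

```java
import java.time.Instant;

final class RangeFilterSketch {

    // Hypothetical helper: builds a Solr range filter whose lower bound is
    // exclusive ('{') and whose upper bound is inclusive (']').
    // Which side is exclusive is an assumption made for this sketch.
    static String rangeFilter(String field, Instant lastEnd, Instant now) {
        return field + ":{" + lastEnd + " TO " + now + "]";
    }

    // The same semantics evaluated locally: lower < ts <= upper.
    static boolean matches(Instant ts, Instant lowerExclusive, Instant upperInclusive) {
        return ts.isAfter(lowerExclusive) && !ts.isAfter(upperInclusive);
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2016-12-27T06:00:00Z");
        Instant t1 = Instant.parse("2016-12-27T06:10:00Z");
        Instant t2 = Instant.parse("2016-12-27T06:20:00Z");

        // Filters for two consecutive executions; t1 is the shared boundary.
        System.out.println(rangeFilter("timestamp", t0, t1));
        System.out.println(rangeFilter("timestamp", t1, t2));

        // A document stamped exactly at t1 is checked against both windows.
        System.out.println("first run matches:  " + matches(t1, t0, t1));
        System.out.println("second run matches: " + matches(t1, t1, t2));
    }
}
```

With half-open windows like these, a document stamped exactly on the shared boundary falls into the first window only, so consecutive executions never return it twice.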
h2. 2. Timezone difference between update and query
Solr treats date fields as [UTC
representation|https://cwiki.apache.org/confluence/display/solr/Working+with+Dates].
If the date field string value of an updated document represents a time without
a timezone, and NiFi is running in an environment using a timezone other than
UTC, GetSolr can't perform the date range query as users expect.
Let's say NiFi is running in JST (UTC+9). A process adds a document to Solr
at 15:00 JST, but the date field value has no timezone, so Solr indexes it as
15:00 UTC. Then GetSolr performs a range query at 15:10 JST, targeting any
documents updated from 15:00 to 15:10 JST. GetSolr formats dates using UTC,
i.e. 6:00 to 6:10 UTC, so the updated document does not match the date
range filter.
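The mismatch can be reproduced with plain java.time arithmetic. This is only a sketch: the class and method names are hypothetical, the window check assumes an exclusive lower bound and an inclusive upper bound, and the document time is 15:05 rather than 15:00 so that the boundary rule does not interfere with the result.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

final class TimezoneMismatchSketch {

    // A value without a timezone is read as if it were already UTC.
    static Instant asSolrReadsIt(LocalDateTime zoneless) {
        return zoneless.toInstant(ZoneOffset.UTC);
    }

    // What the writer actually meant: wall-clock time in its own zone.
    static Instant asWriterMeantIt(LocalDateTime zoneless, ZoneId zone) {
        return zoneless.atZone(zone).toInstant();
    }

    // lower < ts <= upper (assumed bracket orientation).
    static boolean inWindow(Instant ts, Instant lowerExclusive, Instant upperInclusive) {
        return ts.isAfter(lowerExclusive) && !ts.isAfter(upperInclusive);
    }

    public static void main(String[] args) {
        ZoneId jst = ZoneId.of("Asia/Tokyo");
        LocalDateTime wallClock = LocalDateTime.of(2016, 12, 27, 15, 5);

        // Query window 15:00 to 15:10 JST, expressed in UTC.
        Instant lower = LocalDateTime.of(2016, 12, 27, 15, 0).atZone(jst).toInstant();
        Instant upper = LocalDateTime.of(2016, 12, 27, 15, 10).atZone(jst).toInstant();
        System.out.println("window (UTC): " + lower + " to " + upper);

        Instant indexed = asSolrReadsIt(wallClock);
        Instant intended = asWriterMeantIt(wallClock, jst);
        System.out.println("indexed as:  " + indexed + " in window? " + inWindow(indexed, lower, upper));
        System.out.println("intended as: " + intended + " in window? " + inWindow(intended, lower, upper));
    }
}
```

The zoneless value ends up nine hours later than the writer intended, so it falls outside the window that GetSolr queries.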
To avoid this, updated documents must carry a proper timezone in the date field
string representation.
If one uses NiFi expression language to set the current timestamp in that date
field, the following NiFi expression can be used:
{code}
${now():format("yyyy-MM-dd'T'HH:mm:ss.SSSZ")}
{code}
It will produce a result like:
{code}
2016-12-27T15:30:04.895+0900
{code}
Solr then indexes it in UTC, and GetSolr can query it as expected.
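As a rough check of why the offset matters, the Java sketch below formats a JST time with the same pattern and converts the result back to a UTC instant. It uses java.time's DateTimeFormatter rather than whatever formatter NiFi uses internally, and the class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class OffsetTimestampSketch {

    // Same pattern string as in the NiFi expression above; the trailing 'Z'
    // emits the zone offset in +HHMM form.
    static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ", Locale.ROOT);

    static String withOffset(ZonedDateTime time) {
        return time.format(FORMAT);
    }

    // The offset makes the string unambiguous, so it maps to one UTC instant.
    static Instant toUtc(String text) {
        return OffsetDateTime.parse(text, FORMAT).toInstant();
    }

    public static void main(String[] args) {
        ZonedDateTime jstTime = ZonedDateTime.of(
                2016, 12, 27, 15, 30, 4, 895_000_000, ZoneId.of("Asia/Tokyo"));

        String text = withOffset(jstTime);
        System.out.println(text);
        System.out.println(toUtc(text));
    }
}
```

Because the string carries its own offset, it identifies the same instant no matter which timezone the reading side runs in.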
h2. 3. Lag comes from CommitWithin
> GetSolr cannot query newly added documents
> ------------------------------------------
>
> Key: NIFI-3248
> URL: https://issues.apache.org/jira/browse/NIFI-3248
> Project: Apache NiFi
> Issue Type: Bug
> Components: Extensions
> Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1,
> 1.0.1
> Reporter: Koji Kawamura
> Priority: Minor
> Attachments: nifi-flow.png, query-result-with-curly-bracket.png,
> query-result-with-square-bracket.png
>
>