markap14 commented on a change in pull request #3223: NIFI-5903: Allow
RecordPath to be used in QueryRecord processor. Also…
URL: https://github.com/apache/nifi/pull/3223#discussion_r255561272
##########
File path:
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/resources/docs/org.apache.nifi.processors.standard.QueryRecord/additionalDetails.html
##########
@@ -30,19 +31,525 @@
that is responsible for writing the results out. By using this
paradigm, users are not forced to
convert their data from one format to another just to query it,
and then transform the data back
into the form that they want. Rather, the appropriate
Controller Service can easily be configured
- and put to use for the appropriate data format.
+ and put to use for the appropriate data format.
</p>
-
+
<p>
Rather than providing a single "SQL SELECT Statement" type of
Property, this Processor makes use
of user-defined properties. Each user-defined property that is
added to the Processor has a name
that becomes a new Relationship for the Processor and a
corresponding SQL query that will be evaluated
against each FlowFile. This allows multiple SQL queries to be
run against each FlowFile.
</p>
-
+
<p>
The SQL syntax that is supported by this Processor is
ANSI SQL and is powered by Apache Calcite. Please
note that identifiers are quoted using double-quotes,
and column names/labels are case-insensitive.
</p>
+
+ <p>
+ As an example, let's consider that we have a FlowFile with the
following CSV data:
+ </p>
+ <pre><code>
+ name, age, title
+ John Doe, 34, Software Engineer
+ Jane Doe, 30, Program Manager
+ Jacob Doe, 45, Vice President
+ Janice Doe, 46, Vice President
+ </code></pre>
+
+ <p>
+ Now consider that we add the following properties to the Processor:
+ </p>
+ <table>
+ <tr>
+ <th>Property Name</th>
+ <th>Property Value</th>
+ </tr>
+ <tr>
+ <td>Engineers</td>
+ <td>SELECT * FROM FLOWFILE WHERE title LIKE '%Engineer%'</td>
+ </tr>
+ <tr>
+ <td>VP</td>
+ <td>SELECT name FROM FLOWFILE WHERE title = 'Vice
President'</td>
+ </tr>
+ <tr>
+ <td>Younger Than Average</td>
+ <td>SELECT * FROM FLOWFILE WHERE age < (SELECT AVG(age) FROM
FLOWFILE)</td>
+ </tr>
+ </table>
+
+ <p>
+ This Processor will now have five relationships:
<code>original</code>, <code>failure</code>, <code>Engineers</code>,
<code>VP</code>, and <code>Younger Than Average</code>.
+ If there is a failure processing the FlowFile, then the original
FlowFile will be routed to <code>failure</code>. Otherwise, the original
FlowFile will be routed to <code>original</code>
+ and one FlowFile will be routed to each of the other
relationships, with the following values:
+ </p>
+
+ <table>
+ <tr>
+ <th>Relationship Name</th>
+ <th>FlowFile Value</th>
+ </tr>
+ <tr>
+ <td>Engineers</td>
+ <td>
+ <pre><code>
+ name, age, title
+ John Doe, 34, Software Engineer
+ </code></pre>
+ </td>
+ </tr>
+ <tr>
+ <td>VP</td>
+ <td>
+ <pre><code>
+ name
+ Jacob Doe
+ Janice Doe
+ </code></pre>
+ </td>
+ </tr>
+ <tr>
+ <td>Younger Than Average</td>
+ <td>
+ <pre><code>
+ name, age, title
+ John Doe, 34, Software Engineer
+ Jane Doe, 30, Program Manager
+ </code></pre>
+ </td>
+ </tr>
+ </table>
+
+ <p>
+ Note that this example is intended to illustrate the data that is
input and output from the Processor. The actual format of the data may vary,
depending on the configuration of the
+ Record Reader and Record Writer that is used. For example, here we
assume that we are using a CSV Reader and a CSV Writer and that both are
configured to have a header line. Should we have
+ used a JSON Writer instead, the output would have contained the
same information but been presented in JSON Output. The user is able to choose
which input and output format make the most
+ since for his or her use case. The input and output formats need
not be the same.
+ </p>
+
+ <p>
+ It is also worth noting that the outbound FlowFiles have two
different schemas. The <code>Engineers</code> and <code>Younger Than
Average</code> FlowFiles contain 3 fields:
+ <code>name</code>, <code>age</code>, and <code>title</code> while
the <code>VP</code> FlowFile contains only the <code>name</code> field. In most
cases, the Record Writer is configured to
+ use whatever Schema is provided to it by the Record (this
generally means that it is configured with a <code>Schema Access
Strategy</code> of <code>Inherit Record Schema</code>). In such
+ a case, this works well. However, if a Schema is supplied to the
Record Writer explicitly, it is important to ensure that the Schema accounts
for all fields. If not, then then the
+ fields that are missing from the Record Writer's schema will
simply not be present in the output.
+ </p>
+
+
+ <h3>SQL Over Hierarchical Data</h3>
+ <p>
+ One important detail that we must taken into account when
evaluating SQL over streams of arbitrary data is how
+ we can handle hierarchical data, such as JSON, XML, and Avro.
Because SQL was developed originally for relational databases, which
+ represent "flat" data, it is easy to understand how this would map
to other "flat" data like a CSV file. Or even
+ a "flat" JSON representation where all fields are primitive types.
However, in many cases, users encounter cases where they would like to evaluate
SQL
+ over JSON or Avro data that is made up of many nested values. For
example, consider the following JSON as input:
+ </p>
+
+ <pre><code>
+ {
+ "name": "John Doe",
+ "title": "Software Engineer",
+ "age": 40,
+ "addresses": [{
+ "streetNumber": 4820,
+ "street": "My Street",
+ "apartment": null,
+ "city": "New York",
+ "state": "NY",
+ "country": "USA",
+ "label": "work"
+ }, {
+ "streetNumber": 327,
+ "street": "Small Street",
+ "apartment": 309,
+ "city": "Los Angeles",
+ "state": "CA",
+ "country": "USA",
+ "label": "home"
+ }],
+ "project": {
+ "name": "Apache NiFi",
+ "maintainer": {
+ "id": 28302873,
+ "name": "Apache Software Foundation"
+ },
+ "debutYear": 2014
+ }
+ }
+ </code></pre>
+
+ <p>
+ Consider a query that will select the title and name of any person
who has a home address in a different state
+ than their work address. Here, we can only select the fields
<code>name</code>, <code>title</code>,
+ <code>age</code>, and <code>addresses</code>. In this scenario,
<code>addresses</code> represents an Array of complex
+ objects - records. In order to accommodate for this, QueryRecord
provides User-Defined Functions to enable
+ <a href="../../../../../html/record-path-guide.html">Record
Path</a> to be used. Record Path is a simple NiFi Domain Specific Language (DSL)
+ that allows users to reference a nested structure.
+ </p>
+
+ <p>
+ The primary User-Defined Function that will be used is named
<code>RPATH</code> (short for Record Path). This function expects exactly two
arguments:
+ the Record to evaluate the RecordPath against, and the RecordPath
to evaluate (in that order).
+ So, to select the title and name of any person who has a home
address in a different state than their work address, we can use
+ the following SQL statement:
+ </p>
+
+ <code><pre>
+ SELECT title, name
+ FROM FLOWFILE
+ WHERE RPATH(addresses, '/state[/label = ''home'']') <>
+ RPATH(addresses, '/state[/label = ''work'']')
+ </pre></code>
+
+ <p>
+ To explain this query in English, we can say that it selects the
"title" and "name" fields from any Record in the FlowFile for which there is an
address whose "label" is "home" and
+ another address whose "label" is "work" and for which the two
addreses have different states.
+ </p>
+
+ <p>
+ Similarly, we could select the entire Record (all fields) of any
person who has a "project" whose maintainer is the Apache Software Foundation
using the query:
+ </p>
+
+ <code><pre>
+ SELECT *
+ FROM FLOWFILE
+ WHERE RPATH(project, '/maintainer/name') = 'Apache Software
Foundation'
+ </pre></code>
+
+ <p>
+ There does exist a caveat, though, when using RecordPath. That is
that the <code>RPATH</code> function returns an <code>Object</code>, which in
JDBC is represented as an <code>OTHER</code>
+ type. This is fine and does not affect anything when it is used
like above. However, what if we wanted to use another SQL function on the
result? For example, what if we wanted to use
+ the SQL query <code>SELECT * FROM FLOWFILE WHERE RPATH(project,
'/maintainer/name') LIKE 'Apache%'</code>? This would fail with a very long
error such as:
+ </p>
+
+ <code><pre>
Review comment:
Re: coercion: I don't think we have all necessary information to determine
that we need to perform the coercion before such an Exception occurs. I could
be wrong, but I don't see a clean way to avoid it...
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services