[GitHub] markap14 commented on a change in pull request #3223: NIFI-5903: Allow RecordPath to be used in QueryRecord processor. Also…

GitBox Mon, 11 Feb 2019 07:39:41 -0800

markap14 commented on a change in pull request #3223: NIFI-5903: Allow 
RecordPath to be used in QueryRecord processor. Also…
URL: https://github.com/apache/nifi/pull/3223#discussion_r255561272


 ##########
 File path: 
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/resources/docs/org.apache.nifi.processors.standard.QueryRecord/additionalDetails.html
 ##########
 @@ -30,19 +31,525 @@
                that is responsible for writing the results out. By using this 
paradigm, users are not forced to
                convert their data from one format to another just to query it, 
and then transform the data back
                into the form that they want. Rather, the appropriate 
Controller Service can easily be configured
-               and put to use for the appropriate data format. 
+               and put to use for the appropriate data format.
        </p>
-       
+
        <p>
                Rather than providing a single "SQL SELECT Statement" type of 
Property, this Processor makes use
                of user-defined properties. Each user-defined property that is 
added to the Processor has a name
                that becomes a new Relationship for the Processor and a 
corresponding SQL query that will be evaluated
                against each FlowFile. This allows multiple SQL queries to be 
run against each FlowFile.
        </p>
-       
+
        <p>
                        The SQL syntax that is supported by this Processor is 
ANSI SQL and is powered by Apache Calcite. Please
                        note that identifiers are quoted using double-quotes, 
and column names/labels are case-insensitive.
        </p>
+
+        <p>
+            As an example, let's consider that we have a FlowFile with the 
following CSV data:
+        </p>
+        <pre><code>
+            name, age, title
+            John Doe, 34, Software Engineer
+            Jane Doe, 30, Program Manager
+            Jacob Doe, 45, Vice President
+            Janice Doe, 46, Vice President
+        </code></pre>
+
+        <p>
+            Now consider that we add the following properties to the Processor:
+        </p>
+        <table>
+            <tr>
+                <th>Property Name</th>
+                <th>Property Value</th>
+            </tr>
+            <tr>
+                <td>Engineers</td>
+                <td>SELECT * FROM FLOWFILE WHERE title LIKE '%Engineer%'</td>
+            </tr>
+            <tr>
+                <td>VP</td>
+                <td>SELECT name FROM FLOWFILE WHERE title = 'Vice 
President'</td>
+            </tr>
+            <tr>
+                <td>Younger Than Average</td>
+                <td>SELECT * FROM FLOWFILE WHERE age < (SELECT AVG(age) FROM 
FLOWFILE)</td>
+            </tr>
+        </table>
+
+        <p>
+            This Processor will now have five relationships: 
<code>original</code>, <code>failure</code>, <code>Engineers</code>, 
<code>VP</code>, and <code>Younger Than Average</code>.
+            If there is a failure processing the FlowFile, then the original 
FlowFile will be routed to <code>failure</code>. Otherwise, the original 
FlowFile will be routed to <code>original</code>
+            and one FlowFile will be routed to each of the other 
relationships, with the following values:
+        </p>
+
+        <table>
+            <tr>
+                <th>Relationship Name</th>
+                <th>FlowFile Value</th>
+            </tr>
+            <tr>
+                <td>Engineers</td>
+                <td>
+                    <pre><code>
+                        name, age, title
+                        John Doe, 34, Software Engineer
+                    </code></pre>
+                </td>
+            </tr>
+            <tr>
+                <td>VP</td>
+                <td>
+                    <pre><code>
+                        name
+                        Jacob Doe
+                        Janice Doe
+                    </code></pre>
+                </td>
+            </tr>
+            <tr>
+                <td>Younger Than Average</td>
+                <td>
+                    <pre><code>
+                        name, age, title
+                        John Doe, 34, Software Engineer
+                        Jane Doe, 30, Program Manager
+                    </code></pre>
+                </td>
+            </tr>
+        </table>
+
+        <p>
+            Note that this example is intended to illustrate the data that is 
input and output from the Processor. The actual format of the data may vary, 
depending on the configuration of the
+            Record Reader and Record Writer that is used. For example, here we 
assume that we are using a CSV Reader and a CSV Writer and that both are 
configured to have a header line. Should we have
+            used a JSON Writer instead, the output would have contained the 
same information but been presented in JSON Output. The user is able to choose 
which input and output format make the most
+            since for his or her use case. The input and output formats need 
not be the same.
+        </p>
+
+        <p>
+            It is also worth noting that the outbound FlowFiles have two 
different schemas. The <code>Engineers</code> and <code>Younger Than 
Average</code> FlowFiles contain 3 fields:
+            <code>name</code>, <code>age</code>, and <code>title</code> while 
the <code>VP</code> FlowFile contains only the <code>name</code> field. In most 
cases, the Record Writer is configured to
+            use whatever Schema is provided to it by the Record (this 
generally means that it is configured with a <code>Schema Access 
Strategy</code> of <code>Inherit Record Schema</code>). In such
+            a case, this works well. However, if a Schema is supplied to the 
Record Writer explicitly, it is important to ensure that the Schema accounts 
for all fields. If not, then then the
+            fields that are missing from the Record Writer's schema will 
simply not be present in the output.
+        </p>
+
+
+        <h3>SQL Over Hierarchical Data</h3>
+        <p>
+            One important detail that we must taken into account when 
evaluating SQL over streams of arbitrary data is how
+            we can handle hierarchical data, such as JSON, XML, and Avro. 
Because SQL was developed originally for relational databases, which
+            represent "flat" data, it is easy to understand how this would map 
to other "flat" data like a CSV file. Or even
+            a "flat" JSON representation where all fields are primitive types. 
However, in many cases, users encounter cases where they would like to evaluate 
SQL
+            over JSON or Avro data that is made up of many nested values. For 
example, consider the following JSON as input:
+        </p>
+
+        <pre><code>
+            {
+              "name": "John Doe",
+              "title": "Software Engineer",
+              "age": 40,
+              "addresses": [{
+                  "streetNumber": 4820,
+                  "street": "My Street",
+                  "apartment": null,
+                  "city": "New York",
+                  "state": "NY",
+                  "country": "USA",
+                  "label": "work"
+              }, {
+                  "streetNumber": 327,
+                  "street": "Small Street",
+                  "apartment": 309,
+                  "city": "Los Angeles",
+                  "state": "CA",
+                  "country": "USA",
+                  "label": "home"
+              }],
+              "project": {
+                  "name": "Apache NiFi",
+                  "maintainer": {
+                        "id": 28302873,
+                        "name": "Apache Software Foundation"
+                   },
+                  "debutYear": 2014
+              }
+            }
+        </code></pre>
+
+        <p>
+            Consider a query that will select the title and name of any person 
who has a home address in a different state
+            than their work address. Here, we can only select the fields 
<code>name</code>, <code>title</code>,
+            <code>age</code>, and <code>addresses</code>. In this scenario, 
<code>addresses</code> represents an Array of complex
+            objects - records. In order to accommodate for this, QueryRecord 
provides User-Defined Functions to enable
+            <a href="../../../../../html/record-path-guide.html">Record 
Path</a> to be used. Record Path is a simple NiFi Domain Specific Language (DSL)
+            that allows users to reference a nested structure.
+        </p>
+
+        <p>
+            The primary User-Defined Function that will be used is named 
<code>RPATH</code> (short for Record Path). This function expects exactly two 
arguments:
+            the Record to evaluate the RecordPath against, and the RecordPath 
to evaluate (in that order).
+            So, to select the title and name of any person who has a home 
address in a different state than their work address, we can use
+            the following SQL statement:
+        </p>
+
+        <code><pre>
+            SELECT title, name
+            FROM FLOWFILE
+            WHERE RPATH(addresses, '/state[/label = ''home'']') <>
+                  RPATH(addresses, '/state[/label = ''work'']')
+        </pre></code>
+
+        <p>
+            To explain this query in English, we can say that it selects the 
"title" and "name" fields from any Record in the FlowFile for which there is an 
address whose "label" is "home" and
+            another address whose "label" is "work" and for which the two 
addreses have different states.
+        </p>
+
+        <p>
+            Similarly, we could select the entire Record (all fields) of any 
person who has a "project" whose maintainer is the Apache Software Foundation 
using the query:
+        </p>
+
+        <code><pre>
+            SELECT *
+            FROM FLOWFILE
+            WHERE RPATH(project, '/maintainer/name') = 'Apache Software 
Foundation'
+        </pre></code>
+
+        <p>
+            There does exist a caveat, though, when using RecordPath. That is 
that the <code>RPATH</code> function returns an <code>Object</code>, which in 
JDBC is represented as an <code>OTHER</code>
+            type. This is fine and does not affect anything when it is used 
like above. However, what if we wanted to use another SQL function on the 
result? For example, what if we wanted to use
+            the SQL query <code>SELECT * FROM FLOWFILE WHERE RPATH(project, 
'/maintainer/name') LIKE 'Apache%'</code>? This would fail with a very long 
error such as:
+        </p>
+
+        <code><pre>
 
 Review comment:
   Re: coercion: I don't think we have all necessary information to determine 
that we need to perform the coercion before such an Exception occurs. I could 
be wrong, but I don't see a clean way to avoid it...

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

[GitHub] markap14 commented on a change in pull request #3223: NIFI-5903: Allow RecordPath to be used in QueryRecord processor. Also…

Reply via email to