[ 
https://issues.apache.org/jira/browse/NIFI-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821114#comment-15821114
 ] 

ASF GitHub Bot commented on NIFI-2881:
--------------------------------------

Github user mattyb149 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/1407#discussion_r95801717
  
    --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/GenerateTableFetch.java ---
    @@ -115,20 +128,36 @@ public GenerateTableFetch() {
     
         @OnScheduled
         public void setup(final ProcessContext context) {
    +        // The processor is invalid if there is an incoming connection and max-value columns are defined
    +        if (context.getProperty(MAX_VALUE_COLUMN_NAMES).isSet() && context.hasIncomingConnection()) {
    +            throw new ProcessException("If an incoming connection is supplied, no max-value column names may be specified");
    --- End diff --
    
    I thought about supporting the older format, but that could lead to 
problems depending on which table name you pass in. Using your "users" and 
"purchase_histories" tables above, say I had been running the old version 
with a hard-coded "purchase_histories" table, which stores "last_updated" in 
the state map. With the new version, the first table name I pass in via an 
attribute is "users". I will not find "users.last_updated", so I would fall 
back to plain "last_updated", whose value belongs to the purchase_histories 
table, not the users table. This is an edge case, but I would hate to see the 
very first run of the processor fail when it used to work.
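
    To make the fallback problem concrete, here is a minimal sketch. The class 
and method names and the "table.column" key format are assumptions for 
illustration only; this is not the code in the pull request.

```java
import java.util.Map;

// Hypothetical helper (not from the PR): contrasts a lookup that falls back to
// the legacy column-only key with one that only accepts the compound key.
final class MaxValueStateLookup {

    // Assumed compound key format "<table>.<column>".
    static String compoundKey(String table, String column) {
        return table + "." + column;
    }

    // Tries the compound key first, then the legacy column-only key.
    // The legacy value may belong to a different table than the one requested.
    static String lookupWithLegacyFallback(Map<String, String> state, String table, String column) {
        String value = state.get(compoundKey(table, column));
        return value != null ? value : state.get(column);
    }

    // Only accepts the compound key; returns null when the table has no entry yet.
    static String lookupStrict(Map<String, String> state, String table, String column) {
        return state.get(compoundKey(table, column));
    }
}
```

    With a state map that holds only the legacy "last_updated" entry written 
for purchase_histories, the fallback lookup for "users" returns that other 
table's value, while the strict lookup returns null and the fetch for "users" 
would start from scratch.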
    
    I do like your example of a mapping of tables to max-value columns. I 
think other products (GoldenGate or Sqoop or something?) allow for this 
flexibility (you just have to provide your own map). If we end up supporting 
Max-Value columns with incoming connections, I will make sure this 
capability is present.
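
    As a rough illustration of such a user-supplied map, the sketch below 
parses a table-to-columns specification. The "table:col1,col2;table2:col3" 
syntax and the class name are invented for this example; neither the PR nor 
the products mentioned are known to use this format.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser (not from the PR) for a per-table max-value column map.
final class TableMaxValueColumns {

    // Parses entries separated by ';', each of the assumed form "table:col1,col2".
    static Map<String, List<String>> parse(String spec) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String entry : spec.split(";")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate empty segments such as a trailing ';'
            }
            String[] parts = trimmed.split(":", 2);
            if (parts.length != 2 || parts[0].trim().isEmpty()) {
                throw new IllegalArgumentException("Expected table:columns but got: " + trimmed);
            }
            List<String> columns = new ArrayList<>();
            for (String column : parts[1].split(",")) {
                if (!column.trim().isEmpty()) {
                    columns.add(column.trim());
                }
            }
            result.put(parts[0].trim(), columns);
        }
        return result;
    }
}
```

    A flow file naming the "users" table would then look up its own columns 
in the parsed map, and a table with no entry would simply have no max-value 
columns.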
    
    I'm most worried about the arbitrary number of entries in the state map. 
Once the total size gets above 1MB, I think ZooKeeper starts acting strangely, 
and I certainly wouldn't want this processor to affect the entire NiFi system. 
This too is an edge case, since I imagine most entries are small (~64 bytes 
max?), so it would only happen if the number of tables/columns was very large 
or, for some reason, the max-values were large. Historically I've seen limits 
placed on other NiFi resources (threads, e.g.) to ensure a discrete maximum 
and avoid the issues caused by arbitrarily large things.
    
    I would very much like to allow for the Max-Value columns. Do you have any 
suggestions on how the processor should behave in the face of an arbitrarily 
large state map? Perhaps we could set an artificial limit on the number of 
entries (implying a limit on the number of table/column pairs) and route 
future flow files (whose table is not yet present in the state map) to a 
"state map full" relationship or something.
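
    A minimal sketch of that idea follows, assuming a plain in-memory map in 
place of NiFi's state manager. The class, the Route enum, and the 
"table.column" key format are hypothetical names for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model (not from the PR) of a state map with a hard cap on entries.
final class BoundedMaxValueState {

    // Stand-ins for processor relationships; STATE_MAP_FULL mirrors the idea above.
    enum Route { SUCCESS, STATE_MAP_FULL }

    private final int maxEntries;
    private final Map<String, String> state = new HashMap<>();

    BoundedMaxValueState(int maxEntries) {
        if (maxEntries < 1) {
            throw new IllegalArgumentException("maxEntries must be at least 1");
        }
        this.maxEntries = maxEntries;
    }

    // Existing keys can always be updated; new keys are refused once the cap is reached.
    Route update(String table, String column, String maxValue) {
        String key = table + "." + column;
        if (!state.containsKey(key) && state.size() >= maxEntries) {
            return Route.STATE_MAP_FULL;
        }
        state.put(key, maxValue);
        return Route.SUCCESS;
    }

    int size() {
        return state.size();
    }
}
```

    If entries really are around 64 bytes, a 1MB budget works out to roughly 
16,000 entries (1,048,576 / 64 = 16,384), so a cap comfortably below that 
might be a reasonable starting point.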


> Allow Database Fetch processor(s) to accept incoming flow files and use 
> Expression Language
> -------------------------------------------------------------------------------------------
>
>                 Key: NIFI-2881
>                 URL: https://issues.apache.org/jira/browse/NIFI-2881
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Matt Burgess
>            Assignee: Matt Burgess
>
> The QueryDatabaseTable and GenerateTableFetch processors do not allow 
> Expression Language to be used in the properties, mainly because they also do 
> not allow incoming connections. This means if the user desires to fetch from 
> multiple tables, they currently need one instance of the processor for each 
> table, and those table names must be hard-coded.
> To support the same capabilities for multiple tables and more flexible 
> configuration via Expression Language, these processors should have 
> properties that accept Expression Language, and GenerateTableFetch should 
> accept (optional) incoming connections.
> Conversation about the behavior of the processors is welcomed and encouraged. 
> For example, if an incoming flow file is available, do we also still run the 
> incremental fetch logic for tables that aren't specified by this flow file, 
> or do we just do incremental fetching when the processor is scheduled but 
> there is no incoming flow file? The latter implies a denial-of-service could 
> take place, by flooding the processor with flow files and not letting it do 
> its original job of querying the table, keeping track of maximum values, etc.
> This is likely a breaking change to the processors because of how state 
> management is implemented. Currently, since the table name is hard coded, only 
> the column name comprises the key in the state. This would have to be 
> extended to have a compound key that represents table name, max-value column 
> name, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
