Github user mattyb149 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/1407#discussion_r95801717
  
    --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/GenerateTableFetch.java ---
    @@ -115,20 +128,36 @@ public GenerateTableFetch() {
     
         @OnScheduled
         public void setup(final ProcessContext context) {
    +        // The processor is invalid if there is an incoming connection and max-value columns are defined
    +        if (context.getProperty(MAX_VALUE_COLUMN_NAMES).isSet() && context.hasIncomingConnection()) {
    +            throw new ProcessException("If an incoming connection is supplied, no max-value column names may be specified");
    --- End diff --
    
    I thought about supporting the older format, but that could lead to problems depending on which table name you pass in. Using your "users" and "purchase_histories" tables above, let's say I was running with the old version and a hard-coded "purchase_histories" table, which stores "last_updated" in the state map. Then with the new version, the first table name I pass in via an attribute is "users". I won't find "users.last_updated", so I would fall back to checking for just "last_updated", whose value is associated with the purchase_histories table rather than the users table. This is an edge case, but I would hate to see the very first run of the processor fail when it used to work.
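    To make that edge case concrete, here's a rough sketch (plain Java, not the processor's actual code) of what a naive fallback from the fully-qualified key to the legacy key would do; the "table.column" key format is just how I'd expect the new state keys to look:

    ```java
    import java.util.HashMap;
    import java.util.Map;

    public class StateKeyFallbackExample {

        // Look up the max value for a table/column, falling back to the legacy un-prefixed key.
        static String lookupMaxValue(Map<String, String> stateMap, String table, String column) {
            final String qualifiedKey = table.toLowerCase() + "." + column.toLowerCase();
            String value = stateMap.get(qualifiedKey);
            if (value == null) {
                // Legacy fallback: risky, because the bare column name may have been
                // written while fetching a completely different table.
                value = stateMap.get(column.toLowerCase());
            }
            return value;
        }

        public static void main(String[] args) {
            final Map<String, String> stateMap = new HashMap<>();
            // State written by the old version while fetching purchase_histories
            stateMap.put("last_updated", "2017-01-10 12:00:00");

            // First run of the new version with table name "users" from an attribute:
            // the fallback silently returns purchase_histories' max value.
            System.out.println(lookupMaxValue(stateMap, "users", "last_updated"));
        }
    }
    ```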
    
    I do like your example of a mapping of tables to max-value columns; I think other products (GoldenGate or Sqoop, perhaps?) allow for this flexibility (you just have to provide your own map). If we end up supporting max-value columns with incoming connections, I will make sure this capability is present.
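    Just to sketch the "provide your own map" idea, here's one way it could look; the property value format "table:col1,col2;table2:col3" is made up for illustration and is not an existing property:

    ```java
    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class MaxValueColumnMapExample {

        // Parse a hypothetical property value mapping table names to their max-value columns.
        static Map<String, List<String>> parseTableColumnMap(String propertyValue) {
            final Map<String, List<String>> tableColumns = new LinkedHashMap<>();
            for (String entry : propertyValue.split(";")) {
                final String[] parts = entry.split(":", 2);
                if (parts.length == 2) {
                    tableColumns.put(parts[0].trim().toLowerCase(), Arrays.asList(parts[1].split(",")));
                }
            }
            return tableColumns;
        }

        public static void main(String[] args) {
            System.out.println(parseTableColumnMap("users:last_updated;purchase_histories:purchase_date,id"));
        }
    }
    ```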
    
    I'm most worried about an arbitrary number of entries in the state map. Once the total size gets above 1 MB, I think ZooKeeper starts acting strangely, and I certainly wouldn't want this processor to affect the entire NiFi system. This too is an edge case, since I imagine most entries are small (~64 bytes max?), so it would only happen if the number of tables/columns was very large, or if for some reason the max values were large. Historically I've seen limits placed on other NiFi resources (threads, for example) to ensure a discrete maximum and avoid the issues that come with arbitrarily large things.
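    For context, a back-of-the-envelope sketch of the size concern; the ~1 MB figure is roughly ZooKeeper's default znode size limit (jute.maxbuffer), and the 80% headroom is an arbitrary number for illustration:

    ```java
    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    public class StateSizeEstimate {

        static final long ZK_DEFAULT_ZNODE_LIMIT_BYTES = 1_000_000L; // roughly jute.maxbuffer's default

        // Rough size of the serialized state map: sum of UTF-8 key and value lengths.
        static long approximateSize(Map<String, String> stateMap) {
            long bytes = 0;
            for (Map.Entry<String, String> e : stateMap.entrySet()) {
                bytes += e.getKey().getBytes(StandardCharsets.UTF_8).length
                        + e.getValue().getBytes(StandardCharsets.UTF_8).length;
            }
            return bytes;
        }

        // Leave headroom for serialization overhead and any other state stored alongside.
        static boolean nearLimit(Map<String, String> stateMap) {
            return approximateSize(stateMap) > ZK_DEFAULT_ZNODE_LIMIT_BYTES * 0.8;
        }

        public static void main(String[] args) {
            final Map<String, String> stateMap = new HashMap<>();
            stateMap.put("users.last_updated", "2017-01-10 12:00:00");
            System.out.println(approximateSize(stateMap) + " bytes, near limit: " + nearLimit(stateMap));
        }
    }
    ```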
    
    I would very much like to allow for max-value columns; do you have any suggestions on how the processor should behave in the face of an arbitrarily large state map? Perhaps we could set an artificial limit on the number of entries (implying a limit on the number of tables/columns) and route future flow files (whose table is not yet present in the state map) to a "state map full" relationship or something.
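    Roughly what I mean, as a sketch only (MAX_STATE_ENTRIES and REL_STATE_MAP_FULL are made up here and don't exist in the processor today):

    ```java
    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;

    import java.util.Map;

    public class StateMapCapSketch {

        static final int MAX_STATE_ENTRIES = 10_000; // arbitrary cap for illustration

        static final Relationship REL_STATE_MAP_FULL = new Relationship.Builder()
                .name("state map full")
                .description("Flow files whose table is not yet tracked are routed here once the "
                        + "state map has reached its configured maximum number of entries")
                .build();

        // Route the flow file away if its table is not tracked yet and the cap has been reached.
        static boolean routeIfFull(ProcessSession session, FlowFile flowFile,
                                   Map<String, String> stateMap, String qualifiedStateKey) {
            if (!stateMap.containsKey(qualifiedStateKey) && stateMap.size() >= MAX_STATE_ENTRIES) {
                session.transfer(flowFile, REL_STATE_MAP_FULL);
                return true;
            }
            return false;
        }
    }
    ```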

