Ryan,

The 10 seconds appears to be hard-coded in the processor, although it looks like it could be turned into a configurable property.
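
As a rough illustration of the pooling idea (this is not the actual FetchSFTP code), here is a minimal Python sketch of a connection cache that reuses a connection only if it was last used within an idle timeout. The names IdleTimeoutCache, FakeConnection, and idle_timeout_seconds are hypothetical; the point is only that the 10 second limit could be a constructor argument rather than a constant.

```python
import time


class IdleTimeoutCache:
    """Keeps one connection per key and reuses it while it stays recently used.

    Hypothetical sketch: not NiFi's implementation, only the general technique.
    """

    def __init__(self, open_connection, idle_timeout_seconds=10.0,
                 clock=time.monotonic):
        self._open = open_connection            # assumed factory: key -> connection
        self._idle_timeout = idle_timeout_seconds
        self._clock = clock                     # injectable so it can be tested
        self._entries = {}                      # key -> (connection, last_used)

    def acquire(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None:
            conn, last_used = entry
            if now - last_used <= self._idle_timeout:
                # Used recently enough: refresh the timestamp and reuse it.
                self._entries[key] = (conn, now)
                return conn
            # Idle for too long: close it and fall through to reconnect.
            conn.close()
        conn = self._open(key)
        self._entries[key] = (conn, now)
        return conn


class FakeConnection:
    """Stand-in for a real SFTP connection, for illustration only."""

    opened = 0

    def __init__(self, key):
        FakeConnection.opened += 1
        self.key = key
        self.closed = False

    def close(self):
        self.closed = True
```

With a cache like this, a task that keeps fetching files faster than the timeout holds on to one connection, while a task that goes quiet for longer pays the reconnect cost on its next fetch.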

FetchSFTP would need a code change to grab a batch of flow files during a single execution. In theory it shouldn't make much of a difference, but it might be an interesting experiment. It does make the code more challenging to write, though that's not a reason not to do it.

With a 5 node cluster, are you doing the List on the primary node and then redistributing the results to all the nodes via an RPG so that all nodes can fetch?

-Bryan

On Tue, Oct 31, 2017 at 3:43 PM, Ryan Ward <[email protected]> wrote:
> Joe/Bryan Thanks!
>
> I believe the one specific file per concurrent task/connection (and too
> many threads) is the issue I have. We have a lot of small files and are
> often backed up. I'm going to drop the task count to take advantage of the
> pooling. Is it possible to have Fetch do batches vs a single file? Would
> that improve throughput? Also is that 10 seconds configurable?
>
> Some background: I'm converting 2 single nodes into a 5 node cluster and
> trying to figure out the best approach.
>
> Thanks again!
>
>
>
> On Tue, Oct 31, 2017 at 2:56 PM, Bryan Bende <[email protected]> wrote:
>
>> Ryan,
>>
>> Personally I don't have experience running these processors at scale,
>> but from a code perspective they are fundamentally different...
>>
>> GetSFTP is a source processor, meaning it is not being fed by an upstream
>> connection, so when it executes it can create a connection and
>> retrieve up to max-selects during that one execution.
>>
>> FetchSFTP is being told to fetch one specific file, typically through
>> attributes on incoming flow files, so the concept of max-selects
>> doesn't really apply because there is only one thing to select during an
>> execution of the processor.
>>
>> FetchSFTP does employ connection pooling behind the scenes such that
>> it will keep open a connection for each concurrent task, as long as
>> each connection continues to be used within 10 seconds.
>>
>> -Bryan
>>
>>
>> On Tue, Oct 31, 2017 at 11:43 AM, Joe Witt <[email protected]> wrote:
>> > Ryan - don't know the code specifics behind FetchSFTP off-hand but I
>> > can confirm there are users at that range for it.
>> >
>> > Thanks
>> >
>> > On Tue, Oct 31, 2017 at 11:38 AM, Ryan Ward <[email protected]> wrote:
>> >> I've found that on a single node GetSFTP is able to pull more files off a
>> >> remote server than Fetch in a cluster. I noticed Fetch doesn't have a max
>> >> selects so it is requiring way more connections (one per file?) and
>> >> concurrent threads to keep up.
>> >>
>> >> Was wondering if anyone is using List/Fetch at scale? In the multi-TB a
>> >> day range?
>> >>
>> >> Thanks,
>> >> Ryan
>>
