Yep, that's exactly how I have it set up, with a push to an RPG. Is that preferred? I just started playing with it, to be honest. I can see how it could be tricky if you have to pull from multiple servers: each flow file in the queues could potentially have a different SFTP host address.
Altogether we have to pull from about 60 servers. If this doesn't work out with List/Fetch, I plan to have a micro acquisition cluster just for gets.

Ryan

On Oct 31, 2017 4:26 PM, "Bryan Bende" <[email protected]> wrote:

> Ryan,
>
> The 10 seconds appears to be a hard-coded rule in the processor,
> although it seems like it could be turned into a configurable
> property.
>
> It would require a code change to make it grab a batch of flow files
> during a single execution. In theory it shouldn't provide that much of
> a difference, but might be an interesting experiment. It makes the
> code more challenging to write though, not that that's a reason not to
> do it.
>
> If you have a 5 node cluster, you are doing List on primary node and
> then redistributing the results to all the nodes via an RPG so all
> nodes can fetch?
>
> -Bryan
>
>
> On Tue, Oct 31, 2017 at 3:43 PM, Ryan Ward <[email protected]> wrote:
> > Joe/Bryan Thanks!
> >
> > I believe the one specific file per concurrent task/connection (and too
> > many threads) is the issue I have. We have a lot of small files and are
> > often backed up. I'm going to drop the task count to take advantage of
> > the pooling. Is it possible to have Fetch do batches vs a single file?
> > Would that improve throughput? Also, is that 10 seconds configurable?
> >
> > Some background: I'm converting 2 single nodes into a 5 node cluster and
> > trying to figure out the best approach.
> >
> > Thanks again!
> >
> > On Tue, Oct 31, 2017 at 2:56 PM, Bryan Bende <[email protected]> wrote:
> >
> >> Ryan,
> >>
> >> Personally I don't have experience running these processors at scale,
> >> but from a code perspective they are fundamentally different...
> >>
> >> GetSFTP is a source processor, meaning it is not being fed by an
> >> upstream connection, so when it executes it can create a connection
> >> and retrieve up to max-selects during that one execution.
> >>
> >> FetchSFTP is being told to fetch one specific file, typically through
> >> attributes on incoming flow files, so the concept of max-selects
> >> doesn't really apply because there is only one thing to select during
> >> an execution of the processor.
> >>
> >> FetchSFTP does employ connection pooling behind the scenes such that
> >> it will keep open a connection for each concurrent task, as long as
> >> each connection continues to be used within 10 seconds.
> >>
> >> -Bryan
> >>
> >>
> >> On Tue, Oct 31, 2017 at 11:43 AM, Joe Witt <[email protected]> wrote:
> >> > Ryan - don't know the code specifics behind FetchSFTP off-hand but I
> >> > can confirm there are users at that range for it.
> >> >
> >> > Thanks
> >> >
> >> > On Tue, Oct 31, 2017 at 11:38 AM, Ryan Ward <[email protected]> wrote:
> >> >> I've found that on a single node GetSFTP is able to pull more files
> >> >> off a remote server than Fetch in a cluster. I noticed Fetch doesn't
> >> >> have a max selects, so it is requiring way more connections (one per
> >> >> file?) and concurrent threads to keep up.
> >> >>
> >> >> Was wondering if anyone is using List/Fetch at scale? In the
> >> >> multi-TB a day range?
> >> >>
> >> >> Thanks,
> >> >> Ryan
> >> >
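
For anyone following the thread, here is a rough sketch of the pooling behaviour Bryan describes: a connection is reused as long as it was last used within 10 seconds, and is reopened otherwise. This is not NiFi's code. The class name `IdleExpiringPool`, the host-keyed cache, the `connect` callback, the fake clock, and the example host names are all assumptions made up for illustration, and whether the real cutoff is inclusive at exactly 10 seconds is a guess.

```python
import time


class IdleExpiringPool:
    """Reuse one connection per key until it has sat idle for too long.

    Hypothetical illustration only; not the FetchSFTP implementation.
    """

    def __init__(self, connect, max_idle_seconds=10.0, clock=time.monotonic):
        self._connect = connect            # callable(key) -> connection object
        self._max_idle = max_idle_seconds  # 10 s mirrors the value in the thread
        self._clock = clock                # injectable so the demo needs no sleeping
        self._entries = {}                 # key -> (connection, last_used_time)
        self.opened = 0                    # how many connections were ever opened

    def acquire(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] <= self._max_idle:
            conn = entry[0]                # used recently enough: reuse it
        else:
            if entry is not None:
                # Stale connection: close it if the object supports closing.
                close = getattr(entry[0], "close", None)
                if callable(close):
                    close()
            conn = self._connect(key)      # first use or expired: open a new one
            self.opened += 1
        self._entries[key] = (conn, now)   # record this use as the latest activity
        return conn


class FakeClock:
    """Manually advanced clock standing in for time.monotonic."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


def demo():
    clock = FakeClock()
    pool = IdleExpiringPool(connect=lambda key: object(), clock=clock)
    hosts = ["sftp-a.example", "sftp-b.example"]  # made-up host names

    # Flow files alternate between two hosts, one arriving every 4 seconds,
    # so each host's connection is touched every 8 seconds and stays warm.
    for step in range(6):
        pool.acquire(hosts[step % 2])
        clock.now += 4.0
    opened_while_busy = pool.opened

    # A 30 second lull lets the connections go stale, forcing a reconnect.
    clock.now += 30.0
    pool.acquire(hosts[0])
    return opened_while_busy, pool.opened


if __name__ == "__main__":
    print(demo())
```

With about 60 source servers spread across the queue, the gap between two flow files for the same host can easily exceed the idle window, in which case a pool like this would reconnect far more often than the two-host demo suggests.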
