The list-fetch approach sounds correct, and the micro acquisition cluster (if necessary) also sounds like a good idea.
Regarding multiple hosts, the connection pooling in FetchSFTP does account for that. It's basically a map from the hostname string to a holder of connections for that hostname.

-Bryan

On Tue, Oct 31, 2017 at 7:55 PM, Ryan Ward <[email protected]> wrote:
> Yep that's exactly how I have it set up with a push to RPG. Is that
> preferred? I just started playing with it to be honest. I can see how it
> could be tricky if you have to pull from multiple servers: each flow file
> could potentially have a different sftp host address in the queues.
>
> All together we have to pull from about 60 servers. If this doesn't work
> out with the list/fetch I plan to have a micro acquisition cluster just
> for gets.
>
> Ryan
>
> On Oct 31, 2017 4:26 PM, "Bryan Bende" <[email protected]> wrote:
>
>> Ryan,
>>
>> The 10 seconds appears to be a hard-coded rule in the processor,
>> although it seems like it could be turned into a configurable
>> property.
>>
>> It would require a code change to make it grab a batch of flow files
>> during a single execution. In theory it shouldn't provide that much of
>> a difference, but might be an interesting experiment. It makes the
>> code more challenging to write though, not that that's a reason not to
>> do it.
>>
>> If you have a 5 node cluster, you are doing List on primary node and
>> then redistributing the results to all the nodes via an RPG so all
>> nodes can fetch?
>>
>> -Bryan
>>
>>
>> On Tue, Oct 31, 2017 at 3:43 PM, Ryan Ward <[email protected]> wrote:
>> > Joe/Bryan Thanks!
>> >
>> > I believe the one specific file per concurrent task/connection (and too
>> > many threads) is the issue I have; we have a lot of small files and are
>> > oftentimes backed up. I'm going to drop the task count to take advantage
>> > of the pooling. Is it possible to have Fetch do batches vs a single file?
>> > Would that improve throughput? Also is that 10 seconds configurable?
>> >
>> > Some background: I'm converting 2 single nodes into a 5 node cluster and
>> > trying to figure out the best approach.
>> >
>> > Thanks again!
>> >
>> >
>> >
>> > On Tue, Oct 31, 2017 at 2:56 PM, Bryan Bende <[email protected]> wrote:
>> >
>> >> Ryan,
>> >>
>> >> Personally I don't have experience running these processors at scale,
>> >> but from a code perspective they are fundamentally different...
>> >>
>> >> GetSFTP is a source processor, meaning it is not being fed by an
>> >> upstream connection, so when it executes it can create a connection
>> >> and retrieve up to max-selects during that one execution.
>> >>
>> >> FetchSFTP is being told to fetch one specific file, typically through
>> >> attributes on incoming flow files, so the concept of max-selects
>> >> doesn't really apply because there is only one thing to select during
>> >> an execution of the processor.
>> >>
>> >> FetchSFTP does employ connection pooling behind the scenes such that
>> >> it will keep open a connection for each concurrent task, as long as
>> >> each connection continues to be used within 10 seconds.
>> >>
>> >> -Bryan
>> >>
>> >>
>> >> On Tue, Oct 31, 2017 at 11:43 AM, Joe Witt <[email protected]> wrote:
>> >> > Ryan - don't know the code specifics behind FetchSFTP off-hand but I
>> >> > can confirm there are users at that range for it.
>> >> >
>> >> > Thanks
>> >> >
>> >> > On Tue, Oct 31, 2017 at 11:38 AM, Ryan Ward <[email protected]> wrote:
>> >> >> I've found that on a single node GetSFTP is able to pull more files
>> >> >> off a remote server than Fetch in a cluster. I noticed Fetch doesn't
>> >> >> have a max selects so it is requiring way more connections (one per
>> >> >> file?) and concurrent threads to keep up.
>> >> >>
>> >> >> Was wondering if anyone is using List/Fetch at scale? In the multi
>> >> >> TB's a day range?
>> >> >>
>> >> >> Thanks,
>> >> >> Ryan
>> >> >>
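For anyone reading the thread later, here is a minimal sketch of the pooling idea described above: a map from hostname to a holder of idle connections, where a connection is reused only if it was last used within the idle limit (10 seconds in the thread). This is not the actual FetchSFTP code. The class name `HostConnectionPool`, the methods `acquire` and `release`, the `connector` function, and the explicit `nowMillis` argument are all hypothetical names chosen for illustration, and a real pool would also close the connections it discards.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a per-host connection pool; not NiFi's implementation. */
class HostConnectionPool<C> {

    /** An idle connection together with the time it was handed back. */
    private static final class Idle<C> {
        final C connection;
        final long returnedAtMillis;

        Idle(C connection, long returnedAtMillis) {
            this.connection = connection;
            this.returnedAtMillis = returnedAtMillis;
        }
    }

    // Hostname string -> holder of idle connections for that hostname.
    private final Map<String, Deque<Idle<C>>> idleByHost = new HashMap<>();
    private final Function<String, C> connector; // opens a new connection to a host
    private final long maxIdleMillis;            // e.g. 10_000 for the 10 seconds above

    HostConnectionPool(Function<String, C> connector, long maxIdleMillis) {
        this.connector = connector;
        this.maxIdleMillis = maxIdleMillis;
    }

    /** Reuse a recently used connection for this host, or open a new one. */
    synchronized C acquire(String host, long nowMillis) {
        Deque<Idle<C>> idle = idleByHost.get(host);
        while (idle != null && !idle.isEmpty()) {
            Idle<C> candidate = idle.pop();
            if (nowMillis - candidate.returnedAtMillis < maxIdleMillis) {
                return candidate.connection;
            }
            // Idle for too long: dropped here (a real pool would close it).
        }
        return connector.apply(host);
    }

    /** Hand a connection back so a later task for the same host can reuse it. */
    synchronized void release(String host, C connection, long nowMillis) {
        idleByHost.computeIfAbsent(host, h -> new ArrayDeque<>())
                  .push(new Idle<>(connection, nowMillis));
    }
}
```

Because the map is keyed by hostname, flow files that point at different SFTP servers never share a connection, which is the multi-host case raised in the thread. Passing the clock in as `nowMillis` is only a convenience that keeps the sketch easy to exercise without sleeping.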
