Re: FetchSFTP vs GetSFTP
The List/Fetch approach sounds correct, and the micro acquisition cluster (if necessary) also sounds like a good idea.

Regarding multiple hosts, the connection pooling in FetchSFTP does account for that. It's basically a map from the hostname string to a holder of connections for that hostname.

-Bryan

On Tue, Oct 31, 2017 at 7:55 PM, Ryan Ward wrote:
> [...]
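To picture the "map from the hostname string to a holder of connections" described above, here is a minimal sketch. It is not the FetchSFTP source: `HostKeyedPool`, `Connection`, and the host names are hypothetical names invented for illustration, and a real pool would also have to handle ports, credentials, and closing connections.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a pool keyed by hostname; not NiFi code.
class HostKeyedPool {

    /** Stand-in for an open SFTP session to one host (assumption). */
    static final class Connection {
        final String host;
        Connection(String host) { this.host = host; }
    }

    // hostname -> holder of idle connections for that hostname
    private final Map<String, BlockingQueue<Connection>> idleByHost = new ConcurrentHashMap<>();
    private final AtomicInteger opened = new AtomicInteger();

    /** Reuses an idle connection for the host if one exists, otherwise opens a new one. */
    Connection acquire(String host) {
        BlockingQueue<Connection> idle =
                idleByHost.computeIfAbsent(host, h -> new LinkedBlockingQueue<>());
        Connection c = idle.poll();
        if (c == null) {
            opened.incrementAndGet();
            c = new Connection(host);
        }
        return c;
    }

    /** Returns a connection to the holder for its own host. */
    void release(Connection c) {
        idleByHost.computeIfAbsent(c.host, h -> new LinkedBlockingQueue<>()).offer(c);
    }

    int openedCount() { return opened.get(); }

    public static void main(String[] args) {
        HostKeyedPool pool = new HostKeyedPool();
        Connection a = pool.acquire("sftp-a.example.com");
        pool.release(a);
        Connection again = pool.acquire("sftp-a.example.com");
        Connection b = pool.acquire("sftp-b.example.com");
        System.out.println("same host reused: " + (a == again));
        System.out.println("different host shares connection: " + (a == b));
    }
}
```

Under this layout, flow files that name different SFTP hosts simply land in different holders, which is why a queue with mixed host addresses is not a problem for pooling in itself.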
Re: FetchSFTP vs GetSFTP
Yep, that's exactly how I have it set up, with a push to an RPG. Is that preferred? I just started playing with it, to be honest. I can see how it could be tricky if you have to pull from multiple servers: each flow file in the queues could potentially have a different SFTP host address.

Altogether we have to pull from about 60 servers. If this doesn't work out with List/Fetch, I plan to have a micro acquisition cluster just for gets.

Ryan

On Oct 31, 2017 4:26 PM, "Bryan Bende" wrote:
> Ryan,
>
> The 10 seconds appears to be a hard-coded rule in the processor,
> although it seems like it could be turned into a configurable
> property.
>
> It would require a code change to make it grab a batch of flow files
> during a single execution. In theory it shouldn't make that much of
> a difference, but it might be an interesting experiment. It makes the
> code more challenging to write though, not that that's a reason not to
> do it.
>
> If you have a 5 node cluster, you are doing List on the primary node and
> then redistributing the results to all the nodes via an RPG so all
> nodes can fetch?
>
> -Bryan
>
> On Tue, Oct 31, 2017 at 3:43 PM, Ryan Ward wrote:
> > [...]
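The "grab a batch of flow files during a single execution" idea from Bryan's quoted message can be sketched with a plain queue. This is only an illustration under assumptions: `BatchTriggerSketch`, `triggerSingle`, `triggerBatch`, and the file paths are hypothetical, and nothing here touches the NiFi API. (NiFi's `ProcessSession` does have a `get(int maxResults)` overload, but the thread does not settle whether FetchSFTP could use it cleanly.)

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical comparison of one-file-per-execution vs. batch-per-execution.
class BatchTriggerSketch {

    /** One execution handles a single queued path (the behaviour described for FetchSFTP). */
    static List<String> triggerSingle(Queue<String> incoming) {
        List<String> handled = new ArrayList<>();
        String path = incoming.poll();
        if (path != null) {
            handled.add(path);
        }
        return handled;
    }

    /** One execution drains up to batchSize queued paths (the hypothetical code change). */
    static List<String> triggerBatch(Queue<String> incoming, int batchSize) {
        List<String> handled = new ArrayList<>();
        String path;
        while (handled.size() < batchSize && (path = incoming.poll()) != null) {
            handled.add(path);
        }
        return handled;
    }

    /** Counts how many executions it takes to empty a queue holding `queued` paths. */
    static int executionsToDrain(int queued, int batchSize) {
        Queue<String> q = new ArrayDeque<>();
        for (int i = 0; i < queued; i++) {
            q.add("/data/file-" + i); // made-up paths
        }
        int executions = 0;
        while (!q.isEmpty()) {
            if (batchSize <= 1) {
                triggerSingle(q);
            } else {
                triggerBatch(q, batchSize);
            }
            executions++;
        }
        return executions;
    }

    public static void main(String[] args) {
        System.out.println("single: " + executionsToDrain(100, 1) + " executions");
        System.out.println("batch of 25: " + executionsToDrain(100, 25) + " executions");
    }
}
```

The sketch only counts executions. Whether fewer executions would translate into better throughput is exactly the open question in the message, since the pooled connection is already reused between executions.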
Re: FetchSFTP vs GetSFTP
Joe/Bryan, thanks!

I believe the one specific file per concurrent task/connection (and too many threads) is the issue I have. We have a lot of small files and are often backed up. I'm going to drop the task count to take advantage of the pooling. Is it possible to have Fetch do batches instead of a single file? Would that improve throughput? Also, is that 10 seconds configurable?

Some background: I'm converting 2 single nodes into a 5 node cluster and trying to figure out the best approach.

Thanks again!

On Tue, Oct 31, 2017 at 2:56 PM, Bryan Bende wrote:
> [...]
Re: FetchSFTP vs GetSFTP
Ryan,

Personally I don't have experience running these processors at scale, but from a code perspective they are fundamentally different.

GetSFTP is a source processor, meaning it is not fed by an upstream connection, so when it executes it can create a connection and retrieve up to max-selects files during that one execution.

FetchSFTP is told to fetch one specific file, typically through attributes on incoming flow files, so the concept of max-selects doesn't really apply because there is only one thing to select during an execution of the processor.

FetchSFTP does employ connection pooling behind the scenes, such that it will keep a connection open for each concurrent task as long as each connection continues to be used within 10 seconds.

-Bryan

On Tue, Oct 31, 2017 at 11:43 AM, Joe Witt wrote:
> [...]
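The "kept open as long as it is used within 10 seconds" behaviour can be sketched as an idle-expiry check at acquire time. This is a guess at the general technique, not the processor's actual code: `IdleExpiringPool`, `Pooled`, and the injected clock are hypothetical, and treating exactly 10 seconds as still valid is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

// Hypothetical idle-expiry pool; the 10-second limit is the figure quoted in the thread.
class IdleExpiringPool {
    static final long IDLE_LIMIT_MS = 10_000L;

    /** Stand-in for a pooled SFTP connection (assumption). */
    static final class Pooled {
        final int id;
        long lastUsedMs;
        boolean closed;
        Pooled(int id, long nowMs) {
            this.id = id;
            this.lastUsedMs = nowMs;
        }
    }

    private final Deque<Pooled> idle = new ArrayDeque<>();
    private final LongSupplier clock; // injected so the sketch can be exercised without sleeping
    private int nextId = 1;

    IdleExpiringPool(LongSupplier clock) {
        this.clock = clock;
    }

    /** Reuses an idle connection that was used recently enough; closes stale ones. */
    synchronized Pooled acquire() {
        long now = clock.getAsLong();
        Pooled p;
        while ((p = idle.pollFirst()) != null) {
            if (now - p.lastUsedMs <= IDLE_LIMIT_MS) {
                return p;
            }
            p.closed = true; // idle for too long: close and discard
        }
        return new Pooled(nextId++, now);
    }

    /** Records the time of last use and makes the connection available again. */
    synchronized void release(Pooled p) {
        p.lastUsedMs = clock.getAsLong();
        idle.addFirst(p);
    }

    public static void main(String[] args) {
        long[] now = {0L};
        IdleExpiringPool pool = new IdleExpiringPool(() -> now[0]);
        Pooled first = pool.acquire();
        pool.release(first);
        now[0] = 5_000L;  // 5 s later: still warm
        Pooled second = pool.acquire();
        pool.release(second);
        now[0] = 16_000L; // 11 s after the last use: stale
        Pooled third = pool.acquire();
        System.out.println("reused within limit: " + (first == second));
        System.out.println("reused after limit: " + (first == third));
    }
}
```

If the pool behaves roughly like this, a queue that keeps each concurrent task busy keeps its connection warm, while more tasks than the queue can feed would tend to let connections sit idle past the limit and be reopened.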
Re: FetchSFTP vs GetSFTP
Ryan - I don't know the code specifics behind FetchSFTP offhand, but I can confirm there are users at that range for it.

Thanks

On Tue, Oct 31, 2017 at 11:38 AM, Ryan Ward wrote:
> [...]
FetchSFTP vs GetSFTP
I've found that on a single node GetSFTP is able to pull more files off a remote server than Fetch in a cluster. I noticed Fetch doesn't have a max selects setting, so it requires way more connections (one per file?) and concurrent threads to keep up.

Was wondering if anyone is using List/Fetch at scale? In the multi-TB-a-day range?

Thanks,
Ryan