Re: FetchSFTP vs GetSFTP

Ryan Ward Tue, 31 Oct 2017 16:56:40 -0700

Yep, that's exactly how I have it set up, with a push to an RPG. Is that
preferred? I just started playing with it, to be honest. I can see how it
could be tricky if you have to pull from multiple servers, since each flow
file in the queues could potentially have a different SFTP host address.


Altogether we have to pull from about 60 servers. If this doesn't work
out with List/Fetch, I plan to have a micro acquisition cluster just
for gets.

Ryan

On Oct 31, 2017 4:26 PM, "Bryan Bende" <[email protected]> wrote:

> Ryan,
>
> The 10 seconds appears to be a hard-coded rule in the processor,
> although it seems like it could be turned into a configurable
> property.
>
> It would require a code change to make it grab a batch of flow files
> during a single execution. In theory it shouldn't provide that much of
> a difference, but might be an interesting experiment. It makes the
> code more challenging to write though, not that that's a reason not to
> do it.
>
> If you have a 5 node cluster, you are doing List on primary node and
> then redistributing the results to all the nodes via an RPG so all
> nodes can fetch?
>
> -Bryan
>
>
> On Tue, Oct 31, 2017 at 3:43 PM, Ryan Ward <[email protected]> wrote:
> > Joe/Bryan, thanks!
> >
> > I believe the one specific file per concurrent task/connection (and too
> > many threads) is the issue I have. We have a lot of small files and are
> > often backed up. I'm going to drop the task count to take advantage of
> > the pooling. Is it possible to have Fetch do batches vs. a single file?
> > Would that improve throughput? Also, is that 10 seconds configurable?
> >
> > Some background: I'm converting 2 single nodes into a 5 node cluster and
> > trying to figure out the best approach.
> >
> > Thanks again!
> >
> >
> >
> > On Tue, Oct 31, 2017 at 2:56 PM, Bryan Bende <[email protected]> wrote:
> >
> >> Ryan,
> >>
> >> Personally I don't have experience running these processors at scale,
> >> but from a code perspective they are fundamentally different...
> >>
> >> GetSFTP is a source processor, meaning it is not being fed by an upstream
> >> connection, so when it executes it can create a connection and
> >> retrieve up to max-selects during that one execution.
> >>
> >> FetchSFTP is being told to fetch one specific file, typically through
> >> attributes on incoming flow files, so the concept of max-selects
> >> doesn't really apply because there is only one thing to select during an
> >> execution of the processor.
> >>
> >> FetchSFTP does employ connection pooling behind the scenes such that
> >> it will keep open a connection for each concurrent task, as long as
> >> each connection continues to be used within 10 seconds.
> >>
> >> -Bryan
> >>
> >>
> >> On Tue, Oct 31, 2017 at 11:43 AM, Joe Witt <[email protected]> wrote:
> >> > Ryan - I don't know the code specifics behind FetchSFTP off-hand, but I
> >> > can confirm there are users at that range for it.
> >> >
> >> > Thanks
> >> >
> >> > On Tue, Oct 31, 2017 at 11:38 AM, Ryan Ward <[email protected]> wrote:
> >> >> I've found that on a single node GetSFTP is able to pull more files off
> >> >> a remote server than Fetch in a cluster. I noticed Fetch doesn't have a
> >> >> max selects, so it requires far more connections (one per file?) and
> >> >> concurrent threads to keep up.
> >> >>
> >> >> Was wondering if anyone is using List/Fetch at scale? In the multi-TB
> >> >> a day range?
> >> >>
> >> >> Thanks,
> >> >> Ryan
> >>
>
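
The pooling behaviour Bryan describes (keep a connection open and reuse it as long as it is used again within 10 seconds) can be sketched roughly as below. This is only an illustration of the idea, not NiFi's actual code: the class name, the `connect` factory and the injectable `clock` are all hypothetical, and the sketch keys connections by host (rather than by concurrent task) to show how flow files pointing at different SFTP hosts would each get their own pooled connection.

```python
import time
from typing import Callable, Dict, Tuple


class FakeConnection:
    """Stand-in for a real SFTP connection (hypothetical, for illustration)."""

    def __init__(self, host: str) -> None:
        self.host = host
        self.closed = False

    def close(self) -> None:
        self.closed = True


class IdleConnectionPool:
    """Keeps one connection per host and replaces any left idle too long."""

    def __init__(
        self,
        connect: Callable[[str], FakeConnection],
        idle_timeout: float = 10.0,  # the 10 seconds mentioned in the thread
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._connect = connect  # hypothetical factory: host -> connection
        self._idle_timeout = idle_timeout
        self._clock = clock  # injectable so expiry can be exercised in tests
        # host -> (connection, time it was last handed out)
        self._pool: Dict[str, Tuple[FakeConnection, float]] = {}

    def acquire(self, host: str) -> FakeConnection:
        """Return a connection for host, reusing it if it was used recently."""
        now = self._clock()
        entry = self._pool.get(host)
        if entry is not None:
            conn, last_used = entry
            if now - last_used <= self._idle_timeout:
                self._pool[host] = (conn, now)  # refresh the idle timer
                return conn
            conn.close()  # idle for too long: drop it and reconnect
        conn = self._connect(host)
        self._pool[host] = (conn, now)
        return conn
```

With many small files from the same host arriving less than 10 seconds apart, every `acquire` after the first reuses the open connection, which is why a lower task count can still keep up. With around 60 hosts, such a pool would hold up to one connection per host that has been used recently.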
