B O created NIFI-5157:
-------------------------

             Summary: ListSFTP for Massive Folders (without freezing)
                 Key: NIFI-5157
                 URL: https://issues.apache.org/jira/browse/NIFI-5157
             Project: Apache NiFi
          Issue Type: Improvement
          Components: Core Framework
    Affects Versions: 1.3.0
            Reporter: B O


Currently, if ListSFTP is used on a folder with millions and millions of files 
and the Primary Node has only 32GB of RAM, creating that many flowfiles (say, 
more than 40 million) can freeze the ListSFTP threads, and the Primary Node 
then has to be restarted.

This happens when, say, another system sends files to your system and a 
backlog of tens of millions of files eventually builds up. Recursion doesn't 
help even if the files are separated by folder; you would still need some sort 
of "ControlRate"-like processor that passes flowfiles into ListSFTP, so that 
ListSFTP knows when to get files rather than only triggering on its own.
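
As a rough illustration of listing in bounded chunks rather than all at once, 
here is a minimal Python sketch. It is not NiFi code: the name 
batched_listing and the batch_size parameter are hypothetical, and it assumes 
the remote listing can be consumed lazily as an iterator.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched_listing(entries: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size entries.

    Only one batch is held in memory at a time, so the full listing is
    never materialized, provided `entries` is itself lazy.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(entries)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Stand-in for a lazy remote listing (hypothetical file names).
    fake_listing = (f"file-{i}.dat" for i in range(5))
    for batch in batched_listing(fake_listing, 2):
        print(batch)
```

With a bounded batch size, the memory needed at any moment depends on the 
batch rather than on the size of the backlog.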

Also, there seem to be situations where NiFi assumes a stable environment. In 
unstable ones, where memory hardware failures, SFTP transmission problems, or 
internet outages happen, it becomes difficult to recover an ingest or to know 
where you left off (which might be useful for ListSFTP):

Batch processing usually requires a system to separate the work into some X 
number of files/folders that can fit into the RAM of the Primary Node. We may 
need some kind of feature like SQL transactions, with "commit" and "rollback 
in case of error". There needs to be an efficient way for small systems to 
take in large volumes of data without crashing. If crashes are inevitable, 
then there should be some sort of batch transaction that can tell you where it 
left off, so that you don't have to pull the same folder again but only what 
comes after, say, File Age = some-number. I should be able to log in tomorrow 
and say "oh, my ingest totally collapsed, but at least I know roughly where it 
left off." This matters especially when WAL recovery is impossible because 
socket connection issues between nodes (or active site-to-site connections) 
cause some NiFi nodes to refuse to load or recover their state.
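
To make the commit/rollback idea concrete, here is a minimal Python sketch of 
a checkpointed batch ingest. It is not how NiFi stores state; the names 
load_checkpoint, commit_checkpoint, and ingest are hypothetical, the 
checkpoint is a small JSON file, and entries are assumed to arrive sorted by 
(modification time, path).

```python
import json
import os
import tempfile
from typing import Callable, Iterable, List, Tuple

Entry = Tuple[float, str]  # (modification time, path)


def load_checkpoint(path: str) -> Entry:
    """Return the last committed entry, or a sentinel if none exists yet."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            data = json.load(fh)
        return (float(data["mtime"]), str(data["path"]))
    except FileNotFoundError:
        return (float("-inf"), "")


def commit_checkpoint(path: str, entry: Entry) -> None:
    """Atomically record `entry` as the last fully processed item."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump({"mtime": entry[0], "path": entry[1]}, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic rename over the old checkpoint
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def ingest(entries: Iterable[Entry], checkpoint_path: str, batch_size: int,
           process: Callable[[List[Entry]], None]) -> int:
    """Process entries newer than the checkpoint, committing after each batch.

    If `process` raises, the checkpoint is not advanced (the "rollback"),
    so the failed batch is picked up again on the next run.
    """
    last = load_checkpoint(checkpoint_path)
    batch: List[Entry] = []
    committed = 0
    for entry in entries:
        if entry <= last:
            continue  # already committed in an earlier run
        batch.append(entry)
        if len(batch) == batch_size:
            process(batch)
            commit_checkpoint(checkpoint_path, batch[-1])
            committed += len(batch)
            batch = []
    if batch:
        process(batch)
        commit_checkpoint(checkpoint_path, batch[-1])
        committed += len(batch)
    return committed
```

After a crash, the checkpoint file shows the last entry that was fully 
processed, and a rerun skips everything up to and including it. Comparing the 
(mtime, path) pair rather than the timestamp alone avoids skipping files that 
share a timestamp at a batch boundary.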

I would like to be able to customize ListSFTP so that it tracks its progress 
better, even during disaster recovery of the NiFi cluster. Perhaps ListSFTP 
could accept inputs that use the Expression Language for the timestamps of 
folders.

I always have to place a ControlRate after ListSFTP, but I can never do the 
rate control within ListSFTP itself.
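
For illustration only, rate control applied while listing (rather than 
afterwards) could look like the following Python sketch. The throttled 
function and its parameters are hypothetical, it uses a simple fixed-window 
limit, and the clock and sleep functions are injectable only so that the 
behaviour can be checked without waiting.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttled(items: Iterable[T], max_per_interval: int, interval_seconds: float,
              clock: Callable[[], float] = time.monotonic,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[T]:
    """Yield items, emitting at most max_per_interval per time window."""
    if max_per_interval <= 0 or interval_seconds <= 0:
        raise ValueError("limits must be positive")
    window_start = clock()
    emitted = 0
    for item in items:
        if emitted >= max_per_interval:
            # Quota used up: wait out the rest of the window, then reset.
            remaining = interval_seconds - (clock() - window_start)
            if remaining > 0:
                sleep(remaining)
            window_start = clock()
            emitted = 0
        yield item
        emitted += 1
```

Because the source is consumed lazily, entries that have not yet been 
released by the throttle are never pulled from the listing.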



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
