[ 
https://issues.apache.org/jira/browse/NIFI-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

B O updated NIFI-5157:
----------------------
    Description: 
Currently, if ListSFTP is used on a folder containing tens of millions of
files and the Primary Node has only 32 GB of RAM, creating more than roughly
40 million flowfiles can freeze the ListSFTP threads, and the Primary Node
then has to be restarted.

This happens when, say, another system sends files to your system and
eventually builds up a backlog of tens of millions of files. Recursion won't
help either, even if the files are separated by folder; otherwise you'd need
some sort of "ControlRate"-like processor that passes flowfiles into ListSFTP
so that ListSFTP knows when to get files (rather than triggering on its own).

Maybe get only 500,000 flowfiles at a time from SFTP. Or check RAM so that it
doesn't try to pull in more than some formula based on available heap memory
allows. I found it interesting that GetFile has a lot of these properties
while ListFile/ListSFTP/ListHDFS don't.
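
As a rough illustration of the heap-based cap suggested above, the sketch
below derives a per-listing batch limit from the heap still available to the
JVM. It is not ListSFTP code: the class name HeapBoundedBatch, the 1 KiB
per-entry estimate, and the 25% heap fraction are assumptions picked for
illustration. Only the 500,000 ceiling comes from the suggestion above.

```java
/** Hypothetical helper, not part of NiFi: caps a listing batch by available heap. */
final class HeapBoundedBatch {
    // Assumed average heap cost of one listed entry (illustrative guess, not measured).
    static final long ASSUMED_BYTES_PER_ENTRY = 1024L;
    // Assumed share of the available heap that a single listing may consume.
    static final double ASSUMED_HEAP_FRACTION = 0.25;
    // Upper bound per listing, taken from the 500,000 figure suggested above.
    static final long HARD_CAP = 500_000L;

    /** Pure formula: how many entries fit in the given number of available bytes. */
    static long maxEntries(long availableHeapBytes) {
        long budget = (long) (availableHeapBytes * ASSUMED_HEAP_FRACTION);
        long byMemory = budget / ASSUMED_BYTES_PER_ENTRY;
        // Never return less than 1 so a listing can always make some progress.
        return Math.max(1L, Math.min(HARD_CAP, byMemory));
    }

    /** Heap the JVM can still claim: maximum heap minus what is currently in use. */
    static long availableHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }

    public static void main(String[] args) {
        System.out.println("Batch limit: " + maxEntries(availableHeapBytes()));
    }
}
```

With these assumed constants, a node with plenty of free heap hits the
500,000 ceiling, while a nearly full heap shrinks the batch. The real
per-entry cost would have to be measured before trusting any such formula.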

Also, there seem to be situations where NiFi more or less assumes a stable
environment. In unstable ones, with memory hardware failures, SFTP
transmission problems, or internet outages, it becomes difficult to recover an
ingest or to know where you left off (which might be useful for ListSFTP):

Batch processing usually requires a system to, say, separate things out into X
files/folders that can fit into the RAM of the primary node. We may need some
kind of feature like SQL's transaction "Commit" and "Rollback in case of
error". There needs to be an efficient way for small systems to take in large
volumes of data without crashing or, if crashes are inevitable, some sort of
batch transaction that can tell you where it left off, so that you don't have
to pull the same folder again but only, say, files after File Age =
some-number. When you press "view state", I find it funny that you can only
delete the state, not edit it.
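
To make the "commit / rollback / know where it left off" idea above concrete,
here is a minimal sketch of a checkpointed batch lister. It is an assumption
about how such a feature could behave, not how ListSFTP works: the class
CheckpointedLister, its Entry type, and the in-memory checkpoint field are all
hypothetical, and a real implementation would have to persist the checkpoint
durably and avoid holding the full listing in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not NiFi code: list a large folder in resumable batches. */
final class CheckpointedLister {
    /** One remote file as seen by a listing: name plus last-modified time in millis. */
    static final class Entry {
        final String name;
        final long modifiedMillis;

        Entry(String name, long modifiedMillis) {
            this.name = name;
            this.modifiedMillis = modifiedMillis;
        }
    }

    // Last committed entry; null means nothing has been committed yet.
    // Assumption: a real implementation would store this durably, not in a field.
    private Entry checkpoint;

    /** Total order on (modified time, name), so a checkpoint marks an exact position. */
    static int compare(Entry a, Entry b) {
        int byTime = Long.compare(a.modifiedMillis, b.modifiedMillis);
        return byTime != 0 ? byTime : a.name.compareTo(b.name);
    }

    /** Returns up to batchSize entries that sort strictly after the checkpoint. */
    List<Entry> nextBatch(List<Entry> remoteListing, int batchSize) {
        List<Entry> sorted = new ArrayList<>(remoteListing);
        sorted.sort(CheckpointedLister::compare);
        List<Entry> batch = new ArrayList<>();
        for (Entry e : sorted) {
            if (batch.size() == batchSize) {
                break;
            }
            if (checkpoint == null || compare(e, checkpoint) > 0) {
                batch.add(e);
            }
        }
        return batch;
    }

    /** "Commit": advance the checkpoint only after the whole batch was handed off. */
    void commit(List<Entry> batch) {
        if (!batch.isEmpty()) {
            checkpoint = batch.get(batch.size() - 1);
        }
    }

    public static void main(String[] args) {
        CheckpointedLister lister = new CheckpointedLister();
        List<Entry> listing = Arrays.asList(
                new Entry("b.csv", 100L), new Entry("a.csv", 100L), new Entry("c.csv", 200L));
        List<Entry> batch = lister.nextBatch(listing, 2);
        lister.commit(batch);
        System.out.println("Committed up to: " + batch.get(batch.size() - 1).name);
    }
}
```

A "rollback" here is simply not calling commit: the next call to nextBatch
returns the same entries again. One known gap in this sketch is that a file
arriving later with a modification time older than the checkpoint would be
skipped.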

I should be able to log in tomorrow and say "oh my ingest totally collapsed,
but at least I know roughly where it left off." This matters especially when
WAL recovery is impossible due to socket connection issues between nodes (or
active site-to-site connections), causing some NiFi nodes to refuse to load or
recover their state.

I would like to be able to customize ListSFTP so that it tracks things better,
even during disaster recovery of the NiFi cluster. Perhaps ListSFTP could
accept inputs that use the Expression Language for folder timestamps.

I always have to place a ControlRate after ListSFTP, but I can never do the
rate control within ListSFTP itself.

  was:
Currently, if ListSFTP is used on a folder containing tens of millions of
files and the Primary Node has only 32 GB of RAM, creating more than roughly
40 million flowfiles can freeze the ListSFTP threads, and the Primary Node
then has to be restarted.

This happens when, say, another system sends files to your system and
eventually builds up a backlog of tens of millions of files. Recursion won't
help either, even if the files are separated by folder; otherwise you'd need
some sort of "ControlRate"-like processor that passes flowfiles into ListSFTP
so that ListSFTP knows when to get files (rather than triggering on its own).

Also, there seem to be situations where NiFi more or less assumes a stable
environment. In unstable ones, with memory hardware failures, SFTP
transmission problems, or internet outages, it becomes difficult to recover an
ingest or to know where you left off (which might be useful for ListSFTP):

Batch processing usually requires a system to, say, separate things out into X
files/folders that can fit into the RAM of the primary node. We may need some
kind of feature like SQL's transaction "Commit" and "Rollback in case of
error". There needs to be an efficient way for small systems to take in large
volumes of data without crashing or, if crashes are inevitable, some sort of
batch transaction that can tell you where it left off, so that you don't have
to pull the same folder again but only, say, files after File Age =
some-number. I should be able to log in tomorrow and say "oh my ingest totally
collapsed, but at least I know roughly where it left off." This matters
especially when WAL recovery is impossible due to socket connection issues
between nodes (or active site-to-site connections), causing some NiFi nodes to
refuse to load or recover their state.

I would like to be able to customize ListSFTP so that it tracks things better,
even during disaster recovery of the NiFi cluster. Perhaps ListSFTP could
accept inputs that use the Expression Language for folder timestamps.

I always have to place a ControlRate after ListSFTP, but I can never do the
rate control within ListSFTP itself.


> ListSFTP for Massive Folders (without freezing)
> -----------------------------------------------
>
>                 Key: NIFI-5157
>                 URL: https://issues.apache.org/jira/browse/NIFI-5157
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>    Affects Versions: 1.3.0
>            Reporter: B O
>            Priority: Major
>              Labels: performance
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
