[
https://issues.apache.org/jira/browse/NIFI-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
B O updated NIFI-5157:
----------------------
Description:
Currently, if ListSFTP is used on a folder containing many millions of files
and the Primary Node has only 32GB of RAM, creating more than roughly 40
million flowfiles can freeze the ListSFTP threads, and the Primary Node then
has to be restarted.
This happens when, say, another system sends files to your system and a
backlog of tens of millions of files eventually builds up. Recursion won't
help either, even if you separate the files by folder; otherwise you'd need
some sort of ControlRate-like processor that passes flowfiles into ListSFTP so
that ListSFTP knows when to get files (rather than triggering on its own).
Maybe ListSFTP could get only 500,000 flowfiles at a time from SFTP, or check
RAM so that it doesn't try to pull in more than some formula based on
available heap memory allows. I found it interesting that GetFile has a lot of
these properties while ListFile/ListSFTP/ListHDFS don't.
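To make the "500,000 at a time, or some formula based on heap" idea concrete,
here is a rough sketch in plain Java. It is not NiFi code and not an existing
ListSFTP property: the class BatchLimit, its methods, and the per-entry memory
estimate are all hypothetical names and numbers chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class BatchLimit {
    // ASSUMPTION: rough heap cost of one listed file entry, in bytes.
    // This is an illustrative guess, not a measured NiFi figure.
    static final long ASSUMED_BYTES_PER_ENTRY = 1024L;

    // Entries allowed in one batch: a fraction of the free heap divided by
    // the assumed per-entry cost, never above hardCap and never below 1.
    static int maxEntries(long freeHeapBytes, double fraction, int hardCap) {
        long budget = (long) (freeHeapBytes * fraction);
        long byHeap = budget / ASSUMED_BYTES_PER_ENTRY;
        return (int) Math.max(1L, Math.min((long) hardCap, byHeap));
    }

    // Heap still available to the JVM: max heap minus what is in use now.
    static long freeHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.maxMemory() - (rt.totalMemory() - rt.freeMemory());
    }

    // Pulls at most 'limit' items from the source and leaves the rest for
    // the next call, so a huge listing is never held in memory at once.
    static <T> List<T> nextBatch(Iterator<T> source, int limit) {
        List<T> batch = new ArrayList<>();
        while (batch.size() < limit && source.hasNext()) {
            batch.add(source.next());
        }
        return batch;
    }
}
```

A caller would compute maxEntries(freeHeapBytes(), 0.25, 500_000) before each
run and pass the result to nextBatch; the 0.25 fraction is likewise only an
example value.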
Also, there seem to be situations where NiFi more or less assumes a stable
environment. In unstable ones, with memory hardware failures, SFTP
transmission problems, or internet outages, it becomes difficult to recover an
ingest or to know where you left off (which would be useful for ListSFTP).
Batch processing usually requires a system to split the work into some number
X of files/folders that can fit into the RAM of the Primary Node. We may need
some kind of feature like SQL transactions' "Commit" and "Rollback in case of
error". There needs to be an efficient way for small systems to take in large
volumes of data without crashing or, if crashes are inevitable, some sort of
batch transaction that can tell you where it left off, so that you don't have
to pull the same folder again but only, say, files after File Age =
some-number. When you press "view state", I find it funny that you can't edit
the state, only delete it.
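As a rough illustration of the "commit / rollback, and know where it left off"
idea, here is a minimal sketch in plain Java. It is not how ListSFTP stores
its state today; the class CheckpointedListing, the RemoteFile type, and the
sink callback are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Consumer;

final class CheckpointedListing {
    // Hypothetical stand-in for one entry of a remote directory listing.
    static final class RemoteFile {
        final String path;
        final long modifiedMillis;
        RemoteFile(String path, long modifiedMillis) {
            this.path = path;
            this.modifiedMillis = modifiedMillis;
        }
    }

    // Timestamp of the newest file whose batch was committed.
    private long committedMillis;

    CheckpointedListing(long startMillis) {
        this.committedMillis = startMillis;
    }

    long committedMillis() {
        return committedMillis;
    }

    // Hands the oldest uncommitted files to the sink, about batchSize at a
    // time. Returns the number of files committed, or 0 if the sink failed.
    int processBatch(List<RemoteFile> listing, int batchSize,
                     Consumer<List<RemoteFile>> sink) {
        List<RemoteFile> pending = new ArrayList<>();
        for (RemoteFile f : listing) {
            if (f.modifiedMillis > committedMillis) {
                pending.add(f);
            }
        }
        if (pending.isEmpty()) {
            return 0;
        }
        pending.sort(Comparator.comparingLong((RemoteFile f) -> f.modifiedMillis));

        int end = Math.min(Math.max(1, batchSize), pending.size());
        // Files sharing the last timestamp stay in the same batch; otherwise
        // a timestamp-only checkpoint would skip them on the next run.
        while (end < pending.size()
                && pending.get(end).modifiedMillis == pending.get(end - 1).modifiedMillis) {
            end++;
        }
        List<RemoteFile> batch = new ArrayList<>(pending.subList(0, end));
        try {
            sink.accept(batch);
        } catch (RuntimeException e) {
            return 0; // "rollback": checkpoint untouched, batch is retried later
        }
        committedMillis = batch.get(batch.size() - 1).modifiedMillis; // "commit"
        return batch.size();
    }
}
```

After a crash, the last committed timestamp is the "where it left off" marker;
a real processor would also have to persist that value somewhere durable,
which this sketch leaves out.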
I should be able to log in tomorrow and say "oh my ingest totally collapsed,
but at least I know roughly where it left off." This matters especially when
WAL recovery is impossible because socket connection issues between nodes (or
active site-to-site connections) cause some NiFi nodes to refuse to load or
recover their state.
I would like to be able to customize ListSFTP so that it tracks things better,
even when recovering from a disaster in the NiFi cluster. Perhaps ListSFTP
could accept inputs that use the Expression Language for folder timestamps.
I always have to place a ControlRate after ListSFTP, but I can never do the
rate control within ListSFTP itself.
> ListSFTP for Massive Folders (without freezing)
> -----------------------------------------------
>
> Key: NIFI-5157
> URL: https://issues.apache.org/jira/browse/NIFI-5157
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Core Framework
> Affects Versions: 1.3.0
> Reporter: B O
> Priority: Major
> Labels: performance
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)