Mark Payne created NIFI-11178:
---------------------------------
Summary: Improve memory efficiency of ListHDFS
Key: NIFI-11178
URL: https://issues.apache.org/jira/browse/NIFI-11178
Project: Apache NiFi
Issue Type: Improvement
Components: Extensions
Reporter: Mark Payne
ListHDFS is used extremely commonly. Typically, a listing consists of several
hundred files or fewer. However, there are times (especially when performing the
first listing, with the Processor configured to recurse into subdirectories)
when a single listing contains millions of files.
Currently, performing a listing of millions of files can occupy several GB of
heap space. Analyzing a recent heap dump, it was found that a listing of
6.7 million files in HDFS occupied approximately 12 GB of heap space in NiFi.
The heap usage can be traced back to the fact that we hold a
HashSet<FileStatus> in a local variable, and these FileStatus objects each
occupy approximately 1-2 KB of heap.
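As a quick sanity check on the 1-2 KB figure, the per-object cost implied by the heap dump can be computed directly (simple arithmetic, not taken from the dump itself):

{code:java}
public class HeapMath {
    public static void main(String[] args) {
        // From the heap dump: ~12 GB of heap across 6.7 million FileStatus objects
        final long heapBytes = 12L * 1024 * 1024 * 1024;
        final long perObject = heapBytes / 6_700_000L;
        // Roughly 1.9 KB per FileStatus, consistent with the 1-2 KB estimate above
        System.out.println(perObject + " bytes/FileStatus"); // prints "1923 bytes/FileStatus"
    }
}
{code}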
There are several improvements that can be made here. Some small changes will
yield significant memory savings. There are also larger changes that could
reduce heap utilization dramatically, to nearly nothing, but these would
require complex rewrites of the Processor that would be much more difficult to
maintain.
As a simple analysis, I built a small test to see how many FileStatus objects
could be kept in memory:
{code:java}
final Set<FileStatus> statuses = new HashSet<>();
for (int i = 0; i < 10_000_000; i++) {
    if (i % 10_000 == 0) {
        System.out.println(i);
    }

    // Octal 0777, cast to short for the FsPermission(short) constructor
    final FsPermission fsPermission = new FsPermission((short) 0777);
    final FileStatus status = new FileStatus(2, true, 1, 4_000_000, System.currentTimeMillis(), 0L, fsPermission,
        "owner-" + i, "group-" + i, null, new Path("/path/to/my/file-" + i + ".txt"), true, false, false);
    statuses.add(status);
}
{code}
This gives us a way to see how many FileStatus objects can be added to our set
before we encounter an OutOfMemoryError. With a 512 MB heap, I reached
approximately 1.13 million FileStatus objects. Note that the Paths here are
very short, so they occupy less memory than would typically be the case.
One small change, replacing the {{Set<FileStatus>}} with an
{{ArrayList<FileStatus>}}, raised the count from 1.13 million to 1.21 million
objects. This is reasonable, since we don't expect any duplicates anyway.
Another small change was to introduce a new class that keeps only the fields we
care about from the FileStatus:
{code:java}
private static class HdfsEntity {
    private final String path;
    private final long length;
    private final boolean isdir;
    private final long timestamp;
    private final boolean canRead;
    private final boolean canWrite;
    private final boolean canExecute;
    private final boolean encrypted;

    public HdfsEntity(final FileStatus status) {
        this.path = status.getPath().getName();
        this.length = status.getLen();
        this.isdir = status.isDirectory();
        this.timestamp = status.getModificationTime();
        this.canRead = status.getPermission().getGroupAction().implies(FsAction.READ);
        this.canWrite = status.getPermission().getGroupAction().implies(FsAction.WRITE);
        this.canExecute = status.getPermission().getGroupAction().implies(FsAction.EXECUTE);
        this.encrypted = status.isEncrypted();
    }
}
{code}
This introduced significant savings, allowing a {{List<HdfsEntity>}} to store
5.2 million objects instead of 1.13 million.
It is worth noting that HdfsEntity doesn't store the 'group' and 'owner' that
are part of the FileStatus. Interestingly, these values are extremely
repetitive, yet each is a unique String on the heap. A full solution would
track the owner and group, but do so using a Map<String, String> or similar in
order to reuse identical Strings on the heap (in much the same way that
String.intern() does, but without actually interning the String, since it is
only reusable within a small context).
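Such a context-local intern table can be sketched in plain Java (the names here are hypothetical, not part of ListHDFS):

{code:java}
import java.util.HashMap;
import java.util.Map;

// A small, context-local string pool: identical owner/group values are stored
// once and reused, without touching the JVM-wide intern table. The pool lives
// only for the duration of one listing, so it never grows unboundedly.
class StringPool {
    private final Map<String, String> pool = new HashMap<>();

    String intern(final String value) {
        // putIfAbsent returns the previously pooled instance, or null if this
        // value is new; either way, all callers end up sharing one instance.
        final String existing = pool.putIfAbsent(value, value);
        return existing == null ? value : existing;
    }
}

public class StringPoolDemo {
    public static void main(final String[] args) {
        final StringPool pool = new StringPool();
        final String a = pool.intern(new String("hdfs-user"));
        final String b = pool.intern(new String("hdfs-user"));
        // Both calls return the same instance, so millions of entities that
        // reference the same owner or group share a single String object.
        System.out.println(a == b); // prints "true"
    }
}
{code}

HdfsEntity's constructor would then store {{pool.intern(status.getOwner())}} and {{pool.intern(status.getGroup())}} rather than the raw values.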
These changes alone will yield significant memory improvements. However, more
invasive changes can yield even larger improvements:
* Rather than keeping a Set or a List of entities at all, we could instead
iterate over each FileStatus returned from HDFS and determine whether or not
the file should be included in the listing. If so, create the FlowFile or
write the Record to the RecordWriter, eliminating the collection altogether.
This would likely result in code similar to ListS3, which has an internal
{{S3ObjectWriter}} class that provides a single interface regardless of
whether a RecordWriter is in use. This would provide very significant memory
improvements; when using a Record Writer, it may even reduce heap usage to
nearly zero.
* However, when not using a Record Writer, we will still have the FlowFiles
kept in memory until the session is committed. One way to deal with this would
be to change the algorithm so that, instead of performing a single listing of
the directory and all sub-directories, we sort the sub-directories by name and
then commit the session and update state for each sub-directory individually.
This would introduce significant risk, though, in ensuring that we neither
duplicate data upon restart nor lose data. So it's perhaps not the best option.
* Alternatively, we could document the concern when not using a Record Writer
and provide an additionalDetails.html that shows the preferred way to use the
Processor when expecting many files: ListHDFS connected to a SplitRecord that
splits into chunks of, say, 10,000 Records, followed by a PartitionRecord that
creates attributes for the fields. This would give us the same result as
outputting without a Record Writer, but in a much more memory-efficient way.
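The writer abstraction from the first bullet could look roughly like this (hypothetical names, loosely modeled on ListS3's {{S3ObjectWriter}}; not the actual NiFi API):

{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical writer abstraction: one implementation would emit a FlowFile
// per entity, another would append a Record to a shared RecordWriter. The
// listing loop below never accumulates entities in a collection, so heap
// usage stays constant regardless of listing size.
interface ListingWriter {
    void write(String path, long length, long modificationTime);
    void finish();
}

// Stand-in implementation that just counts entries for demonstration.
class CountingWriter implements ListingWriter {
    int written = 0;

    public void write(String path, long length, long modificationTime) {
        written++; // a real implementation would create a FlowFile or Record here
    }

    public void finish() {
    }
}

public class StreamingListingDemo {
    // Iterate the remote listing one entry at a time; each entry is filtered
    // and written immediately, then becomes garbage-collectible.
    static void list(Iterator<String> remoteListing, ListingWriter writer) {
        while (remoteListing.hasNext()) {
            final String path = remoteListing.next();
            // per-entry filtering (timestamp/path checks) would happen here
            writer.write(path, 0L, 0L);
        }
        writer.finish();
    }

    public static void main(String[] args) {
        final List<String> paths = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            paths.add("/path/to/file-" + i);
        }
        final CountingWriter writer = new CountingWriter();
        list(paths.iterator(), writer);
        System.out.println(writer.written); // prints "5"
    }
}
{code}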
--
This message was sent by Atlassian Jira
(v8.20.10#820010)