[
https://issues.apache.org/jira/browse/NIFI-11178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tamas Palfy updated NIFI-11178:
-------------------------------
    Fix Version/s: 2.0.0
                   1.23.0
       Resolution: Fixed
           Status: Resolved  (was: Patch Available)
> Improve memory efficiency of ListHDFS
> -------------------------------------
>
> Key: NIFI-11178
> URL: https://issues.apache.org/jira/browse/NIFI-11178
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Mark Payne
> Assignee: Lehel Boér
> Priority: Major
> Fix For: 2.0.0, 1.23.0
>
> Attachments: image-2023-06-28-20-48-40-182.png,
> image-2023-06-28-20-48-57-012.png, image-2023-06-28-20-53-21-210.png,
> image-2023-06-28-20-54-00-887.png, image-2023-06-28-20-54-32-188.png
>
> Time Spent: 4h 50m
> Remaining Estimate: 0h
>
> ListHDFS is used extremely commonly. Typically, a listing consists of several
> hundred files or fewer. However, there are times (especially when performing
> the first listing) when the Processor is configured to recurse into
> subdirectories and creates a listing containing millions of files.
> Currently, performing a listing containing millions of files can occupy
> several GB of heap space. Analyzing a recent heap dump, it was found that a
> listing of 6.7 million files in HDFS occupied approximately 12 GB of heap
> space in NiFi. The heap usage can be traced back to the fact that we hold a
> HashSet<FileStatus> in a local variable, and these FileStatus objects occupy
> approximately 1-2 KB of heap each.
>
> There are several improvements that can be made here: some small changes will
> yield significant memory savings. There are also larger changes that could
> dramatically reduce heap utilization to nearly nothing, but these would
> require complex rewrites of the Processor that would be much more difficult
> to maintain.
>
> As a simple analysis, I built a small test to see how many FileStatus objects
> could be kept in memory:
> {code:java}
> final Set<FileStatus> statuses = new HashSet<>();
> for (int i = 0; i < 10_000_000; i++) {
>     if (i % 10_000 == 0) {
>         System.out.println(i);
>     }
>
>     final FsPermission fsPermission = new FsPermission(777);
>     final FileStatus status = new FileStatus(2, true, 1, 4_000_000, System.currentTimeMillis(), 0L, fsPermission,
>         "owner-" + i, "group-" + i, null, new Path("/path/to/my/file-" + i + ".txt"), true, false, false);
>     statuses.add(status);
> }
> {code}
> This gives us a way to see how many FileStatus objects can be added to our set
> before we encounter an OutOfMemoryError. With a 512 MB heap, I reached
> approximately 1.13 million FileStatus objects. Note that the Paths here are
> very small, so they occupy less memory than would normally be the case.
>
> Making one small change, replacing the {{Set<FileStatus>}} with an
> {{ArrayList<FileStatus>}}, yielded 1.21 million objects instead of 1.13
> million. This is reasonable, since we don't expect any duplicates anyway.
>
> Another small change was to introduce a new class that keeps only the fields
> we care about from the FileStatus:
> {code:java}
> private static class HdfsEntity {
>     private final String path;
>     private final long length;
>     private final boolean isdir;
>     private final long timestamp;
>     private final boolean canRead;
>     private final boolean canWrite;
>     private final boolean canExecute;
>     private final boolean encrypted;
>
>     public HdfsEntity(final FileStatus status) {
>         this.path = status.getPath().getName();
>         this.length = status.getLen();
>         this.isdir = status.isDirectory();
>         this.timestamp = status.getModificationTime();
>         this.canRead = status.getPermission().getGroupAction().implies(FsAction.READ);
>         this.canWrite = status.getPermission().getGroupAction().implies(FsAction.WRITE);
>         this.canExecute = status.getPermission().getGroupAction().implies(FsAction.EXECUTE);
>         this.encrypted = status.isEncrypted();
>     }
> }
> {code}
> This introduced significant savings, allowing a {{List<HdfsEntity>}} to store
> 5.2 million objects instead of 1.13 million.
>
> It is worth noting here that HdfsEntity doesn't store the 'group' and 'owner'
> that are part of the FileStatus. Interestingly, though, these values are
> extremely repetitive, yet each one is a distinct String object on the heap.
> So a full solution here would mean tracking the owner and group, but using a
> Map<String, String> or something similar to reuse identical Strings on the
> heap (much the same way that String.intern() does, but without actually
> interning the String, since it is only reusable within a small context).
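>
> For illustration only (this is not code from the Processor), such reuse could be
> as simple as a per-listing HashMap acting as a local intern table; {{status}}
> here is assumed to be a FileStatus from the listing loop:
> {code:java}
> // Sketch: a per-listing "canonicalization" map so that repeated owner/group values
> // share a single String instance, without calling String.intern().
> final Map<String, String> reusedStrings = new HashMap<>();
>
> // The first occurrence of a value is stored; every later occurrence reuses that same object.
> final String owner = reusedStrings.computeIfAbsent(status.getOwner(), Function.identity());
> final String group = reusedStrings.computeIfAbsent(status.getGroup(), Function.identity());
> {code}
> The map then holds only one entry per distinct owner/group value, which is tiny
> compared to millions of duplicate Strings.
>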
> These changes will yield significant memory improvements.
>
> However, more invasive changes can yield even larger improvements:
> * Rather than keeping a Set or a List of entities at all, we could instead
> just iterate over each of the FileStatus objects returned from HDFS and
> determine whether or not the file should be included in the listing. If so,
> create the FlowFile or write the Record to the RecordWriter, and eliminate
> the collection altogether. This would likely result in code similar to
> ListS3, which has an internal {{S3ObjectWriter}} class used to provide an
> interface that works regardless of whether or not a RecordWriter is being
> used (see the sketch after this list). This would provide very significant
> memory improvements. In the case of using a Record Writer, it may even
> reduce the heap usage to nearly 0.
> * However, when not using a Record Writer, we will still have the FlowFiles
> kept in memory until the session is committed. One method for dealing with
> this would be to change our algorithm so that instead of performing a single
> listing of the directory and all sub-directories, we sort the sub-directories
> by name and then commit the session and update state for each individual
> sub-directory. This would introduce significant risk, though, around ensuring
> that we don't duplicate data upon restart and that we don't lose data. So it's
> perhaps not the best option.
> * Alternatively, we could document the concern when not using a Record
> Writer and provide info in an additionalDetails.html that shows the preferred
> way to use the processor when expecting many files: ListHDFS would be
> connected to a SplitRecord processor that splits the listing into chunks of,
> say, 10,000 Records. These would then go to a PartitionRecord processor that
> creates attributes for the fields. This would give us the same result as
> outputting without the Record Writer, but in a way that is much more memory
> efficient.
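>
> To make the first option above more concrete, here is a rough sketch. The names
> {{HdfsObjectWriter}}, {{performListing}}, {{listRecursively}} and
> {{shouldInclude}} are made up for illustration and are not part of the current
> Processor; the idea simply mirrors ListS3's {{S3ObjectWriter}} so that nothing
> is accumulated on the heap:
> {code:java}
> // Sketch only (FileSystem, FileStatus, Path, RemoteIterator are from org.apache.hadoop.fs).
> // Each FileStatus is handed off as soon as it is seen, so no Set/List of statuses is kept.
> interface HdfsObjectWriter {
>     void beginListing() throws IOException;
>     void addToListing(FileStatus status) throws IOException;
>     void finishListing() throws IOException;
> }
>
> // One implementation would create a FlowFile with attributes per file; another would write
> // each file as a Record via the configured RecordSetWriter. The listing loop is the same
> // regardless of which implementation is used:
> void performListing(final FileSystem hdfs, final Path rootDir, final HdfsObjectWriter writer) throws IOException {
>     writer.beginListing();
>     listRecursively(hdfs, rootDir, writer);
>     writer.finishListing();
> }
>
> void listRecursively(final FileSystem hdfs, final Path dir, final HdfsObjectWriter writer) throws IOException {
>     final RemoteIterator<FileStatus> iterator = hdfs.listStatusIterator(dir);
>     while (iterator.hasNext()) {
>         final FileStatus status = iterator.next();
>         if (status.isDirectory()) {
>             listRecursively(hdfs, status.getPath(), writer); // recurse into sub-directories
>         } else if (shouldInclude(status)) {                  // hypothetical filter: min/max age, regex, last timestamp, etc.
>             writer.addToListing(status);                     // hand off immediately; nothing is retained
>         }
>     }
> }
> {code}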
--
This message was sent by Atlassian Jira
(v8.20.10#820010)