[
https://issues.apache.org/jira/browse/IO-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029690#comment-13029690
]
Stephen Kestle commented on IO-271:
-----------------------------------
Late modification-date updating would be needed in edge cases around merging
directories and detecting whether a file had been copied successfully. This is
due to "holes" that could form between batches.
After talking with others today, we came up with the idea of using an
incremental file filter that does the copy operation and then returns false,
so that the list of Files does not grow.
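A minimal sketch of that copy-then-reject filter, assuming a flat source directory. The class name {{CopyingFileFilter}} and its {{getCopied()}} counter are made up for illustration and are not part of commons-io:

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

/** Hypothetical filter: copies each plain file as it is offered, then rejects it. */
class CopyingFileFilter implements FileFilter {
    private final File destDir;
    private int copied;

    CopyingFileFilter(File destDir) {
        this.destDir = destDir;
    }

    @Override
    public boolean accept(File src) {
        if (src.isFile()) {
            try {
                // Copy immediately, so the File object need not be retained.
                Files.copy(src.toPath(), new File(destDir, src.getName()).toPath(),
                        StandardCopyOption.REPLACE_EXISTING);
                copied++;
            } catch (IOException e) {
                // FileFilter.accept cannot throw checked exceptions.
                throw new UncheckedIOException(e);
            }
        }
        // Always reject, so listFiles(FileFilter) accumulates no entries.
        return false;
    }

    int getCopied() {
        return copied;
    }
}
```

With this, {{srcDir.listFiles(new CopyingFileFilter(destDir))}} returns an empty array once the copies are done. The names array from the underlying {{list()}} call is still allocated; only the growing list of {{File}} objects is avoided.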
My earlier estimate of memory usage was actually quite wrong - {{listFiles()}}
is far worse:
# It calls {{list()}} (everything does, it's a native method)
# It allocates a new Array for the files
# It creates the {{File}} objects and (on Linux) resolves a new string for the
full path of each file. So the deeper the directory that holds the many files,
the longer each path will be (I was only using one short directory name when I
said double memory usage)
* If you're using the {{listFiles(FileFilter)}} method, an {{ArrayList}} is
populated, and then copied to an array at the end, using more memory.
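For comparison, a sketch of the {{list()}}-based alternative, where only the names array is held and each {{File}} is created one at a time. {{NameBasedCopy.copyFlat}} is a hypothetical helper, not the actual {{doCopyDirectory}} code, and it does not recurse into subdirectories:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

class NameBasedCopy {
    /** Copies the plain files directly inside srcDir into destDir; returns the count. */
    static int copyFlat(File srcDir, File destDir) throws IOException {
        String[] names = srcDir.list(); // names only, no File objects or full paths
        if (names == null) {
            throw new IOException("Cannot list " + srcDir);
        }
        int copied = 0;
        for (String name : names) {
            // One short-lived File per iteration instead of a full File[].
            File src = new File(srcDir, name);
            if (src.isFile()) {
                Files.copy(src.toPath(), new File(destDir, name).toPath(),
                        StandardCopyOption.REPLACE_EXISTING);
                copied++;
            }
        }
        return copied;
    }
}
```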
*Notes:*
* Trying to find out how much memory is used *while* {{File}} is performing
its internal copies and resolves is not trivial
* My memory use calculations (107 bytes vs 60 bytes for files with 10-char
names in a directory with a 4-char name) were taken after I'd done
{{System.gc()}}.
* If I skipped the {{gc}}, the Files took 167 bytes at the point of measuring
after a 5 second sleep
* Our Ant tests (where this all started) seem to indicate that (for 500,000
files, under the same conditions as my test above)
** {{File.list()}} (which Ant's copy initially uses) requires around 30 MB
** {{File.listFiles()}} (which commons-io uses) requires around 150 MB
** These requirements were found by lowering the JVM Xmx setting until the
respective {{File.list*()}} call no longer passed without an OOME.
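A probe along these lines could isolate the directory listing calls for that kind of Xmx test. This is only a sketch: {{ListingProbe}} and its command-line arguments are hypothetical, and the only JVM feature assumed is the standard {{-Xmx}} option:

```java
import java.io.File;

class ListingProbe {
    /** Lists dir with listFiles() or list() and returns the resulting array. */
    static Object[] listing(File dir, boolean useListFiles) {
        Object[] entries;
        if (useListFiles) {
            entries = dir.listFiles();
        } else {
            entries = dir.list();
        }
        if (entries == null) {
            throw new IllegalArgumentException("Not a readable directory: " + dir);
        }
        return entries;
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("usage: ListingProbe <dir> [listFiles]");
            return;
        }
        // Second argument "listFiles" selects listFiles(); anything else uses list().
        boolean useListFiles = args.length > 1 && "listFiles".equals(args[1]);
        Object[] entries = listing(new File(args[0]), useListFiles);
        System.out.println(entries.length + " entries listed");
    }
}
```

Running it repeatedly with different heap limits, for example {{java -Xmx32m ListingProbe /some/dir listFiles}}, and noting where an {{OutOfMemoryError}} first appears would give a rough bound for each method.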
I will post more conclusive results soon once I've done some more tests using
Xmx with only the directory listing methods.
> FileUtils.copyDirectory should be able to handle arbitrary number of files
> --------------------------------------------------------------------------
>
> Key: IO-271
> URL: https://issues.apache.org/jira/browse/IO-271
> Project: Commons IO
> Issue Type: Improvement
> Components: Utilities
> Affects Versions: 2.0.1
> Reporter: Stephen Kestle
> Priority: Minor
>
> File.listFiles() uses up to a bit over 2 times as much memory as File.list().
> The latter should be used in doCopyDirectory where there is no filter
> specified.
> This memory usage is a problem when copying directories with hundreds of
> thousands of files.
> I was also thinking of the option of implementing a file filter (that could
> be composed with the inputted filter) that would batch the file copy
> operation; copy the first 10000 (that match), then the next 10000 etc etc.
> Because of the lack of ordering consistency (between runs) of
> File.listFiles(), there would need to be a final file filter that would
> accept files that have not successfully been copied.
> I'm primarily concerned about copying into an empty directory (I validate
> this beforehand), but for general operation where it's a merge, the
> modification date re-writing should only be done in the final run of copies
> so that while batching occurs (and indeed the final "missed" filtering) files
> do not get copied if they have been modified after the start time. (I presume
> that I'm reading FileUtils correctly in that it overwrites files...)