As Alex mentioned, there are tools that keep filesystem metadata in a 
database and provide query interfaces.
NYGC uses Starfish and we’ve had a good experience with it. At first the only 
feature we used was “sfdu”, which is a quick replacement for recursive du; with 
it we can script CSV reports for selected directories. As we use Starfish more, 
we’ve started opening the web interface to people to look at selected areas of 
our filesystems, where they can sort directories by size, mtime, and atime, and 
run other reports and queries. We’ve also started using the tagging 
functionality so we can quickly get an aggregate total (and growth over time) 
by tag across multiple directories.
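
For example, a minimal cron-able sketch of that kind of CSV report. Note that 
sfdu’s exact flags are a guess here (modeled on du -s); check sfdu --help on 
your Starfish version for the real syntax:

    #!/bin/bash
    # Nightly CSV of usage per project directory via Starfish's sfdu.
    # NOTE: the -s flag is assumed (by analogy with du -s); verify locally.
    out=/reports/usage-$(date +%F).csv
    echo "directory,bytes" > "$out"
    for d in /gpfs/fs0/projects/*/ ; do
        echo "${d%/},$(sfdu -s "$d" | awk '{print $1}')" >> "$out"
    done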

We tried Robinhood years ago but found it took too much work to get it to 
scale to hundreds of millions of files and tens of PiB on GPFS. It might be 
better now.

IBM has a metadata product called Spectrum Discover that has the benefit of 
using GPFS-specific interfaces to stay up to date continuously. Many of the 
other tools require scheduled scans to update their database.
Igneous has a commercial tool called DataDiscover, which also looked promising. 
ClarityNow and MediaFlux are other similar tools.
I expect all of these tools at the very least offer nice replacements for du 
and find, as well as some sort of web directory-tree view.
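
For comparison, the plain du/find baseline these tools replace looks like the 
following (placeholder path; it’s exact but stats every inode, which is what 
makes it painful at this scale):

    du -sh /gpfs/fs0/projects/*                  # per-directory usage
    find /gpfs/fs0/projects -type f -atime +730  # files unread for ~2 years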

We had been running Starfish for a while, did a re-evaluation of a few options 
in 2019, and ultimately decided to stay with Starfish for now.

Best,
Chris

From: <[email protected]> on behalf of Alex Chekholko 
<[email protected]>
Reply-To: gpfsug main discussion list <[email protected]>
Date: Friday, April 3, 2020 at 7:51 PM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] fast search for archivable data sets

Hi Jim,

The common non-GPFS-specific way is to use a tool that dumps all of your 
filesystem metadata into an SQL database; you can then have a webapp that makes 
nice graphs/reports from that database, or run your own queries.

The Free Software example is "Robinhood" (use the POSIX scanner, not the 
Lustre-specific one), and one proprietary example is Starfish.
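
Once the metadata is in the database, finding archive candidates is a single 
aggregate query. Here is a sketch in MySQL-flavored SQL against a hypothetical 
entries(parent, size, atime) table; Robinhood's and Starfish's real schemas 
differ, so treat this as illustrative only:

    -- Directories holding > 5 TiB, none of it read in the last 2 years.
    -- Assumes atime is stored as epoch seconds (an assumption of this sketch).
    SELECT parent,
           SUM(size) / POW(1024, 4)   AS tib,
           FROM_UNIXTIME(MAX(atime))  AS newest_access
    FROM entries
    GROUP BY parent
    HAVING SUM(size) > 5 * POW(1024, 4)
       AND MAX(atime) < UNIX_TIMESTAMP(NOW() - INTERVAL 2 YEAR);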

In both cases, you need a pretty beefy machine for the DB, and scanning your 
filesystem may take a long time, depending on your filesystem performance. And 
without any filesystem-specific hooks like a transaction log, you'll need to 
rescan the entire filesystem to update your DB.

Regards,
Alex

On Fri, Apr 3, 2020 at 3:25 PM Jim Kavitsky <[email protected]> wrote:
Hello everyone,
I'm managing a low-multi-petabyte Scale filesystem with hundreds of millions of 
inodes, and I'm looking for the best way to locate archivable directories. For 
example, these might be directories whose contents total more than 5 or 10 TB 
and whose contents have atimes older than two years.

Has anyone found a great way to do this with a policy engine run? If not, is 
there another good way that anyone would recommend? Thanks in advance,

Jim Kavitsky
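
A sketch of the policy-engine approach Jim asks about: a LIST rule collects 
files with old atimes along with their sizes, then a short post-processing step 
sums them per directory. The filesystem path, thresholds, and output prefix 
below are placeholders, and the list-file format varies a bit by Scale release, 
so adjust the parsing to match:

    /* archivable.pol: list files not accessed in ~2 years, showing size */
    RULE EXTERNAL LIST 'stale' EXEC ''
    RULE 'old' LIST 'stale'
         SHOW(VARCHAR(FILE_SIZE))
         WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 730

    # Run deferred so the list is just written out, then sum per directory.
    # List lines look roughly like: <inode> <gen> <snapid> <size> -- <path>
    mmapplypolicy /gpfs/fs0 -P archivable.pol -I defer -f /tmp/stale
    awk -F' -- ' '{ sz=$1;  sub(/.* /, "", sz);
                    dir=$2; sub(/\/[^\/]*$/, "", dir); sum[dir] += sz }
        END { for (d in sum) if (sum[d] > 5 * 2^40) print sum[d], d }' \
        /tmp/stale.list.stale

This totals per immediate parent directory; rolling sizes up whole subtrees 
would take another pass over the output.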
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
