Re: [PERFORM] Need suggestion high-level suggestion on how to solve

Madison Kelly Thu, 07 Jul 2005 16:40:40 -0700

PFC wrote:

    Hello,
I once worked at a company making backup software and I remember these problems; we had exactly the same ones!


  Pretty neat. :)

The file tree was all in memory, and every time the user clicked on something it had to update everything. Being C++ it was very fast, but to back up a million files you needed a gig of RAM, which is... a problem, let's say, when you think my Linux laptop has about 400k files on it.

I want this to run on "average" systems (I'm developing it primarily on my modest P3 1GHz Thinkpad w/ 512MB RAM running Debian), so expecting that much free memory is not reasonable. As it is, my test DB, with a realistic amount of data, is ~150MB.

So we rewrote the project entirely with the purpose of doing the million files thingy on a clunky Pentium 90 with 64 megabytes of RAM, and it worked.
    What I did was this :
    - use Berkeley DB

<snip>

- the price of the licence to be able to embed it in your product and sell it is expensive, and if you want crash-proof, it's insanely expensive.

This is the kicker right there; my program is released under the GPL so it's fee-free. I can't eat anything costly like that. As it is, there are hundreds and hundreds of hours in this program that I am already hoping to recoup one day through support contracts. Adding commercial software, I am afraid, is not an option.

bonus : if you check a directory as "include" and one of its subdirectories as "exclude", and the user adds files all over the place, the files added in the "included" directory will be automatically backed up and the ones in the "ignored" directory will be automatically ignored; you have nothing to change.
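
[A minimal sketch of how that inheritance could be resolved, not the code from either program. The state names follow the three states mentioned later in this message (inherited, backup, ignore); the function name, the dict of explicit marks, and the default for unmarked trees are assumptions made for illustration.]

```python
from pathlib import PurePosixPath

# Hypothetical state names; only explicitly marked directories are stored.
INHERIT, BACKUP, IGNORE = "inherit", "backup", "ignore"

def effective_state(path, explicit, default=IGNORE):
    """Return the state a path ends up in by walking up to the nearest
    ancestor (or the path itself) that carries an explicit mark."""
    p = PurePosixPath(path)
    for node in (p, *p.parents):
        state = explicit.get(str(node), INHERIT)
        if state != INHERIT:
            return state
    # Nothing on the way up was marked: fall back to the assumed default.
    return default

# Only two directories are marked; new files need no extra bookkeeping.
explicit = {"/home": BACKUP, "/home/tmp": IGNORE}
print(effective_state("/home/docs/new.txt", explicit))
print(effective_state("/home/tmp/scratch.dat", explicit))
```

Because only the marked directories are stored, a file created later anywhere under /home picks up the right state without any update to the selection.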

<snip>

    IMHO it's the only solution.

Now *this* is an idea worth looking into. How I will implement it with my system I don't know yet, but it's a new line of thinking. Wonderful!

Now you'll ask me, but how do I calculate the total size of the backup without looking at all the files ? when I click on a directory I don't know what files are in it and which will inherit and which will not.
It's simple : you precompute it when you scan the disk for changed files. This is the only time you should do a complete tree exploration.

This is already what I do. When a user selects a partition to choose files to back up or restore, the partition is scanned. The scan looks at every file, directory and symlink and records its size (on disk), its mtime, owner, group, etc. in the database. I've got this scan/update running at ~1,500 files/second on my laptop. That was actually the first performance tuning I started with. :)
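
[A rough sketch of such a scan, under stated assumptions: sqlite3 stands in for the real database only to keep the example self-contained, and the table name, column names and function name are hypothetical, not the actual schema.]

```python
import os
import sqlite3

def scan_tree(root, conn):
    """Record size, mtime, owner and group for every entry below root.
    Returns the number of entries recorded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, size INTEGER, mtime REAL, "
        "uid INTEGER, gid INTEGER)"
    )
    rows = []
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)  # lstat describes the symlink itself
            # st_size is the logical size; on Unix the on-disk size
            # would be st.st_blocks * 512 instead.
            rows.append((full, st.st_size, st.st_mtime, st.st_uid, st.st_gid))
    # One batched statement instead of one round trip per file.
    conn.executemany(
        "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)

# Example use: scan_tree("/home", sqlite3.connect("scan.db"))
```

Batching the inserts is the design choice that matters for throughput here; a real scan would also need to handle files that disappear between the directory listing and the lstat call.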

With all the data in the DB, the backup script can calculate rather intelligently where it wants to copy each directory to.

On each directory we put a matrix [M]x[N], M and N each being one of the three states above, containing the amount of stuff in the directory which would be in state M if the directory was in state N. This is very easy to compute when you scan for new files. Then when a directory changes state, you have to sum a few cells of that matrix to know how much more that adds to the backup. And you only look up 1 record.
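
[One possible reading of that matrix, sketched under assumptions. It supposes the scan has already produced three totals per directory: bytes pinned to backup by explicit marks further down, bytes pinned to ignore the same way, and bytes that simply follow the directory. Those three inputs and all names below are hypothetical.]

```python
INHERIT, BACKUP, IGNORE = "inherit", "backup", "ignore"
STATES = (INHERIT, BACKUP, IGNORE)

def build_matrix(fixed_backup, fixed_ignore, inheriting):
    """matrix[m][n] = bytes below the directory that end up in state m
    when the directory itself is put in state n."""
    matrix = {m: {n: 0 for n in STATES} for m in STATES}
    for n in STATES:
        # Explicitly marked subtrees keep their state whatever n is.
        matrix[BACKUP][n] += fixed_backup
        matrix[IGNORE][n] += fixed_ignore
        # Everything else follows the directory's own state.
        matrix[n][n] += inheriting
    return matrix

def backup_delta(matrix, old_state, new_state):
    """Bytes added to (or removed from) the backup by a state change,
    read from one stored record with no tree walk."""
    return matrix[BACKUP][new_state] - matrix[BACKUP][old_state]

m = build_matrix(fixed_backup=100, fixed_ignore=30, inheriting=500)
print(backup_delta(m, IGNORE, BACKUP))  # 500
```

With the matrix stored next to the directory record, clicking a directory from "ignore" to "backup" costs one lookup and one subtraction.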

In my case what I do is calculate the size of all the files selected for backup in each directory, sort the directories from all sources by the total size of all their selected files, and then start assigning the directories, largest to smallest, to each of my available destination media. If it runs out of destination space it backs up what it can, then waits a user-definable amount of time and checks to see if any new destination media has been made available. If so, it again tries to assign the files/directories that didn't fit. It will loop a user-definable number of times before giving up and warning the user that more destination space is needed for that backup job.
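
[A minimal sketch of the largest-to-smallest assignment, not the actual program. Taking the first medium with enough room is an assumption, since the text does not say how a medium is picked; the function and variable names are hypothetical, and the wait-and-retry loop is left out.]

```python
def assign_directories(dir_sizes, media_free):
    """Place directories on media, largest first.
    dir_sizes:  {directory: bytes selected for backup}
    media_free: {medium: free bytes}
    Returns (placed, leftover)."""
    free = dict(media_free)  # work on a copy, leave the input untouched
    placed, leftover = {}, []
    by_size = sorted(dir_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for path, size in by_size:
        for medium in free:
            if size <= free[medium]:
                free[medium] -= size
                placed[path] = medium
                break
        else:
            # Nothing had room: keep it for a later retry pass.
            leftover.append(path)
    return placed, leftover

placed, leftover = assign_directories(
    {"/home": 700, "/etc": 50, "/var": 400},
    {"disk1": 800, "disk2": 300},
)
print(placed)    # {'/home': 'disk1', '/etc': 'disk1'}
print(leftover)  # ['/var']
```

The leftover list is what a retry pass would feed back in once new destination media shows up.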

    Is that helpful ?

The three states (inherited, backup, ignore) have definitely caught my attention. Thank you very much for your idea and lengthy reply!


Madison

