Re: [PERFORM] Need suggestion high-level suggestion on how to solve a performance problem

PFC Thu, 07 Jul 2005 15:31:19 -0700


        Hello,

I once worked at a company doing backup software and I remember these problems, we had exactly the same ones! The file tree was held entirely in memory, and every time the user clicked on something it had to update everything. Being C++ it was very fast, but to back up a million files you needed a gig of RAM, which is... a problem, let's say, when you think my Linux laptop has about 400k files on it. So we rewrote the project entirely with the purpose of doing the million-files thingy on a clunky Pentium 90 with 64 megabytes of RAM, and it worked.

        What I did was this :
        - use Berkeley DB

Berkeley DB isn't a database like postgres, it's just a tree, but it's cool for managing trees. It's quite fast, uses key compression, etc.

        It has however a few drawbacks :

- files tend to fragment a lot over time and it can't reindex or vacuum like postgres. You have to dump and reload.
- the price of the licence to be able to embed it in your product and sell it is expensive, and if you want crash-proof, it's insanely expensive.
- even though it's a tree, it has no idea what a parent is, so you have to mess with that manually. We used a clever path encoding to keep all the paths inside the same directory close together in the tree, and separate databases for dirs and files, because we wanted the dirs to be in the cache, whereas we almost never touched the files.
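To illustrate the "paths in the same directory stay close" idea, here is a minimal sketch of one possible key encoding. It is not the encoding we actually used, just an assumption for illustration: the key is the parent directory, a NUL separator, then the entry name. In a sorted store (a B-tree, or a sorted list here) all direct children of a directory then share a prefix and sit next to each other, without grandchildren in between. The names encode_key and children_of are hypothetical.

```python
import bisect

SEP = "\x00"  # assumed separator; NUL cannot appear in a file name and sorts before "/"

def encode_key(path):
    """Hypothetical key: parent directory + SEP + entry name."""
    parent, _, name = path.rstrip("/").rpartition("/")
    return (parent or "/") + SEP + name

def children_of(sorted_keys, directory):
    """Return the contiguous run of keys whose parent is `directory`."""
    prefix = directory + SEP
    i = bisect.bisect_left(sorted_keys, prefix)  # first key >= prefix
    found = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        found.append(sorted_keys[i])
        i += 1
    return found

paths = ["/var/www/a", "/var/www/a/x", "/var/www/b", "/var/log", "/var/www"]
keys = sorted(encode_key(p) for p in paths)
print(children_of(keys, "/var/www"))
```

With plain path strings as keys, /var/www/a/x would sort between /var/www/a and /var/www/b; with the separator trick the two siblings are adjacent.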


        And...

You can't make it work if you update every node every time the user clicks on something. You have to update 1 node.

        In your tree you have nodes.

Give each node a state, being one of these three: include, exclude, inherit. When you fetch a node you also fetch all of its parents, and you propagate the state down to know the state of the final node.

        If a node is in state 'inherit' it is like its parent, etc.
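A minimal sketch of that lookup, assuming the explicit states live in a dict keyed by path (any node not in the dict is 'inherit') and assuming that a chain of 'inherit' all the way up to the root means 'exclude'. Both are my assumptions for the example; effective_state and ancestors are hypothetical names.

```python
INCLUDE, EXCLUDE, INHERIT = "include", "exclude", "inherit"

def ancestors(path):
    """Yield the node itself, then each parent up to the root."""
    path = path.rstrip("/") or "/"
    while True:
        yield path
        if path == "/":
            return
        path = path.rsplit("/", 1)[0] or "/"

def effective_state(path, states, default=EXCLUDE):
    """Walk up from the node until something is not 'inherit'."""
    for node in ancestors(path):
        state = states.get(node, INHERIT)
        if state != INHERIT:
            return state
    return default  # assumed: nothing selected anywhere means not backed up

states = {"/var": INCLUDE, "/var/www/cache": EXCLUDE}
print(effective_state("/var/www/mysite/blah", states))
print(effective_state("/var/www/cache/tmp/x", states))
```

Changing what the user clicked on is then a single write to states, whatever the size of the subtree below it.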

So you have faster updates but slower selects. However, there is a bonus: if you mark a directory as "include" and one of its subdirectories as "exclude", and the user adds files all over the place, the files added in the "included" directory will be automatically backed up and the ones in the "excluded" directory will be automatically ignored; you have nothing to change.

And it is not that slow because, if you think about it, suppose you have /var/www/mysite/blah with 20,000 files in it: in order to inherit the state of the parents on them you only have to fetch /var once, www once, etc. So if you propagate your inherited properties while doing a tree traversal, it comes at no cost.
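Here is a rough sketch of propagating the state during a traversal. The tree layout (nested dicts, files as None) and the walk function are assumptions made up for the example; the point is only that each directory's state is resolved once and handed down to everything below it.

```python
def walk(tree, states, path="", inherited="exclude"):
    """Yield (path, effective state) for every node, top-down."""
    for name, subtree in sorted(tree.items()):
        child = path + "/" + name
        state = states.get(child, "inherit")  # missing entry means 'inherit'
        effective = inherited if state == "inherit" else state
        yield child, effective
        if isinstance(subtree, dict):  # a directory: pass its state to its children
            yield from walk(subtree, states, child, effective)

tree = {"var": {"www": {"mysite": {"a.html": None, "b.html": None},
                        "cache": {"tmp": None}}}}
states = {"/var/www": "include", "/var/www/cache": "exclude"}
for path, state in walk(tree, states):
    print(path, state)
```

The states dict is looked up once per node and no parent is ever fetched twice, which is the "no cost" part.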

        
        IMHO it's the only solution.

It can also be done quite easily using ltree types and a few small stored procedures; you can even make a view which gives the state of each element, computed by inheritance.

Here's the secret : the user will select 100,000 files by clicking on a directory near root, but the user will NEVER look at 100,000 files. So you can afford to make looking at files 10x slower if that makes including/excluding directories 100,000 times faster.

Now you'll ask me: but how do I calculate the total size of the backup without looking at all the files? When I click on a directory I don't know what files are in it, nor which will inherit and which will not.

It's simple : you precompute it when you scan the disk for changed files. This is the only time you should do a complete tree exploration.

On each directory we put a matrix [M]x[N], M and N each being one of the three states above, containing the amount of stuff in the directory which would be in state M if the directory was in state N. This is very easy to compute when you scan for new files. Then when a directory changes state, you only have to sum a few cells of that matrix to know how much that adds to the backup. And you only look up 1 record.
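A simplified sketch of the precomputation, not the exact matrix: here each directory only stores how many bytes below it would be backed up if the directory resolved to "include" and if it resolved to "exclude". The tree layout (nested dicts, files as byte sizes) and the names summarize and delta are assumptions for the example.

```python
def summarize(tree, states, path="", out=None):
    """Fill out[dir] = {dir_state: bytes backed up below dir in that state}."""
    if out is None:
        out = {}
    totals = {"include": 0, "exclude": 0}
    for name, node in tree.items():
        child = path + "/" + name
        own = states.get(child, "inherit")
        if isinstance(node, dict):  # subdirectory: summarize it first (bottom-up)
            summarize(node, states, child, out)
            child_totals = out[child]
        else:                       # file: node is its size in bytes
            child_totals = {"include": node, "exclude": 0}
        for dir_state in totals:
            # an explicit state on the child wins over what the directory does
            resolved = dir_state if own == "inherit" else own
            totals[dir_state] += child_totals[resolved]
    out[path or "/"] = totals
    return out

def delta(summary, directory, old, new):
    """Bytes added to the backup when `directory` goes from old to new."""
    return summary[directory][new] - summary[directory][old]

tree = {"var": {"www": {"a.html": 100, "b.html": 200, "cache": {"tmp.bin": 1000}}}}
summary = summarize(tree, {"/var/www/cache": "exclude"})
print(delta(summary, "/var/www", "exclude", "include"))
```

The full scan happens once in summarize; a click on a directory afterwards reads one record and subtracts two numbers.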


        Is that helpful ?















