Roger, Unless I misread you, you are implying that _id is stored increasingly less inefficiently in the .couch file as its length increases? I don't think, unless you've really dug into the disk structure, that this assertions will hold. A better measure is the number of btree nodes on disk for the different kinds of _id. I would expect to see more nodes (and a higher rate of 'turnover') for large random ids than equally as large sequential ids. Large random ids versus short sequential ids should show a larger difference but for two unrelated reasons.
B. On Mon, Jan 11, 2010 at 9:33 AM, Robert Newson <[email protected]> wrote: > I should point out that the sequential algorithm in couchdb is > carefully constructed so that generated ids won't clash, even in a > distributed system. You might have assumed that the sequential ids > were 1, 2, 3, 4, ... and so on, but they are not. > > The sequential ids are the same length as the random ids (16 bytes). > The first 13 bytes stay the same for around 8000 generated ids and is > then rerandomized. The ids with the same prefix have suffixes in > strictly increasing numeric order. This characteristic (that a new id > is numerically close to the previous id) is what helps with insertion > speed and general b-tree performance. > > Before changing the default I think it would be worth getting numbers > from a suitably fair benchmark, I would still advocate random as the > default until that is done. > > B. > > On Mon, Jan 11, 2010 at 12:51 AM, Chris Anderson <[email protected]> wrote: >> On Sun, Jan 10, 2010 at 4:24 PM, Roger Binns <[email protected]> wrote: >>> -----BEGIN PGP SIGNED MESSAGE----- >>> Hash: SHA1 >>> >>> Chris Anderson wrote: >>>> I'm not feeling super-strong about this. However, making the default >>>> sequential seems like it will preempt a lot of the problems people >>>> tend to show up asking about. >>> >> >> If we think that speed and size are more important than randomness, we >> should continue to refine uuid generators. >> >> Roger, if you can make a short sequential that'd be neat. >> >>> There are several issues conflated together: >>> >>> - - When doing inserts, sorted ids are faster >>> >>> - - The resulting size of the db file is the size of the docs plus a >>> multiple >>> of the _id size (and probably an exponential of the size) >>> >>> - - Sequential ids give small _id >>> >>> - - Random ids give large _id >>> >>> - - Sequentials will clash between different dbs (consider replication, >>> multiple instances etc). They'll also lead people to rely on this >>> functionality as though it was like a SQL primary key >>> >>> - - Random ids won't clash and better illustrate how CouchDB really works >>> >>>> I think the info-leakage argument is overblown >>> >>> It does make URLs easy to guess like many ecommerce sites that didn't >>> validate when showing you an invoice - you added one to the numeric id in >>> the URL and got to see someone elses. >>> >>> I would far prefer the size of the db file and the size of the _id link >>> being addressed. Because the _id size can cause the db file to get so big, >>> I/O etc is a lot slower mainly because there is just so much more file to >>> deal with! (In my original posting I had a db go from 21GB to 4GB by >>> reducings ids from 16 bytes to 4 bytes.) >>> >>> Roger >>> >>> -----BEGIN PGP SIGNATURE----- >>> Version: GnuPG v1.4.9 (GNU/Linux) >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org >>> >>> iEYEARECAAYFAktKb8AACgkQmOOfHg372QRb0ACfRWu1TUOs3twwmOGgAUOwhLfx >>> FJkAoKgnkWnPayPtPqMfk3/AxOj2xaMx >>> =V7Zq >>> -----END PGP SIGNATURE----- >>> >>> >> >> >> >> -- >> Chris Anderson >> http://jchrisa.net >> http://couch.io >> >
