Re: [Catalyst] Hypothetical Site and Scalability Planning
One company presented their Perl-based, large-scale SNS site at YAPC::Asia, IIRC. Sorry, it might have been Six Apart as mentioned above; I can't remember which. I do know they wrote their own system to split and merge their user pool by user name (alphabetical order), splitting a partition off to more servers when it got too full, which simplifies things.

___
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/[EMAIL PROTECTED]/
Dev site: http://dev.catalyst.perl.org/
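To make the split-by-user-name idea concrete, here is a minimal Python sketch. It is not the system described in the talk; the class name, the `max_users_per_partition` parameter, and the in-memory lists standing in for servers are all assumptions made for illustration. It assumes unique user names, and it only shows splitting, not merging.

```python
import bisect


class UserPartitioner:
    """Hypothetical range partitioner: assigns user names to partitions
    by alphabetical range and splits a partition when it gets too full."""

    def __init__(self, max_users_per_partition=4):
        self.max_users = max_users_per_partition
        # boundaries[i] is the lowest user name that belongs to partition i.
        self.boundaries = [""]
        # Each partition is a sorted list of user names (stand-in for a server).
        self.partitions = [[]]

    def partition_for(self, username):
        # Rightmost partition whose lower bound is <= username.
        return bisect.bisect_right(self.boundaries, username) - 1

    def add_user(self, username):
        idx = self.partition_for(username)
        users = self.partitions[idx]
        bisect.insort(users, username)
        if len(users) > self.max_users:
            self._split(idx)

    def _split(self, idx):
        # Move the upper half of an over-full partition to a new partition.
        users = self.partitions[idx]
        mid = len(users) // 2
        lower, upper = users[:mid], users[mid:]
        self.partitions[idx] = lower
        self.partitions.insert(idx + 1, upper)
        self.boundaries.insert(idx + 1, upper[0])
```

Lookups stay cheap because finding a user's partition is a binary search over the boundary list, and a split only touches the one partition that overflowed.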
RE: [Catalyst] Hypothetical Site and Scalability Planning
At 21:58 2007-10-26, Mesdaq, Ali wrote:
> I personally think that storing images in the DB is the best place to start, because if better solutions become available later you can very easily migrate. But if you start out with the filesystem, migration is a little bit more kludgy in my opinion. I mean you have to traverse directories and copy/move/delete or whatever you have to do for the migration.

I haven't tried serving images or large objects from a database this way, but wouldn't reading this data totally blow the db cache for the rest of the things the database needs to do? At least it's something I'd investigate and create spike solutions for before deploying. It sounds very much vendor and/or configuration dependent, so off to the manuals :)

One thing that could work though, letting each component do what it's best at, is to let the db store and _manage_ the media assets, but to write them to disk on first access (and clear out unused files in a cache-like way) and let the web server _serve_ them efficiently.

/J
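A rough Python sketch of the "write to disk on first access, clear out unused files" idea follows. Everything named here is hypothetical: `DiskAssetCache`, the `load_from_db` callable (a stand-in for whatever query fetches the blob), the integer asset ids, and the idle-time eviction policy based on file modification time. Treat it as one possible shape, not a tested design.

```python
import os
import time


class DiskAssetCache:
    """Hypothetical cache: the database owns the assets, the disk holds copies
    that a web server could serve directly."""

    def __init__(self, cache_dir, load_from_db, max_idle_seconds=3600):
        self.cache_dir = cache_dir
        self.load_from_db = load_from_db  # callable: asset_id -> bytes
        self.max_idle_seconds = max_idle_seconds
        os.makedirs(cache_dir, exist_ok=True)

    def path_for(self, asset_id):
        # Assumes integer ids, which also keeps the file name safe.
        return os.path.join(self.cache_dir, f"{int(asset_id)}.bin")

    def ensure_on_disk(self, asset_id):
        path = self.path_for(asset_id)
        if os.path.exists(path):
            os.utime(path)  # touch the file to mark it as recently used
        else:
            data = self.load_from_db(asset_id)
            tmp = path + ".tmp"
            with open(tmp, "wb") as fh:
                fh.write(data)
            os.replace(tmp, path)  # rename so readers never see a partial file
        return path

    def evict_unused(self, now=None):
        # Remove files that have not been touched for max_idle_seconds.
        now = time.time() if now is None else now
        removed = []
        for name in sorted(os.listdir(self.cache_dir)):
            path = os.path.join(self.cache_dir, name)
            if now - os.path.getmtime(path) > self.max_idle_seconds:
                os.remove(path)
                removed.append(name)
        return removed
```

In this sketch the database is hit only on the first request for an asset; `evict_unused` would be run periodically (for example from cron) to keep the directory from growing without bound.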
Re: [Catalyst] Hypothetical Site and Scalability Planning
wrote:
> At 21:58 2007-10-26, Mesdaq, Ali wrote:
>> I personally think that storing images in the DB is the best place to start, because if better solutions become available later you can very easily migrate. But if you start out with the filesystem, migration is a little bit more kludgy in my opinion. I mean you have to traverse directories and copy/move/delete or whatever you have to do for the migration.
>
> I haven't tried serving images or large objects from a database this way, but wouldn't reading this data totally blow the db cache for the rest of the things the database needs to do? At least it's something I'd investigate and create spike solutions for before deploying. It sounds very much vendor and/or configuration dependent, so off to the manuals :)
>
> One thing that could work though, letting each component do what it's best at, is to let the db store and _manage_ the media assets, but to write them to disk on first access (and clear out unused files in a cache-like way) and let the web server _serve_ them efficiently.
>
> /J

If putting large media files in the DB is really helpful to you in managing / replicating your data, you definitely do want to cache them in a filesystem and serve them directly via a dumb webserver process. Then it's up to your application to invalidate those cached files on update.

In my case, I've found that Apache's mod_rewrite with a -f RewriteCond is your friend here. Just test for the file in a cache/ directory and serve it if it exists. If it doesn't, fall through to the Catalyst app and then save the file off in your end() action.
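The control flow described above (check for a cached file, otherwise fall through to the application, save the result, and invalidate on update) can be sketched in Python as below. This only models the logic; in the setup described in the message the file test is done by Apache's mod_rewrite and the save happens in Catalyst's end() action. The function names and the `render` callable are assumptions for illustration.

```python
import os


def cache_path(cache_root, url_path):
    """Map a URL path to a file under cache_root, refusing paths that escape it."""
    root = os.path.abspath(cache_root)
    candidate = os.path.abspath(os.path.join(root, url_path.lstrip("/")))
    if candidate == root or os.path.commonpath([root, candidate]) != root:
        raise ValueError("path does not name a file inside the cache root")
    return candidate


def handle_request(cache_root, url_path, render):
    """Return (body, source), where source is 'cache' or 'app'."""
    path = cache_path(cache_root, url_path)
    if os.path.isfile(path):  # plays the role of the -f file test
        with open(path, "rb") as fh:
            return fh.read(), "cache"
    body = render(url_path)  # fall through to the application
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fh:  # what the end-of-request hook would do
        fh.write(body)
    return body, "app"


def invalidate(cache_root, url_path):
    """Delete the cached copy after an update; returns True if one existed."""
    try:
        os.remove(cache_path(cache_root, url_path))
        return True
    except FileNotFoundError:
        return False
```

The application is responsible for calling something like `invalidate` whenever the underlying media changes, otherwise the web server keeps serving the stale file.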
Re: [Catalyst] Hypothetical Site and Scalability Planning
On 10/26/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
> NFS gets a bad rap. As long as you do sane planning and lay it out properly, NFS works very, very well for serving static files to the webservers. Breaking out to S3 seems silly (Amazon is out to make money with S3, and if you do it yourself you should be able to do it for less cost). KISS works wonders as long as you think about usability. Get a sysadmin to think out the NFS side realistically (masters with multiple read-onlys, etc.).

NFS gets a deserved rap. It isn't that anybody is knocking it; it is simply not the best tool for the job, and it wasn't designed to be used in a way that services millions of requests out... Sure, it can do it, but it isn't meant to do it.

And using S3 is a great way to scale out and save money -- look at SmugMug. Sure, you could do it yourself if you ignore the cost of a dedicated sysadmin for your infrastructure. If you're sufficiently talented, you can easily manage the rest of your cluster -- storage is _ALWAYS_ a thorn when you do it in house. You need expertise. If you have to get a sysadmin just because you want to use NFS, all the cost savings of doing it in house just blew up.

And if you do want to do it in house, compare it to MogileFS -- MogileFS may not give you the same performance numbers as NFS, but it has tracking and replication built in; NFS lacks that and requires you to do a lot of things on your own.

If you really know what you are doing, NFS can be great. If you don't, NFS will haunt you in your dreams. If you have to hire someone to maintain your NFS cluster, then I would recommend just paying the money to Amazon -- at least Amazon can be written off as a business expense.

-J (who has nothing against NFS, just more in favor of other services)

--
J. Shirley :: [EMAIL PROTECTED] :: Killing two stones with one bird...
http://www.toeat.com
RE: [Catalyst] Hypothetical Site and Scalability Planning
J,

Amazing feedback, this is great! I think memcached is great. I haven't had time to play with it yet, but I have read pretty much everything and am prepped to play with it once I have a chance.

I personally think that storing images in the DB is the best place to start, because if better solutions become available later you can very easily migrate. But if you start out with the filesystem, migration is a little bit more kludgy in my opinion. I mean you have to traverse directories and copy/move/delete or whatever you have to do for the migration.

We have been using MySQL on some pretty big internal projects here and it's been working satisfactorily. However, there are issues with it that make me not so confident in these big claims of large sites using it. Mainly it's the scaling-out paradigm that is not very clear with MySQL. We tried using replication with master/slaves and the replication speed was way too slow. Then the whole clustering approach with MySQL seems to be very confusing and not very well documented, as far as I have poked around. The only really solid scaling approaches I have seen with MySQL are either using VMware to cluster hardware at the hardware/OS/VM layer to make one big virtual machine, or using third-party hardware/software bundles with MySQL, like ones from NetApp or similar. I wish clustering with MySQL was as simple as adding a node to the cluster and gaining 0.7 of a machine's performance per machine added.

Another very intriguing thing with super large sites is the actual schema design. You have to be very smart about design, data segregation, indexes, etc. I don't know for sure, but I am pretty sure sites like myspace don't just have one huge users table with user_id, email, sha1_password. I would imagine they have segregated users into separate schemas, which would scale far better than MySQL replication or clustering would. Something like every 10,000 users being allocated to a new MySQL server.
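As a small illustration of the "every 10,000 users on a new server" idea, here is a Python sketch of mapping a user id to a shard. The constant, the function names, and the host names in the usage note are made up for the example; it assumes user ids are sequential integers starting at 1 and says nothing about how any real site lays out its data.

```python
USERS_PER_SHARD = 10_000  # assumption taken from the figure in the message


def shard_for_user(user_id, users_per_shard=USERS_PER_SHARD):
    """Return the zero-based shard index for a sequential user id (starting at 1)."""
    if user_id < 1:
        raise ValueError("user ids are assumed to start at 1")
    return (user_id - 1) // users_per_shard


def host_for_user(user_id, hosts, users_per_shard=USERS_PER_SHARD):
    """Pick the database host for a user from an ordered list of hosts."""
    index = shard_for_user(user_id, users_per_shard)
    if index >= len(hosts):
        raise LookupError(f"no host provisioned yet for shard {index}")
    return hosts[index]
```

With a hypothetical list such as `["db0.example", "db1.example"]`, user 10000 lands on the first host and user 10001 on the second. At this shard size, 1 million users would need 100 servers, so the chunk size is itself a planning decision.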
Thanks,
--
Ali Mesdaq
Security Researcher II
Websense Security Labs
http://www.WebsenseSecurityLabs.com
--

-----Original Message-----
From: J. Shirley [mailto:[EMAIL PROTECTED]
Sent: Friday, October 26, 2007 12:31 PM
To: The elegant MVC web framework
Subject: Re: [Catalyst] Hypothetical Site and Scalability Planning

On 10/26/07, Mesdaq, Ali [EMAIL PROTECTED] wrote:
> Hey All,
>
> Just wanted to start a thread about scalability planning and design. I was thinking we could take the approach of what people's opinions, ideas, and best practices are for large-scale sites and use a hypothetical site or an existing site as the model to plan for. Not everything discussed needs to be Catalyst only; it could be general web server configs or something similar.
>
> For example, how would you guys approach a project where you needed to create a site like myspace.com or similar, with 0 current users, that could surpass 1 million users in 1 month and then 100 million in 1 year? I am interested to see the opinions and designs people would have to deal with that type of scalability. I mean even simple issues become very complex with those numbers. Like where and how to store photos. Should they be stored on the filesystem, in the db, or on external sites like Akamai? What web server should be used? Apache? Should it be the threaded version? How does that affect Catalyst and its modules? Are they all thread safe, or is threaded Apache not even the way to go?

Here are my opinions on the matter:

1) Start out with memcached in place. It scales well, so use it. Use PageCache where you can.

2) Store images in something that is for storing data, not files. Storing images as files means you are stuck with some file system format that binds you unnecessarily. Things like S3, Akamai or your own homegrown MogileFS cluster give you an API into the data. Granted, you could do the same for NFS or whatever and just write a good compatibility API, but then you are largely duplicating the work of the previous tech.

If you use S3, set up your image servers to cache for a long time (on disk). Pull from S3, and store it for as long as you reasonably can. This is an area a lot of people get wrong, and then they get stuck with costly migrations.

3) Use database replication strategies where you can. In the F/OSS world, MySQL is outshining PostgreSQL with this. InnoDB removes a lot of the complaints that folks have about MySQL, but there is always evangelism against MySQL. If it works for you, just take it in stride - a LOT of high traffic sites use MySQL; you can usually get some insight from them. MySQL allows InnoDB on the master and MyISAM on the slaves -- that gets you faster read times and tends not to block on inserts that badly -- and then as you grow it is easier to grow into a full-blown MySQL cluster... but at that point, you have enough money to thoroughly explore every option available.

4) You'll have to tune Apache or whatever web