Re: [Catalyst] Hypothetical Site and Scalability Planning

2007-10-30 Thread Matt Rosin
One company described their Perl-based large-scale SNS site at YAPC::Asia.
IIRC it might have been Six Apart, as mentioned above; sorry, I can't remember
which. I do know they wrote their own system to split and merge their user
pool by user name (alphabetical order), splitting a partition off to more
servers when it got too full, which simplifies things.
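A minimal sketch of that kind of name-range partitioning (hypothetical Python for illustration; the class, names, and thresholds are my assumptions, not the company's actual system):

```python
# Sketch: route users to shards by name range, splitting a shard
# at its median name when it exceeds a size threshold.
import bisect

class NamePartitioner:
    def __init__(self, max_per_shard=10000):
        self.bounds = [""]      # sorted lower bounds; "" covers all names
        self.shards = [set()]   # one user set per name range
        self.max_per_shard = max_per_shard

    def shard_index(self, username):
        # Find the rightmost range whose lower bound <= username.
        return bisect.bisect_right(self.bounds, username.lower()) - 1

    def add_user(self, username):
        i = self.shard_index(username)
        self.shards[i].add(username.lower())
        if len(self.shards[i]) > self.max_per_shard:
            self._split(i)

    def _split(self, i):
        # Split the full shard at its median name; the upper half
        # moves to a (conceptually) new server.
        users = sorted(self.shards[i])
        mid = len(users) // 2
        self.bounds.insert(i + 1, users[mid])
        self.shards[i] = set(users[:mid])
        self.shards.insert(i + 1, set(users[mid:]))
```

In practice each entry in `shards` would map to a real database server, and the split would involve migrating rows, but the routing logic is this simple.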
___
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/[EMAIL PROTECTED]/
Dev site: http://dev.catalyst.perl.org/


RE: [Catalyst] Hypothetical Site and Scalability Planning

2007-10-28 Thread Johan Lindström

At 21:58 2007-10-26, Mesdaq, Ali wrote:

I personally think that storing images in the DB is the best place to
start, because if other, better solutions become available later you can
very easily migrate. But if you start out with the filesystem, migration
is a little more kludgy in my opinion: you have to traverse directories
and copy/move/delete, or whatever you have to do for the migration.


I haven't tried serving images or large objects from a database this 
way, but wouldn't reading this data totally blow the db cache for the 
rest of the things the database needs to do? At least it's something 
I'd investigate and create spike solutions for before deploying. It 
sounds very much vendor and/or configuration dependent, so off to the 
manuals :)


One thing that could work though, letting each component do what it's 
best at, is to let the db store and _manage_ the media assets, but to 
write them to disk on first access (and clear out unused files in a 
cache-like way) and let the web server _serve_ them efficiently.
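A minimal sketch of that "DB manages, filesystem serves" pattern (hypothetical Python for brevity; `fetch_blob`, the cache directory, and the naming scheme are all assumptions):

```python
# Sketch: the DB owns the media blobs; on first access we materialize
# the blob to disk so the web server can serve the static file directly.
import os

CACHE_DIR = "/var/cache/media"

def cached_path(asset_id, fetch_blob):
    path = os.path.join(CACHE_DIR, str(asset_id))
    if not os.path.exists(path):       # first access: pull from the DB
        data = fetch_blob(asset_id)    # e.g. SELECT data FROM assets WHERE id=?
        os.makedirs(CACHE_DIR, exist_ok=True)
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:     # write to a temp file first so a
            f.write(data)              # half-written file is never served
        os.replace(tmp, path)          # atomic rename into place
    return path                        # hand this path to the web server
```

An eviction pass (LRU by atime, say) and cache invalidation on update would complete the picture, as Johan notes.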



/J




Re: [Catalyst] Hypothetical Site and Scalability Planning

2007-10-28 Thread Brian Kirkbride

Johan Lindström wrote:

At 21:58 2007-10-26, Mesdaq, Ali wrote:

I personally think that storing images in the DB is the best place to
start, because if other, better solutions become available later you can
very easily migrate. But if you start out with the filesystem, migration
is a little more kludgy in my opinion: you have to traverse directories
and copy/move/delete, or whatever you have to do for the migration.


I haven't tried serving images or large objects from a database this 
way, but wouldn't reading this data totally blow the db cache for the 
rest of the things the database needs to do? At least it's something I'd 
investigate and create spike solutions for before deploying. It sounds 
very much vendor and/or configuration dependent, so off to the manuals :)


One thing that could work though, letting each component do what it's 
best at, is to let the db store and _manage_ the media assets, but to 
write them to disk on first access (and clear out unused files in a 
cache-like way) and let the web server _serve_ them efficiently.



/J


If putting large media files in the DB is really helpful to you in 
managing / replicating your data, you definitely do want to cache in a 
filesystem to serve directly via a dumb webserver process.  Then 
it's up to your application to invalidate those cached files on update.


In my case, I've found that Apache's mod_rewrite with a -f
RewriteCond is your friend here.  Just test for the file in a cache/
directory and serve it.  If it's not there, fall through to the Catalyst
app and then save the file off in your end() action.
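Something along these lines (the cache/ path and backend handoff are illustrative, not Brian's actual config):

```apache
# Serve straight from cache/ if the file already exists on disk;
# otherwise fall through to the Catalyst app, which generates the
# response and writes the file out in its end() action.
RewriteEngine On
RewriteCond %{DOCUMENT_ROOT}/cache%{REQUEST_URI} -f
RewriteRule ^(.*)$ /cache/$1 [L]
# Requests that miss the cache continue to the application
# (via mod_fastcgi, mod_proxy, or however Catalyst is mounted).
```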




Re: [Catalyst] Hypothetical Site and Scalability Planning

2007-10-26 Thread J. Shirley
On 10/26/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:



 NFS gets a bad rap; as long as you do sane planning and lay it out
 properly, NFS works very, very well for serving static files to the
 webservers.  Breaking out to S3 seems silly (Amazon is out to make money
 with S3, and if you do it yourself you should be able to do it for less
 cost).  KISS works wonders as long as you think about usability.  Get a
 sysadmin to think out the NFS side realistically (masters with multiple
 read-only slaves, etc.).



NFS gets a deserved rap.  It isn't anybody knocking it; it is simply
not the best tool for the job, and it wasn't designed to be used in a way
that services millions of requests... Sure, it can do it, but it isn't
meant to do it.

And using S3 is a great way to scale out and save money -- look at SmugMug.
Sure, you could do it yourself, if you ignore the cost of a dedicated
sysadmin for your infrastructure.  If you're sufficiently talented, you can
easily manage the rest of your cluster, but storage is _ALWAYS_ a thorn when
you do it in-house.  You need expertise.  If you have to get a sysadmin just
because you want to use NFS, all the cost savings of doing it in-house just
blew up.

And if you do want to do it in-house, compare it to MogileFS -- MogileFS may
not give you the same performance numbers as NFS, but it has tracking and
replication built in; NFS lacks those and requires you to do a lot of things
on your own.

If you really know what you are doing, NFS can be great.  If you don't, NFS
will haunt you in your dreams.  If you have to hire someone to maintain your
NFS cluster, then I would recommend just paying the money to Amazon -- at
least Amazon can be written off as a business expense.

-J (who has nothing against NFS, just more in favor of other services)


-- 
J. Shirley :: [EMAIL PROTECTED] :: Killing two stones with one bird...
http://www.toeat.com


RE: [Catalyst] Hypothetical Site and Scalability Planning

2007-10-26 Thread Mesdaq, Ali
J,

Amazing feedback, this is great!

I think memcached is great. I haven't had time to play with it yet but I
have pretty much read everything and been prepped to play with it once I
have a chance.

I personally think that storing images in the DB is the best place to
start, because if other, better solutions become available later you can
very easily migrate. But if you start out with the filesystem, migration
is a little more kludgy in my opinion: you have to traverse directories
and copy/move/delete, or whatever you have to do for the migration.

We have been using MySQL on some pretty big internal projects here and
it's been working satisfactorily. However, there are issues with it that
make me not so confident in these big claims of large sites using it.
Mainly it's the scaling-out paradigm that is not very clear with MySQL.
We tried using replication with master/slaves and the replication speed
was way too slow. Then the whole clustering approach with MySQL
seems to be very confusing and not well documented, as far as I have
poked around. The only really solid scaling approaches I have seen with
MySQL are either using VMware to cluster hardware at the hardware/OS/VM
layer to make one big virtual machine, or using third-party
hardware/software bundles with MySQL, like the ones from NetApp or similar.
I wish clustering with MySQL were as simple as adding a node to the cluster
and gaining 0.7x the performance per machine.

Another very intriguing thing with super-large sites is the actual
schema design. You have to be very smart about design, data segregation,
indexes, etc. I don't know for sure, but I am pretty sure sites
like MySpace don't just have one huge users table with user_id, email,
sha1_password. I would imagine they have segregated users into separate
schemas, which would scale far better than MySQL replication or
clustering would. Something like every 10,000 users being allocated to a
new MySQL server.
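The routing side of that scheme is trivial; a sketch (hypothetical Python; the DSN template and host names are made up for illustration):

```python
# Sketch: fixed-size user partitioning -- every USERS_PER_SHARD numeric
# user ids live on their own database server.
USERS_PER_SHARD = 10_000

def shard_for(user_id):
    """Map a numeric user id to a shard number and an illustrative DSN."""
    shard = user_id // USERS_PER_SHARD
    dsn = f"DBI:mysql:database=users;host=db{shard}.internal"
    return shard, dsn
```

The hard parts are everything this sketch hides: cross-shard queries, rebalancing when a shard fills unevenly, and keeping a directory of which shard holds which user.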

Thanks,
--
Ali Mesdaq
Security Researcher II
Websense Security Labs
http://www.WebsenseSecurityLabs.com
--

-Original Message-
From: J. Shirley [mailto:[EMAIL PROTECTED] 
Sent: Friday, October 26, 2007 12:31 PM
To: The elegant MVC web framework
Subject: Re: [Catalyst] Hypothetical Site and Scalability Planning

On 10/26/07, Mesdaq, Ali [EMAIL PROTECTED] wrote:

Hey All, 

Just wanted to start a thread about scalability planning and
design. I was thinking we could take the approach of gathering people's
opinions, ideas, and best practices for large-scale sites, and use a
hypothetical site or an existing site as the model to plan for. Not
everything discussed needs to be Catalyst-only; it could be general web
server configs or something similar.

For example, how would you guys approach a project where you
needed to create a site like myspace.com or similar, with 0 current
users but which could surpass 1 million users in 1 month and 100 million
in 1 year? I am interested to see the opinions and designs people would
have to deal with that type of scalability. Even simple issues become
very complex at those numbers, like where and how to store photos.
Should they be stored on the filesystem, in the db, or on external
services like Akamai? What web server should be used? Apache? Should it
be the threaded version? How does that affect Catalyst and its modules;
are they all thread-safe, or is threaded Apache not even the way to go?


Here are my opinions on the matter:
1) Start out with memcached in place.  It scales well; use it.  Use
PageCache where you can.
2) Store images in something that is built for storing data, not files.
Storing images as files means you are stuck with some filesystem format
that binds you unnecessarily.  Things like S3, Akamai, or your own
homegrown MogileFS cluster give you an API into the data.  Granted, you
could do the same for NFS or whatever and just write a good
compatibility API, but you would largely be duplicating the work of the
previous tech.  If you use S3, set up your image servers to cache for a
long time (on disk).  Pull from S3, and store it for as long as you
reasonably can.  This is an area a lot of people get wrong and then get
stuck with costly migrations.
3) Use database replication strategies where you can.  In the F/OSS
world, MySQL is outshining PostgreSQL at this.  InnoDB removes a lot
of the complaints that folks have about MySQL, but there is always
evangelism against MySQL.  If it works for you, just take it in stride;
a LOT of high-traffic sites use MySQL, and you can usually get some
insight from them.  MySQL allows InnoDB on the master and MyISAM on the
slaves -- which gets you faster read times and tends not to block on
inserts as badly -- and then as you grow it is easier to grow into a
full-blown MySQL cluster... but at that point, you have enough money to
thoroughly explore every option available.
4) You'll have to tune Apache or whatever web