> This thread has been sitting in my inbox waiting for me to have time to read 
> it.  I saved it specifically because I am in this boat.  I appreciate all who 
> have contributed on both sides.
>
> I would like to suggest that I think this would be an amazing topic for a 
> PLUG meeting, or even a series of meetings.
>
> I am currently working on a (potentially very large) application project 
> which is currently being hosted in "the cloud" for sake of cost, with the 
> intention that in the possibly near future we would look to a more 
> traditional datacenter solution.  However, the concept of using a different 
> mindset to make the cloud actually be the more viable solution intrigues me.  
> I have heard reports of resource latency affecting large applications which 
> make the cloud a potentially bad place to be.  But if there are in fact ways 
> to mitigate limitations that may simply be caused by using traditional 
> mindsets, I want to know more.
>
> Thoughts?
>
> Jonathan


Well, let's look at a quick case study.
Our first significant "cloud" based project is www.lccatalog.com
It's an ecommerce site that we decided couldn't tolerate downtime,
since downtime translates quite literally into lost sales and lost
reputation.

In a traditional small ecommerce app you have a webserver and a
database server running on the same machine.  This gives you a single
point of failure since if the box fails you have no website and no DB.

For a larger install you might decide to split the DB server off onto
a separate physical box so that the webserver could be dedicated
solely to serving up web pages.  Doing that, though, increases latency
since you need to use a remote DB connection.  You have also
introduced two distinct points of failure: your webserver could go
down, or your DB server could go down.  If either goes down, your
customer will see a broken website.
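To put rough numbers on that (hypothetical figures, just to illustrate the reasoning): if each box is independently up 99% of the time, a site that needs both boxes up is only up about 98% of the time, so splitting tiers actually lowers availability unless you add redundancy.

```java
// Back-of-the-envelope availability math for a serial chain of components.
// The 0.99 uptime figures are made up for illustration.
public class Availability {
    // Probability that every component in the chain is up at once,
    // assuming independent failures.
    public static double serial(double... uptimes) {
        double p = 1.0;
        for (double u : uptimes) p *= u;
        return p;
    }

    public static void main(String[] args) {
        System.out.printf("one box:   %.4f%n", serial(0.99));        // 0.9900
        System.out.printf("two boxes: %.4f%n", serial(0.99, 0.99));  // 0.9801
    }
}
```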

To address this issue we decided not to use a DB server at all and
instead structured the ecommerce application to use AWS SimpleDB,
which is a schemaless NoSQL DB that is essentially a large hashmap.
However, it does present itself as a traditional DB; calls are
executed on the backend via HTTP GET/POST instead of a more
traditional ODBC/JDBC setup.  (Our app is in Java and runs on Tomcat.)

We could have chosen Amazon RDS, but felt that SimpleDB was a better
choice since we don't need all the features of a full-on relational
DB, such as multi-table queries, joins, and stored procedures.

Going this route allowed us to get rid of the DB as a single point of
failure at the cost of having much slower DB access in some instances.

We worked around that by implementing a sort of memcache system in the
application.

Basically, any select request is first checked against a
recent-results cache (in our case anything less than 24 hours old is
considered recent).  If there is already a result in the cache we
return the cached result.  This is lightning fast.  If, however, a
result was not cached or has expired, then we query SimpleDB for the
object, which generally has a turnaround time of less than 50 ms for
most result sets; larger result sets (such as exporting the entire
product DB) may take as much as 5 seconds.
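The cache itself doesn't have to be fancy.  Here's a minimal sketch of the idea (the class name and shape are my own for illustration — our actual code isn't shown here): a map of entries that expire after a fixed time-to-live, with a null return telling the caller to fall through to SimpleDB and re-cache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of a "recent results" cache with a fixed TTL.
// The site uses a 24h TTL; the constructor takes it in milliseconds.
public class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAt;
        Entry(V value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public TtlCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    public void put(K key, V value) {
        map.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    // Returns the cached value, or null if absent or expired.  On null,
    // the caller queries SimpleDB (the slow path) and calls put() again.
    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() > e.expiresAt) {
            map.remove(key);
            return null;
        }
        return e.value;
    }
}
```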

Something to note about SimpleDB and other schemaless solutions such
as App Engine's Datastore is that there is an implicit limit on result
set size.  If you reach that limit it returns the results gathered thus
far along with a "Next" token; you then resubmit the query with that
token, repeating until no more "Next" tokens are present.  This
changes the way you think about queries.  Also, storing multiple
objects in the DB requires using a batch process, which in the case of
SimpleDB seems to have a limit of 25 objects at a go.  This made
writing the product import and editing tools a rather interesting
process.
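Both patterns boil down to small loops.  A sketch of each, with hypothetical names standing in for the real AWS SDK calls:

```java
import java.util.ArrayList;
import java.util.List;

// Sketches of the two SimpleDB-style access patterns described above.
// Page and PagedSource are stand-ins for the real SDK types.
public class SimpleDbPatterns {
    // One page of results plus the continuation token (null when done).
    public record Page<T>(List<T> items, String nextToken) {}

    // Stand-in for a select call that accepts a "Next" token.
    public interface PagedSource<T> { Page<T> fetch(String nextToken); }

    // Keep resubmitting with the returned "Next" token until none remains.
    public static <T> List<T> fetchAll(PagedSource<T> src) {
        List<T> all = new ArrayList<>();
        String token = null;
        do {
            Page<T> page = src.fetch(token);
            all.addAll(page.items());
            token = page.nextToken();
        } while (token != null);
        return all;
    }

    // Batch writes top out around 25 items per call, so bulk imports
    // (like the product importer) have to be chunked.
    public static <T> List<List<T>> batches(List<T> items, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            out.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return out;
    }
}
```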

Anyway, building the application this way gives us what I consider
great performance, but we took it a step further.  We use Amazon's
CloudFront and an S3 bucket to serve all static media files.  This
keeps our WAR file smaller and allows us to serve static media
content such as our CSS, JavaScript, and of course images from a
server sitting at an edge location close to the user making the
request.  Again performance is increased because images are already
physically closer to the end-user, and if we need to change an image
we only have to change it in the S3 bucket instead of uploading a new
WAR and telling all running instances to use that new archive for
image data.

Finally, we used Amazon's Elastic Beanstalk for deployment.  We run a
minimum of 2 instances using the "ANY2" method, which keeps our
instances in distinctly different physical locations.  That way, if
one datacenter goes belly up we still have a server running somewhere
else, and EB handles firing up additional instances during peak load
times and shutting the extra instances down when the load subsides.
It does so seamlessly; we don't have to do anything different to
auto-scale.

Our total AWS infrastructure bill for the month of November was $25.00
USD.  However, we are still under the "free trial period", and the
calculator shows that similar usage outside of the free trial will run
us about $60 per month.  I consider that rather reasonable for what we
are getting.  The trade-off is that we had to write the entire
ecommerce app from scratch (although a month after we started down the
"from scratch" road we learned about SimpleJPA, which would have
allowed us to take an existing Java EE ecommerce app and simply modify
it to taste; oh well).


That's the good news.  The bad news is that SimpleDB defaults to
"eventually consistent" reads, and as far as I can tell there is
nothing in the Java API that would allow us to override that.  This
has directly been the cause of a significant bug: if a customer views
their cart (which is a session variable) and then moves to the
checkout screen, about 1 in 10 times their checkout screen will show a
$0.00 total.  When a customer moves from the view-cart screen to the
checkout screen, the session cart is inserted into the DB as a "cart"
object; the checkout screen then queries the DB for a cart object tied
to the customer's session ID and treats the DB cart results as
canonical (because checkout may well happen on a different instance).
If that read lands before the write has propagated, the cart comes
back empty.

Fortunately the consistency corrects itself in a second or two, so by
the time the customer actually inputs credit card information and hits
submit, we process the correct total to the credit card.  We also have
an auditor (usually the call center manager) sanity-check each
transaction before sending a ship-product order to fulfillment, so
thus far no one has received any unnecessarily free products, at least
not to my knowledge :-)

But still, bugs like this are disconcerting and will require a
"rethink" of the way we handle this section of the order process.

Most times, though, the customer sees the $0 total and either hits the
back button or the refresh button, or calls customer service (who tell
them to hit refresh).

We have also introduced a 2-second sleep into the redirect from cart
to checkout that has all but eliminated this problem.  Still, it feels
like a hack and I'm not happy with it.  But it does work correctly 90%
of the time.
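One less hacky alternative we could move to (a sketch of the general technique, not our actual code): instead of a fixed sleep, poll the DB with a short bounded retry until the cart row shows up, then render checkout.  The Supplier below stands in for the "fetch cart by session ID" SimpleDB query.

```java
import java.util.function.Supplier;

// Bounded polling for an eventually-consistent read: retry the lookup a
// few times with a short pause instead of sleeping a fixed 2 seconds.
public class ConsistentRead {
    // Returns the first non-null result, or null after maxAttempts tries.
    // On null the caller can fall back to the session copy of the cart.
    public static <T> T pollUntilPresent(Supplier<T> query, int maxAttempts,
                                         long pauseMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            T result = query.get();
            if (result != null) return result;
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
        return null;
    }
}
```

The nice property over a flat sleep is that the common case (data already consistent) pays almost no latency, while the slow case gets more total time to converge.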

In a nutshell: thinking differently from the beginning allows you to
create rock-solid, fast, and scalable applications, but it requires a
completely different mindset than what you may be used to.  You have
to start thinking of the entire hardware stack as a fungible asset
that may be replaced at any moment with another completely fungible
asset.

The only guarantee you really get is that if your code runs on one
box, it will run the same (or possibly better), on any of the other
boxes that may replace the one you started with.

You also have a situation where data may at any time be in a transient
state, so a "measure twice, cut once" mindset is required when
thinking about your application and its data.

Finally, for us greybeards: direct SSH administration of a box, while
possible, is not advisable, since changes don't necessarily make the
transition from one box to another.

By using an S3-backed EBS volume and mounting it on each instance as a
datastore for configuration data, you can negate that somewhat, but I
haven't found a way to do this seamlessly under Elastic Beanstalk.

Furthermore, I have had EBS volumes suddenly become unusable or
disappear completely, leaving me to rebuild them from backups.

I have also had certain individual media files disappear from S3 but
at least S3 is smart enough to let me know about it automatically (via
an email alert), so I can get it back up as soon as possible.  This
happened to our company logo once during a peak load time and while it
was embarrassing, it wasn't fatal.

If you store all configuration in the WAR file, though, you negate
most of the above issues.  And S3 has been at least as reliable as,
if not more reliable than, many "server grade" drives we've used on
webservers in the past.

So those are my thoughts, anyone else care to chime in?

/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/
