I have been using a lockless memcache queue for something similar. It
is only suitable for data you can afford to lose some of, but it
scales quite well.

The implementation here is a little out of date:
http://www.redredred.com.au/memcache-lockless-queue-implementation/. I
plan to update with my current version soon.
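The core of the idea is just an atomic counter plus numbered keys. Here is a rough sketch (not the code from the link above, just the shape of it; a dict stands in for memcache, since incr() is the only operation that needs to be atomic):

```python
class FakeMemcache(object):
    """Stand-in for a memcache client; only incr() needs to be atomic."""
    def __init__(self):
        self.data = {}

    def incr(self, key, delta=1, initial_value=0):
        self.data[key] = self.data.get(key, initial_value) + delta
        return self.data[key]

    def set(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)  # None if missing or evicted


class LocklessQueue(object):
    """Lossy lockless queue: an atomic head counter hands each producer a
    unique slot, a tail counter hands each consumer a unique slot. If
    memcache evicts a slot before it is read, that item is simply lost -
    which is why this is only for data you can afford to drop."""
    def __init__(self, mc, name):
        self.mc, self.name = mc, name

    def push(self, item):
        slot = self.mc.incr(self.name + ':head')  # claim a unique slot
        self.mc.set('%s:%d' % (self.name, slot), item)

    def pop(self):
        head = self.mc.get(self.name + ':head') or 0
        tail = self.mc.get(self.name + ':tail') or 0
        if tail >= head:
            return None  # queue looks empty
        # Note: with multiple concurrent consumers this check-then-incr can
        # let tail overrun head; a real version has to detect and back off.
        slot = self.mc.incr(self.name + ':tail')
        return self.mc.get('%s:%d' % (self.name, slot))  # None if evicted
```

No locks anywhere: producers never contend with each other or with consumers, which is where the scaling comes from.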

j

On Oct 6, 11:00 am, neil souza <[email protected]> wrote:
> thanks nick, responses in line (hope they come through right, i don't
> really know how to use groups)
>
> On Sep 30, 2:57 am, "Nick Johnson (Google)" <[email protected]>
> wrote:
>
> > Hi Neil,
>
> > Sorry for the delay responding. Responses inline.
>
> > On Sat, Sep 26, 2009 at 1:40 PM, neil souza <[email protected]> wrote:
>
> > > the issue: it looks like we may not be getting all of our log entries
> > > when we pull the logs from app engine.
>
> > > first, a little context. there's a lot here, so bear with me.
>
> > > we need to record event lines for metrics. normally, we would write
> > > the lines to a local file on each app server and then pull those logs
> > > every few minutes from the metrics system. we found this to be the
> > > most stable and scalable architecture.
>
> > > however, in app engine land, we can't write to a file. so we wrote the
> > > event lines to the logs, set up a script to pull them in 10 minute
> > > intervals, and loaded them into the stats system.
>
> > > to be clear, the process goes like this:
>
> > > 1.) an event happens on the server that we'd like to record. we write
> > > a line to the log using logging.info(...) in python
>
> > > 2.) every 10 minutes, a job starts on a metrics server, which requests
> > > the next batch of logs by calling appcfg.py. the last log in the new
> > > batch is kept in an append file to use as the 'sentinel' for the next
> > > fetch.
>
> > > 3.) the new log file is parsed for event lines, which are written to
> > > another 'event' file.
>
> > > 4.) a few minutes later, another job grabs new event files and loads
> > > the events into the metrics system.
>
> > > everything seemed to work great. until we realized that we were
> > > missing events. a lot of them. between 20-50%.
>
> > > there are some events that need to be shared with other systems. for
> > > one of those event types, i was feeling lazy, so i just fired http
> > > hits at the other system as the events happened. at some point, we
> > > compared these numbers - and found them to be drastically different.
>
> > > i ran tests today comparing the number of events recorded 'through'
> > > the logs system and the same events recorded by http hit during
> > > runtime. the percent of 'missing' events ranged from 18-56%, and the
> > > percent missing appeared to be significantly higher when the frequency
> > > of events was higher (during peak).
>
> > > i've done a significant amount of work that points to the logs being
> > > missing by the point that appcfg.py records them. i've reasonably
> > > verified that all the event lines that appcfg.py pulls down make it
> > > into the metrics system. oh, and all the numbers are being run on
> > > unique user counts, so there's no way that the counts could be
> > > mistakenly large (accidentally reading an event twice does not produce
> > > a new unique user id).
>
> > > my questions / issues:
>
> > > 1.) should we be getting all of our logged lines from appcfg.py's
> > > request_logs command? is this a known behavior? recall that we are
> > > missing 20-50% of events - this is not a small discrepancy.
>
> > App Engine has a fixed amount of space available for logs; it's essentially
> > a circular buffer. When it runs out of space, it starts replacing older
> > logs.
>
> well, i'm going to guess that's the culprit.
>
> > > 2.) we're pulling our logs every 10 minutes. seeing as the
> > > request_logs command lets you specify the time you want in days, i
> > > imagine this as more frequent than intended. could this be causing an
> > > issue?
>
> > How much traffic are you getting? What's the size of 10 minutes' worth of
> > logs?
>
> we're at maybe avg. 200 requests / sec, and it looks like we're
> recording 1.25 events per request, so perhaps 250 log lines / sec?
> that's in addition to all the other junk getting spilled out there - i
> didn't know that space was limited, there's prob some debug output,
> then the exceptions, etc...
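(for scale, those figures work out to a lot of lines per pull - rough arithmetic, using the numbers above:)

```python
requests_per_sec = 200                                       # avg figure above
events_per_request = 1.25
event_lines_per_sec = requests_per_sec * events_per_request  # 250 lines / sec
lines_per_pull = event_lines_per_sec * 10 * 60               # per 10-minute window
```

that's 150,000 event lines per window before counting any debug output or exceptions, so a fixed-size circular buffer could plausibly wrap well inside 10 minutes.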
>
> > > 3.) we switch major versions of the app every time we push, which can
> > > be several times each day. this doesn't make sense as an issue since
> > > the numbers are known to be wrong over periods where there have been no
> > > version changes, but i wanted to mention it.
>
> > > 4.) can you suggest a better solution for getting data to offline
> > > processing? right now we're getting the correct numbers using the async
> > > http requests without ever calling get_result() or the like, as a 'fire-
> > > and-forget' http hit (not even sure if we're supposed to use it like
> > > this, but it seems to work). however, this approach has serious
> > > drawbacks:
>
> > You could log to the datastore, and read and delete old entries using
> > remote_api.
>
> this just doesn't seem like the right job for the datastore - we're
> only inserting at 250 events / sec right now, but need to be able to
> scale that. if we're inserting a few thousand events per second, and
> can only fetch or delete 1K at a time, that seems like a potential
> problem. we can batch up the events in each request, but that still
> limits the inserts per second to the app's requests per second, which
> can have the same issue. just doesn't sound fun.
>
> maybe http requests are the best way to do it for now, unless we can
> get the log space significantly expanded or find another solution.
>
> > > a.) http requests are very slow and expensive for something that does
> > > not need to happen immediately.
>
> > > b.) if the metrics system endpoint becomes unavailable, then, at best,
> > > the data gets lost. at worst, the issue dominoes back up into the http
> > > servers and takes the app down as well. (each http request has to
> > > time out, which takes significantly longer, spiking the concurrent
> > > connections. this has screwed me multiple times. maybe you google guys
> > > mark an endpoint as down system-wide so that subsequent requests never
> > > attempt the connection, but we were never that smart.)
>
> > > thanks in advance, neil.
>
> > --
> > Nick Johnson, Developer Programs Engineer, App Engine
> > Google Ireland Ltd. :: Registered in Dublin, Ireland, Registration Number:
> > 368047
>
>
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en
-~----------~----~----~----~------~----~------~--~---