I have been using a lockless memcache queue for something similar. It is only suitable for data where it's ok to lose some of it, but it scales quite well.
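The core idea of such a lossy lockless queue can be sketched as follows. This is not the linked implementation, just a minimal illustration of the general technique: writers and readers claim slots with an atomic `incr`, so no locks are needed, and an evicted slot simply loses its item. The `FakeMemcache` class is an assumed stand-in for a real memcache client (which would provide atomic `incr` server-side):

```python
# Sketch of a lockless (lossy) queue on top of a memcache-style client.
# FakeMemcache is a single-process stand-in; a real client's incr() is
# atomic on the server, which is what makes the scheme lock-free.
class FakeMemcache:
    def __init__(self):
        self.store = {}

    def incr(self, key, initial_value=0):
        # Real memcache performs this read-modify-write atomically.
        self.store[key] = self.store.get(key, initial_value) + 1
        return self.store[key]

    def set(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)

    def delete(self, key):
        self.store.pop(key, None)


def enqueue(mc, item):
    # Atomically claim the next write slot, then fill it.
    slot = mc.incr("q:tail", initial_value=0)
    mc.set("q:%d" % slot, item)


def dequeue(mc):
    # Claim the next read slot. If the slot's item was evicted (or the
    # writer hasn't finished yet), that item is simply lost -- which is
    # why this is only suitable for data you can afford to drop.
    tail = mc.get("q:tail") or 0
    head = mc.get("q:head") or 0
    if head >= tail:
        return None  # queue empty
    slot = mc.incr("q:head", initial_value=0)
    item = mc.get("q:%d" % slot)
    mc.delete("q:%d" % slot)
    return item
```

Because every mutation of the head and tail counters goes through `incr`, two concurrent writers can never claim the same slot; the trade-off is that there is no retry path when a slot comes back empty.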
My implementation here is a little out of date: http://www.redredred.com.au/memcache-lockless-queue-implementation/. I plan to update it with my current version soon.

j

On Oct 6, 11:00 am, neil souza <[email protected]> wrote:
> thanks nick, responses inline (hope they come through right, i don't
> really know how to use groups)
>
> On Sep 30, 2:57 am, "Nick Johnson (Google)" <[email protected]> wrote:
> > Hi Neil,
> >
> > Sorry for the delay responding. Responses inline.
> >
> > On Sat, Sep 26, 2009 at 1:40 PM, neil souza <[email protected]> wrote:
> > > the issue: it looks like we may not be getting all of our log entries
> > > when we pull the logs from app engine.
> > >
> > > first, a little context. there's a lot here, so bear with me.
> > >
> > > we need to record event lines for metrics. normally, we would write
> > > the lines to a local file on each app server and then pull those logs
> > > every few minutes from the metrics system. we found this to be the
> > > most stable and scalable architecture.
> > >
> > > however, in app engine land, we can't write to a file. so we wrote the
> > > event lines to the logs, set up a script to pull them in 10 minute
> > > intervals, and loaded them into the stats system.
> > >
> > > to be clear, the process goes like this:
> > >
> > > 1.) an event happens on the server that we'd like to record. we write
> > > a line to the log using logging.info(...) in python.
> > >
> > > 2.) every 10 minutes, a job starts on a metrics server, which requests
> > > the next batch of logs by calling appcfg.py. the last log line in the
> > > new batch is kept in an append file to use as the 'sentinel' for the
> > > next fetch.
> > >
> > > 3.) the new log file is parsed for event lines, which are written to
> > > another 'event' file.
> > >
> > > 4.) a few minutes later, another job grabs new event files and loads
> > > the events into the metrics system.
> > >
> > > everything seemed to work great. until we realized that we were
> > > missing events. a lot of them. between 20-50%.
> > >
> > > there are some events that need to be shared with other systems. for
> > > one of those event types, i was feeling lazy, so i just fired http
> > > hits at the other system as the events happened. at some point, we
> > > compared these numbers - and found them to be drastically different.
> > >
> > > i ran tests today comparing the number of events recorded 'through'
> > > the logs system and the same events recorded by http hit during
> > > runtime. the percent of 'missing' events ranged from 18-56%, and the
> > > percent missing appeared to be significantly higher when the frequency
> > > of events was higher (during peak).
> > >
> > > i've done a significant amount of work that points to the logs already
> > > being missing by the point that appcfg.py records them. i've reasonably
> > > verified that all the event lines that appcfg.py pulls down make it
> > > into the metrics system. oh, and all the numbers are being run on
> > > unique user counts, so there's no way that the counts could be
> > > mistakenly large (accidentally reading an event twice does not produce
> > > a new unique user id).
> > >
> > > my questions / issues:
> > >
> > > 1.) should we be getting all of our logged lines from appcfg.py's
> > > request_logs command? is this a known behavior? recall that we are
> > > missing 20-50% of events - this is not a small discrepancy.
> >
> > App Engine has a fixed amount of space available for logs; it's
> > essentially a circular buffer. When it runs out of space, it starts
> > replacing older logs.
>
> well, i'm going to guess that's the culprit.
>
> > > 2.) we're pulling our logs every 10 minutes. seeing as the
> > > request_logs command lets you specify the time you want in days, i
> > > imagine this is more frequent than intended. could this be causing an
> > > issue?
> >
> > How much traffic are you getting? What's the size of 10 minutes' worth
> > of logs?
>
> we're at maybe avg. 200 requests/sec, and it looks like we're recording
> 1.25 events per request, so perhaps 250 log lines/sec? that's in
> addition to all the other junk getting spilled out there - i didn't
> know that space was limited. there's prob some debug output, then the
> exceptions, etc...
>
> > > 3.) we switch major versions of the app every time we push, which can
> > > be several times each day. this doesn't make sense as an issue, since
> > > the numbers are known to be wrong over periods where there have been
> > > no version changes, but i wanted to mention it.
> > >
> > > 4.) can you suggest a better solution for getting data to offline
> > > processing? right now we're getting the correct numbers using the
> > > async http requests without ever calling get_result() or the like, as
> > > a 'fire-and-forget' http hit (not even sure if we're supposed to use
> > > it like this, but it seems to work). however, this approach has
> > > serious drawbacks:
> >
> > You could log to the datastore, and read and delete old entries using
> > remote_api.
>
> this just doesn't seem like the right job for the datastore - we're
> only inserting at 250 events/sec right now, but need to be able to
> scale that. if we're inserting a few thousand events per second, and
> can only fetch or delete 1K at a time, that seems like a potential
> problem. we can batch up the events in each request, but that still
> only limits the inserts per second to the app's requests per second,
> which can have the same issue. just doesn't sound fun.
>
> maybe http requests are the best way to do it for now, unless we can
> get the log space significantly expanded or find another solution.
>
> > > a.) http requests are very slow and expensive for something that does
> > > not need to happen immediately.
> > >
> > > b.) if the metrics system endpoint becomes unavailable, then, at
> > > best, the data gets lost. at worst, the issue dominoes back up into
> > > the http servers and takes the app down as well. (each http request
> > > has to time out, which takes significantly longer, spiking the
> > > concurrent connections. this has screwed me multiple times. maybe you
> > > google guys mark an endpoint as down system-wide so that subsequent
> > > requests never attempt the connection, but we were never that smart.)
> > >
> > > thanks in advance, neil.
> >
> > --
> > Nick Johnson, Developer Programs Engineer, App Engine
> > Google Ireland Ltd. :: Registered in Dublin, Ireland, Registration
> > Number: 368047
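The sentinel mechanism described in step 2 of the thread can be sketched as below. This is an illustration, not the original scripts: the function names are made up, and it assumes each batch is a list of log lines in order. It also shows why the circular log buffer Nick describes produces silent gaps - if the buffer wraps before the next fetch, the sentinel line is gone and there is no way to tell how many lines were lost:

```python
def new_lines(batch, sentinel):
    """Return the lines that come after the sentinel line.

    If the sentinel is missing from the batch (the circular log buffer
    has already evicted it), every fetched line is treated as new, but
    an unknown number of lines between the two fetches has been lost.
    """
    if sentinel is None:
        return list(batch)  # first fetch: everything is new
    try:
        idx = batch.index(sentinel)
        return batch[idx + 1:]
    except ValueError:
        # Sentinel evicted: silent gap between this batch and the last.
        return list(batch)


def process(batch, sentinel):
    """Extract new lines from a fetched batch and advance the sentinel."""
    lines = new_lines(batch, sentinel)
    new_sentinel = batch[-1] if batch else sentinel
    return lines, new_sentinel
```

At 250 log lines/sec, ten minutes of logs is roughly 150,000 lines; once that exceeds the fixed buffer size, the sentinel falls out of the buffer between fetches, which is consistent with the 20-50% loss reported above.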
