Hi Neil, Sorry for the delay responding. Responses inline.
On Sat, Sep 26, 2009 at 1:40 PM, neil souza <[email protected]> wrote: > > the issue: it looks like we may not be getting all of our log entries > when when pull the logs from app engine. > > first, a little context. there's a lot here, so bear with me. > > we need to record event lines for metrics. normally, we would write > the lines to a local file on each app server and then pull those logs > every few minutes from the metrics system. we found this to be the > most stable and scalable architecture. > > however, in app engine land, we can't write to a file. so we wrote the > event lines to the logs, set up a script to pull them in 10 minute > intervals, and loaded them into the stats system. > > to be clear, the process goes like this: > > 1.) an event happens on the server that we'd like to record. we write > a line to the log using logging.info(...) in python > > 2.) every 10 minutes, a job starts on a metrics server, which requests > the next batch of logs by calling appcfg.py. the last log in the new > batch is kept in a append file to use as the 'sentinel' for the next > fetch. > > 3.) the new log file is parsed for event lines, which are written to > another 'event' file. > > 4.) a few minutes later, another job grabs new event files and loads > the events into the metrics system. > > everything seemed to work great. until we realized that we were > missing events. a lot of them. between 20-50%. > > there are some events that need to be shared with other systems. for > one if those event types, i was feeling lazy, so i just fired http > hits at the other system as the event happen. at some point, we > compared these numbers - and found them to be drastically different. > > i ran tests today comparing the number of events recorded 'through' > the logs system and the same events recorded by http hit during > runtime. the percent of 'missing' events ranged from 18-56%, and the > percent missing appeared to be significantly higher when the frequency > of events was higher (during peak). > > i've done a significant amount of work that points to the logs being > missing by the point that appcfg.py records them. i've reasonably > verified that all the event lines that appcfg.py pulls down make it > into the metrics system. oh, and all the numbers are being run on > unique user counts, so there's no way that the counts could be > mistakenly large (accidentally reading an event twice does not produce > a new unique user id). > > my questions / issues: > > 1.) should we be getting all of our logged lines from appcfg.py's > request_logs command? is this a known behavior? recall that we are > missing 20-50% of events - this is not a small discrepancy. > App Engine has a fixed amount of space available for logs; it's essentially a circular buffer. When it runs out of space, it starts replacing older logs. > > 2.) we're pulling our logs every 10 minutes. seeing as the > request_logs command lets you specify the time you want in days, i > imagine this as more frequent than intended. could this be causing an > issue? > How much traffic are you getting? What's the size of 10 minutes' worth of logs? > > 3.) we switch major versions of the app every time we push, which can > be several times each day. this doesn't make sense as an issue since > the numbers are know to be wrong over periods where there have been no > version changes, but i wanted to mention it. > > 4.) can you suggest a better solution for getting data to offline > processing? right now we getting the correct numbers using the async > http requests without ever calling get_result() or the like as a 'fire- > and-forget' http hit (not even sure if we're supposed to use it like > this, but seems to work). however, this approach has serious > drawbacks: > You could log to the datastore, and read and delete old entries using remote_api. > > a.) http requests are very slow and expensive for something that does > not need to happen immediately. > > b.) if the metrics system endpoint becomes unavailable, then, at best, > the data gets lost. at worst the issue domino's back up into the http > servers and takes the app down as well. (each http request has to > timeout, which takes significantly longer, spiking the concurrent > connections. this has screwed me multiple times. maybe you google guys > mark an endpoint as down system-wide so that subsequent requests never > attempt the connection, but we were never that smart). > > thanks in advance, neil. > > > -- Nick Johnson, Developer Programs Engineer, App Engine Google Ireland Ltd. :: Registered in Dublin, Ireland, Registration Number: 368047 --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Google App Engine" group. To post to this group, send email to [email protected] To unsubscribe from this group, send email to [email protected] For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en -~----------~----~----~----~------~----~------~--~---
