the issue: it looks like we may not be getting all of our log entries when when pull the logs from app engine.
first, a little context. there's a lot here, so bear with me. we need to record event lines for metrics. normally, we would write the lines to a local file on each app server and then pull those logs every few minutes from the metrics system. we found this to be the most stable and scalable architecture. however, in app engine land, we can't write to a file. so we wrote the event lines to the logs, set up a script to pull them in 10 minute intervals, and loaded them into the stats system. to be clear, the process goes like this: 1.) an event happens on the server that we'd like to record. we write a line to the log using logging.info(...) in python 2.) every 10 minutes, a job starts on a metrics server, which requests the next batch of logs by calling appcfg.py. the last log in the new batch is kept in a append file to use as the 'sentinel' for the next fetch. 3.) the new log file is parsed for event lines, which are written to another 'event' file. 4.) a few minutes later, another job grabs new event files and loads the events into the metrics system. everything seemed to work great. until we realized that we were missing events. a lot of them. between 20-50%. there are some events that need to be shared with other systems. for one if those event types, i was feeling lazy, so i just fired http hits at the other system as the event happen. at some point, we compared these numbers - and found them to be drastically different. i ran tests today comparing the number of events recorded 'through' the logs system and the same events recorded by http hit during runtime. the percent of 'missing' events ranged from 18-56%, and the percent missing appeared to be significantly higher when the frequency of events was higher (during peak). i've done a significant amount of work that points to the logs being missing by the point that appcfg.py records them. i've reasonably verified that all the event lines that appcfg.py pulls down make it into the metrics system. oh, and all the numbers are being run on unique user counts, so there's no way that the counts could be mistakenly large (accidentally reading an event twice does not produce a new unique user id). my questions / issues: 1.) should we be getting all of our logged lines from appcfg.py's request_logs command? is this a known behavior? recall that we are missing 20-50% of events - this is not a small discrepancy. 2.) we're pulling our logs every 10 minutes. seeing as the request_logs command lets you specify the time you want in days, i imagine this as more frequent than intended. could this be causing an issue? 3.) we switch major versions of the app every time we push, which can be several times each day. this doesn't make sense as an issue since the numbers are know to be wrong over periods where there have been no version changes, but i wanted to mention it. 4.) can you suggest a better solution for getting data to offline processing? right now we getting the correct numbers using the async http requests without ever calling get_result() or the like as a 'fire- and-forget' http hit (not even sure if we're supposed to use it like this, but seems to work). however, this approach has serious drawbacks: a.) http requests are very slow and expensive for something that does not need to happen immediately. b.) if the metrics system endpoint becomes unavailable, then, at best, the data gets lost. at worst the issue domino's back up into the http servers and takes the app down as well. (each http request has to timeout, which takes significantly longer, spiking the concurrent connections. this has screwed me multiple times. maybe you google guys mark an endpoint as down system-wide so that subsequent requests never attempt the connection, but we were never that smart). thanks in advance, neil. --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Google App Engine" group. To post to this group, send email to [email protected] To unsubscribe from this group, send email to [email protected] For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en -~----------~----~----~----~------~----~------~--~---
