...whoops, not the "logger" module, the "logging" module. I write a lot of Ruby too. Remembering which direction my stack traces go, and whether it's logger or logging, is something I will probably never manage.
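For reference, the Python standard library module is `logging`, and the objects it hands out are `Logger` instances (which is probably where the logger/logging confusion comes from):

```python
import logging

# The module is "logging"; what you get back from it is a Logger.
log = logging.getLogger(__name__)
print(isinstance(log, logging.Logger))  # True
```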
On Oct 8, 10:36 pm, Jason Smith <[email protected]> wrote:
> (What is the etiquette for replying to mixed top-post/bottom-post
> threads? Google Wave to the rescue!?!)
>
> I am also suffering from the logging circular buffer. Part of me feels
> like if a customer is paying triple-digit dollars per day to Google he
> should get more than a few gigs of logs! But to discuss a technical
> solution...
>
> Part a suggestion from my experience, and part a request from any
> Googlers: What I did was to subclass RequestHandler into
> LoggingRequestHandler (the site is a webapp app) which would create
> its own logger object and set the default log level to warn. Next, all
> logging happens through that logger instead of the root one in the
> logging module. Now I have no debug or info logs because they will be
> re-enabled only under special circumstances. To re-activate them, the
> LoggingRequestHandler pulls a very simple serialized dict from
> basically a singleton entity in the DB (or memcache), of the following
> format:
>
> {
>     "Name 1 (usually an issue id from our tracker)": [
>         ["HandlerClass1", "http_method", "get_or_post_var", "trigger value"],
>         ["HandlerClass2", "http_method", "get_or_post_var", "trigger value"]
>     ],
>
>     # More concrete example below
>     "Issue 12345": [
>         ["GetUsers", "get", "username", "[email protected]"],
>         ["PostUser", "post", "username", "[email protected]"]
>     ]
> }
>
> LoggingRequestHandler wraps self.get(), self.put(), etc., first
> checking whether its own name, the HTTP method, query parameters and
> values match those listed in the data structure. Those strings are
> treated as regexes, and there are some aliased "variables" like
> "__url__" which allow me to match against the entire URL. If any of
> the conditions match, the logging object's level is set to DEBUG.
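[Editor's note: a minimal sketch of the trigger-matching logic Jason describes, not his actual code. `should_enable_debug` and the `user@example.com` address are hypothetical (addresses in the thread are redacted); the real version wraps a webapp RequestHandler and loads the dict from the datastore or memcache.]

```python
import logging
import re

# Hypothetical trigger table in the format described above:
# issue name -> list of [handler_class, http_method, param_name, value] rules,
# where every field is treated as a regex.
TRIGGERS = {
    "Issue 12345": [
        ["GetUsers", "get", "username", r"user@example\.com"],
        ["PostUser", "post", "username", r"user@example\.com"],
    ],
}

def should_enable_debug(triggers, handler_name, method, params):
    """Return True if any rule matches this request's handler class name,
    HTTP method, and named request parameter value."""
    for rules in triggers.values():
        for cls_re, method_re, param, value_re in rules:
            if not re.match(cls_re, handler_name):
                continue
            if not re.match(method_re, method):
                continue
            value = params.get(param)
            if value is not None and re.match(value_re, value):
                return True
    return False

# The handler would run this check before dispatching each request:
logger = logging.getLogger("app")
logger.setLevel(logging.WARN)  # default: no debug/info logs
if should_enable_debug(TRIGGERS, "GetUsers", "get",
                       {"username": "user@example.com"}):
    logger.setLevel(logging.DEBUG)  # re-activated for matching requests only
```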
> Performance impact is negligible, and the main problem is that we now
> have to make a round-trip to users to tell them "logging is activated"
> and to reproduce their problem.
>
> Now to my question. I am suspicious that all the work I did to
> implement this is for nothing, because our logs are still rotating out
> into the nothing after only a day. Is this because the request volume
> is still so high (80 requests/sec, the vast majority with no logs)?
> Back in the bad old days we were logging over 1KB of data per request
> and I thought it was reasonable; but my expectation of having logs
> (all of which I'm interested in, by the way) linger on for many days
> turned out to be totally wrong. If anybody has any advice about this I
> would love to hear it.
>
> In the meantime, I think I will have to do what I probably should have
> done from the start: regularly fetch the logs to a separate system and
> search through them there.
>
> On Oct 6, 4:47 pm, "Nick Johnson (Google)" <[email protected]> wrote:
> >
> > Hi Neil,
> >
> > On Tue, Oct 6, 2009 at 1:00 AM, neil souza <[email protected]> wrote:
> >
> > > thanks nick, responses in line (hope they come through right, i don't
> > > really know how to use groups)
> > >
> > > On Sep 30, 2:57 am, "Nick Johnson (Google)" <[email protected]> wrote:
> > > >
> > > > Hi Neil,
> > > >
> > > > Sorry for the delay responding. Responses inline.
> > > >
> > > > On Sat, Sep 26, 2009 at 1:40 PM, neil souza <[email protected]> wrote:
> > > >
> > > > > the issue: it looks like we may not be getting all of our log entries
> > > > > when we pull the logs from app engine.
> > > > >
> > > > > first, a little context. there's a lot here, so bear with me.
> > > > >
> > > > > we need to record event lines for metrics. normally, we would write
> > > > > the lines to a local file on each app server and then pull those logs
> > > > > every few minutes from the metrics system. we found this to be the
> > > > > most stable and scalable architecture.
> > > > > however, in app engine land, we can't write to a file. so we wrote the
> > > > > event lines to the logs, set up a script to pull them in 10 minute
> > > > > intervals, and loaded them into the stats system.
> > > > >
> > > > > to be clear, the process goes like this:
> > > > >
> > > > > 1.) an event happens on the server that we'd like to record. we write
> > > > > a line to the log using logging.info(...) in python
> > > > >
> > > > > 2.) every 10 minutes, a job starts on a metrics server, which requests
> > > > > the next batch of logs by calling appcfg.py. the last log in the new
> > > > > batch is kept in an append file to use as the 'sentinel' for the next
> > > > > fetch.
> > > > >
> > > > > 3.) the new log file is parsed for event lines, which are written to
> > > > > another 'event' file.
> > > > >
> > > > > 4.) a few minutes later, another job grabs new event files and loads
> > > > > the events into the metrics system.
> > > > >
> > > > > everything seemed to work great. until we realized that we were
> > > > > missing events. a lot of them. between 20-50%.
> > > > >
> > > > > there are some events that need to be shared with other systems. for
> > > > > one of those event types, i was feeling lazy, so i just fired http
> > > > > hits at the other system as the events happened. at some point, we
> > > > > compared these numbers - and found them to be drastically different.
> > > > >
> > > > > i ran tests today comparing the number of events recorded 'through'
> > > > > the logs system and the same events recorded by http hit during
> > > > > runtime. the percent of 'missing' events ranged from 18-56%, and the
> > > > > percent missing appeared to be significantly higher when the frequency
> > > > > of events was higher (during peak).
> > > > >
> > > > > i've done a significant amount of work that points to the logs already
> > > > > being missing by the point that appcfg.py records them. i've reasonably
> > > > > verified that all the event lines that appcfg.py pulls down make it
> > > > > into the metrics system.
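[Editor's note: the sentinel bookkeeping in step 2 might look like the sketch below. `lines_after_sentinel` is a hypothetical helper, not code from the thread; it also shows why a rotated-out sentinel silently drops lines, which is the failure mode discussed next.]

```python
def lines_after_sentinel(batch, sentinel):
    """Keep only the log lines newer than the previous fetch's sentinel.

    `batch` is the list of lines just pulled via appcfg.py (oldest first);
    `sentinel` is the last line of the previous fetch, or None on the first
    run. If the sentinel has already rotated out of App Engine's circular
    log buffer, it won't be found, and we keep the whole batch -- exactly
    the situation in which lines in between have been lost for good.
    """
    if sentinel is None:
        return batch
    try:
        # Index of the sentinel's *last* occurrence in the batch.
        idx = len(batch) - 1 - batch[::-1].index(sentinel)
    except ValueError:
        return batch  # sentinel rotated out: possible gap in the logs
    return batch[idx + 1:]
```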
> > > > > oh, and all the numbers are being run on
> > > > > unique user counts, so there's no way that the counts could be
> > > > > mistakenly large (accidentally reading an event twice does not produce
> > > > > a new unique user id).
> > > > >
> > > > > my questions / issues:
> > > > >
> > > > > 1.) should we be getting all of our logged lines from appcfg.py's
> > > > > request_logs command? is this a known behavior? recall that we are
> > > > > missing 20-50% of events - this is not a small discrepancy.
> > > >
> > > > App Engine has a fixed amount of space available for logs; it's
> > > > essentially a circular buffer. When it runs out of space, it starts
> > > > replacing older logs.
> > >
> > > well, i'm going to guess that's the culprit.
> > >
> > > > > 2.) we're pulling our logs every 10 minutes. seeing as the
> > > > > request_logs command lets you specify the time you want in days, i
> > > > > imagine this is more frequent than intended. could this be causing an
> > > > > issue?
> > > >
> > > > How much traffic are you getting? What's the size of 10 minutes' worth
> > > > of logs?
> > >
> > > we're at maybe avg. 200 requests/sec and it looks like we're recording
> > > 1.25 events per request, so perhaps 250 log lines / sec? that's in
> > > addition to all the other junk getting spilled out there - i didn't know
> > > that space was limited, there's prob some debug output, then the
> > > exceptions, etc...
> >
> > There's a limit on the total amount of logs stored - the logs are
> > effectively stored in a circular buffer, so old data is overwritten as
> > needed to make space for new data.
> >
> > > > > 3.) we switch major versions of the app every time we push, which can
> > > > > be several times each day. this doesn't make sense as an issue since
> > > > > the numbers are known to be wrong over periods where there have been no
> > > > > version changes, but i wanted to mention it.
> > > > >
> > > > > 4.) can you suggest a better solution for getting data to offline
> > > > > processing?
> > > > > right now we're getting the correct numbers using the async
> > > > > http requests without ever calling get_result() or the like as a
> > > > > 'fire-and-forget' http hit (not even sure if we're supposed to use it
> > > > > like this, but it seems to work). however, this approach has serious
> > > > > drawbacks:
> > > >
> > > > You could log to the datastore, and read and delete old entries using
> > > > remote_api.
> > >
> > > this just doesn't seem like the right job for the datastore - we're
> > > only inserting at 250 events / sec right now, but need to be able to
> > > scale that. if we're inserting a few thousand events per second, and
> > > can only fetch or delete 1K at a time, that seems like a potential
> > > problem. we can batch up the events in each request, but that still
> > > only limits the inserts per second to the app's requests per second,
> > > which can have the same issue. just doesn't sound fun.
> > >
> > > maybe http requests are the best way to do it for now, unless we can
> > > get the log space significantly expanded or find another solution.
> >
> > If you're doing 250 events a second, HTTP requests are going to be equally
> > problematic - you'll be doing 250 of _those_ per second. 1000 events is
> > still 4 seconds' worth - so instead of doing 250 QPS of outgoing traffic,
> > you could be doing only 1/4 of a QPS of incoming traffic!
> >
> > -Nick
> >
> > > > > a.) http requests are very slow and expensive for something that does
> > > > > not need to happen immediately.
> > > > >
> > > > > b.) if the metrics system endpoint becomes unavailable, then, at best,
> > > > > the data gets lost. at worst the issue dominoes back up into the http
> > > > > servers and takes the app down as well. (each http request has to
> > > > > time out, which takes significantly longer, spiking the concurrent
> > > > > connections. this has screwed me multiple times.
> > > > > maybe you google guys
> > > > > mark an endpoint as down system-wide so that subsequent requests never
> > > > > attempt the connection, but we were never that smart).
> > > > >
> > > > > thanks in advance, neil.
> > > >
> > > > --
> > > > Nick Johnson, Developer Programs Engineer, App Engine
> > > > Google Ireland Ltd. :: Registered in Dublin, Ireland, Registration
> > > > Number: 368047

--
You received this message because you are subscribed to the Google Groups
"Google App Engine" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at
http://groups.google.com/group/google-appengine?hl=en
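[Editor's note: Nick's arithmetic (250 events/sec, fetched and deleted in batches of 1000 via remote_api, works out to 0.25 incoming QPS) can be sketched as below. `drain`, `fetch_batch`, and `delete_batch` are hypothetical stand-ins; with remote_api they would wrap actual datastore query and delete calls.]

```python
# Back-of-the-envelope for the batching suggestion.
EVENTS_PER_SEC = 250   # event rate quoted in the thread
BATCH_SIZE = 1000      # entities fetched/deleted per remote_api round trip

seconds_per_batch = BATCH_SIZE / float(EVENTS_PER_SEC)  # seconds of events per batch
fetch_qps = 1.0 / seconds_per_batch                     # incoming QPS needed to keep up

def drain(fetch_batch, delete_batch, process):
    """Pull-and-delete loop: repeatedly fetch the oldest batch of logged
    events, hand it to the metrics pipeline, then delete it from the store.
    Returns the total number of events processed."""
    total = 0
    while True:
        events = fetch_batch()
        if not events:
            return total
        process(events)
        delete_batch(events)  # only delete after processing succeeds
        total += len(events)
```

At 250 events/sec a 1000-entity batch covers 4 seconds of traffic, so the metrics system only has to make one fetch every 4 seconds instead of receiving 250 fire-and-forget hits per second.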
