...whoops, not the "logger" module, the "logging" module. I write a lot of Ruby too. Remembering which direction my stack traces go, and whether it's logger or logging, is something I will probably never manage.
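For reference, the Python standard library module is `logging`, and the objects it hands out are `Logger` instances (which is probably where the logger/logging confusion comes from):

```python
import logging

# The module is "logging"; what you get back from it is a Logger.
log = logging.getLogger(__name__)
print(isinstance(log, logging.Logger))  # True
```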
On Oct 8, 10:36 pm, Jason Smith <[email protected]> wrote:
> (What is the etiquette for replying to mixed top-post/bottom-post
> threads? Google Wave to the rescue!?!)
>
> I am also suffering from the logging circular buffer. Part of me feels
> like if a customer is paying triple-digit dollars per day to Google he
> should get more than a few gigs of logs! But to discuss a technical
> solution...
>
> Part a suggestion from my experience, and part a request from any
> Googlers: What I did was to subclass RequestHandler into
> LoggingRequestHandler (the site is a webapp app) which would create
> its own logger object and set the default log level to warn. Next, all
> logging happens through that logger instead of the root one in the
> logging module. Now I have no debug or info logs because they will be
> re-enabled only under special circumstances. To re-activate them, the
> LoggingRequestHandler pulls a very simple serialized dict from
> basically a singleton entity in the DB (or memcache), of the following
> format:
>
> {
>     "Name 1 (usually an issue id from our tracker)": [
>         ["HandlerClass1", "http_method", "get_or_post_var", "trigger value"],
>         ["HandlerClass2", "http_method", "get_or_post_var", "trigger value"]
>     ],
>
>     # More concrete example below
>     "Issue 12345": [
>         ["GetUsers", "get", "username", "[email protected]"],
>         ["PostUser", "post", "username", "[email protected]"]
>     ]
> }
>
> LoggingRequestHandler wraps self.get(), self.put(), etc., first
> checking whether its own name, the HTTP method, query parameters and
> values match those listed in the data structure. Those strings are
> treated as regexes, and there are some aliased "variables" like
> "__url__" which allow me to match against the entire URL. If any of
> the conditions match, the logging object's level is set to DEBUG.
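[Editor's note: a minimal sketch of the trigger-matching logic Jason describes, not his actual code. `should_enable_debug` and the `user@example.com` address are hypothetical (addresses in the thread are redacted); the real version wraps a webapp RequestHandler and loads the dict from the datastore or memcache.]

```python
import logging
import re

# Hypothetical trigger table in the format described above:
# issue name -> list of [handler_class, http_method, param_name, value] rules,
# where every field is treated as a regex.
TRIGGERS = {
    "Issue 12345": [
        ["GetUsers", "get", "username", r"user@example\.com"],
        ["PostUser", "post", "username", r"user@example\.com"],
    ],
}

def should_enable_debug(triggers, handler_name, method, params):
    """Return True if any rule matches this request's handler class name,
    HTTP method, and named request parameter value."""
    for rules in triggers.values():
        for cls_re, method_re, param, value_re in rules:
            if not re.match(cls_re, handler_name):
                continue
            if not re.match(method_re, method):
                continue
            value = params.get(param)
            if value is not None and re.match(value_re, value):
                return True
    return False

# The handler would run this check before dispatching each request:
logger = logging.getLogger("app")
logger.setLevel(logging.WARN)  # default: no debug/info logs
if should_enable_debug(TRIGGERS, "GetUsers", "get",
                       {"username": "user@example.com"}):
    logger.setLevel(logging.DEBUG)  # re-activated for matching requests only
```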
> Performance impact is negligible, and the main problem is that we now
> have to make a round-trip to users to tell them "logging is activated"
> and to reproduce their problem.
>
> Now to my question. I am suspicious that all the work I did to
> implement this is for nothing, because our logs are still rotating out
> into the nothing after only a day. Is this because the request volume
> is still so high (80 requests/sec, the vast majority with no logs)?
> Back in the bad old days we were logging over 1KB of data per request
> and I thought it was reasonable; but my expectation of having logs
> (all of which I'm interested in, by the way) linger on for many days
> turned out to be totally wrong. If anybody has any advice about this I
> would love to hear it.
>
> In the meantime, I think I will have to do what I probably should have
> done from the start: regularly fetch the logs to a separate system and
> search through them there.
>
> On Oct 6, 4:47 pm, "Nick Johnson (Google)" <[email protected]> wrote:
> >
> > Hi Neil,
> >
> > On Tue, Oct 6, 2009 at 1:00 AM, neil souza <[email protected]> wrote:
> >
> > > thanks nick, responses in line (hope they come through right, i don't
> > > really know how to use groups)
> > >
> > > On Sep 30, 2:57 am, "Nick Johnson (Google)" <[email protected]> wrote:
> > > >
> > > > Hi Neil,
> > > >
> > > > Sorry for the delay responding. Responses inline.
> > > >
> > > > On Sat, Sep 26, 2009 at 1:40 PM, neil souza <[email protected]> wrote:
> > > >
> > > > > the issue: it looks like we may not be getting all of our log entries
> > > > > when we pull the logs from app engine.
> > > > >
> > > > > first, a little context. there's a lot here, so bear with me.
> > > > >
> > > > > we need to record event lines for metrics. normally, we would write
> > > > > the lines to a local file on each app server and then pull those logs
> > > > > every few minutes from the metrics system. we found this to be the
> > > > > most stable and scalable architecture.
> > > > > however, in app engine land, we can't write to a file. so we wrote the
> > > > > event lines to the logs, set up a script to pull them in 10 minute
> > > > > intervals, and loaded them into the stats system.
> > > > >
> > > > > to be clear, the process goes like this:
> > > > >
> > > > > 1.) an event happens on the server that we'd like to record. we write
> > > > > a line to the log using logging.info(...) in python
> > > > >
> > > > > 2.) every 10 minutes, a job starts on a metrics server, which requests
> > > > > the next batch of logs by calling appcfg.py. the last log in the new
> > > > > batch is kept in an append file to use as the 'sentinel' for the next
> > > > > fetch.
> > > > >
> > > > > 3.) the new log file is parsed for event lines, which are written to
> > > > > another 'event' file.
> > > > >
> > > > > 4.) a few minutes later, another job grabs new event files and loads
> > > > > the events into the metrics system.
> > > > >
> > > > > everything seemed to work great. until we realized that we were
> > > > > missing events. a lot of them. between 20-50%.
> > > > >
> > > > > there are some events that need to be shared with other systems. for
> > > > > one of those event types, i was feeling lazy, so i just fired http
> > > > > hits at the other system as the events happened. at some point, we
> > > > > compared these numbers - and found them to be drastically different.
> > > > >
> > > > > i ran tests today comparing the number of events recorded 'through'
> > > > > the logs system and the same events recorded by http hit during
> > > > > runtime. the percent of 'missing' events ranged from 18-56%, and the
> > > > > percent missing appeared to be significantly higher when the frequency
> > > > > of events was higher (during peak).
> > > > >
> > > > > i've done a significant amount of work that points to the logs already
> > > > > being missing by the point that appcfg.py records them. i've reasonably
> > > > > verified that all the event lines that appcfg.py pulls down make it
> > > > > into the metrics system.
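[Editor's note: the sentinel bookkeeping in step 2 might look like the sketch below. `lines_after_sentinel` is a hypothetical helper, not code from the thread; it also shows why a rotated-out sentinel silently drops lines, which is the failure mode discussed next.]

```python
def lines_after_sentinel(batch, sentinel):
    """Keep only the log lines newer than the previous fetch's sentinel.

    `batch` is the list of lines just pulled via appcfg.py (oldest first);
    `sentinel` is the last line of the previous fetch, or None on the first
    run. If the sentinel has already rotated out of App Engine's circular
    log buffer, it won't be found, and we keep the whole batch -- exactly
    the situation in which lines in between have been lost for good.
    """
    if sentinel is None:
        return batch
    try:
        # Index of the sentinel's *last* occurrence in the batch.
        idx = len(batch) - 1 - batch[::-1].index(sentinel)
    except ValueError:
        return batch  # sentinel rotated out: possible gap in the logs
    return batch[idx + 1:]
```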
> > > > > oh, and all the numbers are being run on
> > > > > unique user counts, so there's no way that the counts could be
> > > > > mistakenly large (accidentally reading an event twice does not produce
> > > > > a new unique user id).
> > > > >
> > > > > my questions / issues:
> > > > >
> > > > > 1.) should we be getting all of our logged lines from appcfg.py's
> > > > > request_logs command? is this a known behavior? recall that we are
> > > > > missing 20-50% of events - this is not a small discrepancy.
> > > >
> > > > App Engine has a fixed amount of space available for logs; it's
> > > > essentially a circular buffer. When it runs out of space, it starts
> > > > replacing older logs.
> > >
> > > well, i'm going to guess that's the culprit.
> > >
> > > > > 2.) we're pulling our logs every 10 minutes. seeing as the
> > > > > request_logs command lets you specify the time you want in days, i
> > > > > imagine this is more frequent than intended. could this be causing an
> > > > > issue?
> > > >
> > > > How much traffic are you getting? What's the size of 10 minutes' worth
> > > > of logs?
> > >
> > > we're at maybe avg. 200 requests/sec and it looks like we're recording
> > > 1.25 events per request, so perhaps 250 log lines / sec? that's in
> > > addition to all the other junk getting spilled out there - i didn't know
> > > that space was limited, there's prob some debug output, then the
> > > exceptions, etc...
> >
> > There's a limit on the total amount of logs stored - the logs are
> > effectively stored in a circular buffer, so old data is overwritten as
> > needed to make space for new data.
> >
> > > > > 3.) we switch major versions of the app every time we push, which can
> > > > > be several times each day. this doesn't make sense as an issue since
> > > > > the numbers are known to be wrong over periods where there have been no
> > > > > version changes, but i wanted to mention it.
> > > > >
> > > > > 4.) can you suggest a better solution for getting data to offline
> > > > > processing?
> > > > > right now we're getting the correct numbers using the async
> > > > > http requests without ever calling get_result() or the like as a
> > > > > 'fire-and-forget' http hit (not even sure if we're supposed to use it
> > > > > like this, but it seems to work). however, this approach has serious
> > > > > drawbacks:
> > > >
> > > > You could log to the datastore, and read and delete old entries using
> > > > remote_api.
> > >
> > > this just doesn't seem like the right job for the datastore - we're
> > > only inserting at 250 events / sec right now, but need to be able to
> > > scale that. if we're inserting a few thousand events per second, and
> > > can only fetch or delete 1K at a time, that seems like a potential
> > > problem. we can batch up the events in each request, but that still
> > > only limits the inserts per second to the app's requests per second,
> > > which can have the same issue. just doesn't sound fun.
> > >
> > > maybe http requests are the best way to do it for now, unless we can
> > > get the log space significantly expanded or find another solution.
> >
> > If you're doing 250 events a second, HTTP requests are going to be equally
> > problematic - you'll be doing 250 of _those_ per second. 1000 events is
> > still 4 seconds' worth - so instead of doing 250 QPS of outgoing traffic,
> > you could be doing only 1/4 of a QPS of incoming traffic!
> >
> > -Nick
> >
> > > > > a.) http requests are very slow and expensive for something that does
> > > > > not need to happen immediately.
> > > > >
> > > > > b.) if the metrics system endpoint becomes unavailable, then, at best,
> > > > > the data gets lost. at worst the issue dominoes back up into the http
> > > > > servers and takes the app down as well. (each http request has to
> > > > > time out, which takes significantly longer, spiking the concurrent
> > > > > connections. this has screwed me multiple times.
> > > > > maybe you google guys
> > > > > mark an endpoint as down system-wide so that subsequent requests never
> > > > > attempt the connection, but we were never that smart).
> > > > >
> > > > > thanks in advance, neil.
> > > >
> > > > --
> > > > Nick Johnson, Developer Programs Engineer, App Engine
> > > > Google Ireland Ltd. :: Registered in Dublin, Ireland, Registration
> > > > Number: 368047

--
You received this message because you are subscribed to the Google Groups
"Google App Engine" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at
http://groups.google.com/group/google-appengine?hl=en
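[Editor's note: Nick's arithmetic (250 events/sec, fetched and deleted in batches of 1000 via remote_api, works out to 0.25 incoming QPS) can be sketched as below. `drain`, `fetch_batch`, and `delete_batch` are hypothetical stand-ins; with remote_api they would wrap actual datastore query and delete calls.]

```python
# Back-of-the-envelope for the batching suggestion.
EVENTS_PER_SEC = 250   # event rate quoted in the thread
BATCH_SIZE = 1000      # entities fetched/deleted per remote_api round trip

seconds_per_batch = BATCH_SIZE / float(EVENTS_PER_SEC)  # seconds of events per batch
fetch_qps = 1.0 / seconds_per_batch                     # incoming QPS needed to keep up

def drain(fetch_batch, delete_batch, process):
    """Pull-and-delete loop: repeatedly fetch the oldest batch of logged
    events, hand it to the metrics pipeline, then delete it from the store.
    Returns the total number of events processed."""
    total = 0
    while True:
        events = fetch_batch()
        if not events:
            return total
        process(events)
        delete_batch(events)  # only delete after processing succeeds
        total += len(events)
```

At 250 events/sec a 1000-entity batch covers 4 seconds of traffic, so the metrics system only has to make one fetch every 4 seconds instead of receiving 250 fire-and-forget hits per second.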
