[google-appengine] Re: missing logs

neil souza Mon, 05 Oct 2009 17:00:55 -0700

thanks nick, responses in line (hope they come through right, i don't
really know how to use groups)


On Sep 30, 2:57 am, "Nick Johnson (Google)" <[email protected]>
wrote:
> Hi Neil,
>
> Sorry for the delay responding. Responses inline.
>
> On Sat, Sep 26, 2009 at 1:40 PM, neil souza <[email protected]> wrote:
>
> > the issue: it looks like we may not be getting all of our log entries
> > when we pull the logs from app engine.
>
> > first, a little context. there's a lot here, so bear with me.
>
> > we need to record event lines for metrics. normally, we would write
> > the lines to a local file on each app server and then pull those logs
> > every few minutes from the metrics system. we found this to be the
> > most stable and scalable architecture.
>
> > however, in app engine land, we can't write to a file. so we wrote the
> > event lines to the logs, set up a script to pull them in 10 minute
> > intervals, and loaded them into the stats system.
>
> > to be clear, the process goes like this:
>
> > 1.) an event happens on the server that we'd like to record. we write
> > a line to the log using logging.info(...) in python
>
> > 2.) every 10 minutes, a job starts on a metrics server, which requests
> > the next batch of logs by calling appcfg.py. the last log in the new
> > batch is kept in an append file to use as the 'sentinel' for the next
> > fetch.
>
> > 3.) the new log file is parsed for event lines, which are written to
> > another 'event' file.
>
> > 4.) a few minutes later, another job grabs new event files and loads
> > the events into the metrics system.
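a minimal sketch of step 3 in python, assuming each event line carries a fixed marker. the "EVENT " prefix and the sample log lines are made up for illustration; the real log format isn't shown in this thread.

```python
EVENT_MARKER = "EVENT "  # hypothetical prefix written by logging.info(...)


def extract_events(log_lines, marker=EVENT_MARKER):
    """Return the payload of every line that contains the event marker."""
    events = []
    for line in log_lines:
        idx = line.find(marker)
        if idx != -1:
            # keep only what follows the marker, minus trailing whitespace
            events.append(line[idx + len(marker):].strip())
    return events


if __name__ == "__main__":
    sample = [
        "INFO EVENT signup user=42\n",
        "DEBUG something unrelated\n",
    ]
    print(extract_events(sample))
```

lines without the marker (debug output, exceptions) are skipped, so only event payloads reach the 'event' file.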
>
> > everything seemed to work great. until we realized that we were
> > missing events. a lot of them. between 20-50%.
>
> > there are some events that need to be shared with other systems. for
> > one of those event types, i was feeling lazy, so i just fired http
> > hits at the other system as the events happen. at some point, we
> > compared these numbers - and found them to be drastically different.
>
> > i ran tests today comparing the number of events recorded 'through'
> > the logs system and the same events recorded by http hit during
> > runtime. the percent of 'missing' events ranged from 18-56%, and the
> > percent missing appeared to be significantly higher when the frequency
> > of events was higher (during peak).
>
> > i've done a significant amount of work that points to the logs being
> > missing by the point that appcfg.py records them. i've reasonably
> > verified that all the event lines that appcfg.py pulls down make it
> > into the metrics system. oh, and all the numbers are being run on
> > unique user counts, so there's no way that the counts could be
> > mistakenly large (accidentally reading an event twice does not produce
> > a new unique user id).
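the point about unique counts can be shown in a few lines: counting distinct user ids with a set means that reading the same event twice can't raise the number. the (user_id, event_name) tuple shape here is an assumption for illustration only.

```python
def unique_users(events):
    """Count distinct user ids in a list of (user_id, event_name) tuples."""
    return len(set(user_id for user_id, _event_name in events))


if __name__ == "__main__":
    events = [("u1", "click"), ("u2", "click"), ("u1", "click")]
    # reading every event a second time leaves the unique count unchanged
    print(unique_users(events), unique_users(events + events))
```

so a double read can only leave the count unchanged; it can't explain a count that is too low.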
>
> > my questions / issues:
>
> > 1.) should we be getting all of our logged lines from appcfg.py's
> > request_logs command? is this a known behavior? recall that we are
> > missing 20-50% of events - this is not a small discrepancy.
>
> App Engine has a fixed amount of space available for logs; it's essentially
> a circular buffer. When it runs out of space, it starts replacing older
> logs.
>
well, i'm going to guess that's the culprit.
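a rough sketch of why a circular buffer would produce this pattern. this is not how app engine actually stores logs; the capacity and rates below are made-up numbers, and the buffer is modelled with a plain deque.

```python
from collections import deque


def fraction_lost(lines_per_sec, capacity_lines, pull_interval_sec):
    """Fraction of lines overwritten before a pull every pull_interval_sec."""
    written = lines_per_sec * pull_interval_sec
    if written == 0:
        return 0.0
    buf = deque(maxlen=capacity_lines)  # oldest entries drop off when full
    for i in range(written):
        buf.append(i)
    return 1.0 - len(buf) / float(written)


if __name__ == "__main__":
    # hypothetical buffer of 1000 lines, pulled every 600 seconds
    print(fraction_lost(1, 1000, 600))  # quiet period: everything fits
    print(fraction_lost(5, 1000, 600))  # peak: older lines are overwritten
```

in this model nothing is lost until the lines written between pulls exceed the buffer, and after that the lost fraction grows with the write rate. that fits the higher loss seen during peak.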
>
>
> > 2.) we're pulling our logs every 10 minutes. seeing as the
> > request_logs command lets you specify the time you want in days, i
> > imagine this as more frequent than intended. could this be causing an
> > issue?
>
> How much traffic are you getting? What's the size of 10 minutes' worth of
> logs?
>
we're at maybe avg. 200 requests / sec and it looks like we're recording
1.25 events per request, so perhaps 250 log lines / sec? that's in addition
to all the other junk getting spilled out there - i didn't know that
space was limited, there's prob some debug output, then the
exceptions, etc...
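spelling out the arithmetic, assuming the 200 figure is requests per second (which is what the 250 lines / sec estimate implies):

```python
requests_per_sec = 200        # rough average, assumed to be per second
events_per_request = 1.25
pull_interval_sec = 10 * 60   # logs are pulled every 10 minutes

event_lines_per_sec = requests_per_sec * events_per_request
event_lines_per_pull = event_lines_per_sec * pull_interval_sec

print(event_lines_per_sec)    # event lines written each second
print(event_lines_per_pull)   # event lines accumulated between two pulls
```

that is 250 event lines per second, or 150,000 event lines per 10 minute pull, before counting debug output and exceptions.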
>
>
> > 3.) we switch major versions of the app every time we push, which can
> > be several times each day. this doesn't make sense as an issue since
> > the numbers are known to be wrong over periods where there have been no
> > version changes, but i wanted to mention it.
>
> > 4.) can you suggest a better solution for getting data to offline
> > processing? right now we're getting the correct numbers using the async
> > http requests without ever calling get_result() or the like as a 'fire-
> > and-forget' http hit (not even sure if we're supposed to use it like
> > this, but seems to work). however, this approach has serious
> > drawbacks:
>
> You could log to the datastore, and read and delete old entries using
> remote_api.
>

this just doesn't seem like the right job for the datastore - we're
only inserting at 250 events / sec right now, but need to be able to
scale that. if we're inserting a few thousand events per second, and
can only fetch or delete 1K at a time, that seems like a potential
problem. we can batch up the events in each request, but that still only
limits the inserts per second to the app's requests per second, which
can have the same issue. just doesn't sound fun.
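a sketch of the per-request batching idea, kept independent of the datastore api. `write_batch` is a hypothetical callable standing in for whatever does the single write at the end of the request.

```python
class RequestEventBatch(object):
    """Collects events during one request and writes them in a single call."""

    def __init__(self, write_batch):
        self._write_batch = write_batch  # hypothetical: one write per call
        self._events = []

    def record(self, event):
        self._events.append(event)

    def flush(self):
        # write only if there is something to write, then start a new batch
        if self._events:
            self._write_batch(list(self._events))
            self._events = []


if __name__ == "__main__":
    writes = []
    batch = RequestEventBatch(writes.append)
    batch.record("signup user=42")
    batch.record("click user=42")
    batch.flush()
    print(len(writes))  # number of write calls made for this request
```

at about 1.25 events per request, this saves very little: the number of writes per second stays close to the number of requests per second.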

maybe http requests are the best way to do it for now, unless we can
get the log space significantly expanded or find another solution.

>
>
> > a.) http requests are very slow and expensive for something that does
> > not need to happen immediately.
>
> > b.) if the metrics system endpoint becomes unavailable, then, at best,
> > the data gets lost. at worst the issue dominoes back up into the http
> > servers and takes the app down as well. (each http request has to
> > timeout, which takes significantly longer, spiking the concurrent
> > connections. this has screwed me multiple times. maybe you google guys
> > mark an endpoint as down system-wide so that subsequent requests never
> > attempt the connection, but we were never that smart).
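the "mark an endpoint as down" idea is usually called a circuit breaker. a minimal in-process sketch follows; the class name, thresholds and cooldown are illustrative assumptions, and nothing here is an app engine feature. sharing the state across app instances would need some shared store, which isn't shown.

```python
import time


class EndpointBreaker(object):
    """Stops attempts to an endpoint after repeated failures, for a cooldown."""

    def __init__(self, max_failures=3, cooldown_sec=60, clock=time.time):
        self._max_failures = max_failures
        self._cooldown_sec = cooldown_sec
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the endpoint is considered up

    def allow(self):
        """Return True if a request to the endpoint should be attempted."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._cooldown_sec:
            # cooldown over: close the breaker and start counting again
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record_success(self):
        self._failures = 0

    def record_failure(self):
        self._failures += 1
        if self._failures >= self._max_failures:
            self._opened_at = self._clock()
```

the caller checks allow() before each metrics hit and skips the hit when it returns False, so a dead endpoint costs a few timeouts per cooldown window instead of one per request.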
>
> > thanks in advance, neil.
>
> --
> Nick Johnson, Developer Programs Engineer, App Engine
> Google Ireland Ltd. :: Registered in Dublin, Ireland, Registration Number:
> 368047