Hi Neil,

Sorry for the delay responding. Responses inline.

On Sat, Sep 26, 2009 at 1:40 PM, neil souza <[email protected]> wrote:

>
> the issue: it looks like we may not be getting all of our log entries
> when when pull the logs from app engine.
>
> first, a little context. there's a lot here, so bear with me.
>
> we need to record event lines for metrics. normally, we would write
> the lines to a local file on each app server and then pull those logs
> every few minutes from the metrics system. we found this to be the
> most stable and scalable architecture.
>
> however, in app engine land, we can't write to a file. so we wrote the
> event lines to the logs, set up a script to pull them in 10 minute
> intervals, and loaded them into the stats system.
>
> to be clear, the process goes like this:
>
> 1.) an event happens on the server that we'd like to record. we write
> a line to the log using logging.info(...) in python
>
> 2.) every 10 minutes, a job starts on a metrics server, which requests
> the next batch of logs by calling appcfg.py. the last log in the new
> batch is kept in a append file to use as the 'sentinel' for the next
> fetch.
>
> 3.) the new log file is parsed for event lines, which are written to
> another 'event' file.
>
> 4.) a few minutes later, another job grabs new event files and loads
> the events into the metrics system.
>
> everything seemed to work great. until we realized that we were
> missing events. a lot of them. between 20-50%.
>
> there are some events that need to be shared with other systems. for
> one if those event types, i was feeling lazy, so i just fired http
> hits at the other system as the event happen. at some point, we
> compared these numbers - and found them to be drastically different.
>
> i ran tests today comparing the number of events recorded 'through'
> the logs system and the same events recorded by http hit during
> runtime. the percent of 'missing' events ranged from 18-56%, and the
> percent missing appeared to be significantly higher when the frequency
> of events was higher (during peak).
>
> i've done a significant amount of work that points to the logs being
> missing by the point that appcfg.py records them. i've reasonably
> verified that all the event lines that appcfg.py pulls down make it
> into the metrics system. oh, and all the numbers are being run on
> unique user counts, so there's no way that the counts could be
> mistakenly large (accidentally reading an event twice does not produce
> a new unique user id).
>
> my questions / issues:
>
> 1.) should we be getting all of our logged lines from appcfg.py's
> request_logs command? is this a known behavior? recall that we are
> missing 20-50% of events - this is not a small discrepancy.
>

App Engine has a fixed amount of space available for logs; it's essentially
a circular buffer. When it runs out of space, it starts replacing older
logs.


>
> 2.) we're pulling our logs every 10 minutes. seeing as the
> request_logs command lets you specify the time you want in days, i
> imagine this as more frequent than intended. could this be causing an
> issue?
>

How much traffic are you getting? What's the size of 10 minutes' worth of
logs?


>
> 3.) we switch major versions of the app every time we push, which can
> be several times each day. this doesn't make sense as an issue since
> the numbers are know to be wrong over periods where there have been no
> version changes, but i wanted to mention it.
>
> 4.) can you suggest a better solution for getting data to offline
> processing? right now we getting the correct numbers using the async
> http requests without ever calling get_result() or the like as a 'fire-
> and-forget' http hit (not even sure if we're supposed to use it like
> this, but seems to work). however, this approach has serious
> drawbacks:
>

You could log to the datastore, and read and delete old entries using
remote_api.


>
> a.) http requests are very slow and expensive for something that does
> not need to happen immediately.
>
> b.) if the metrics system endpoint becomes unavailable, then, at best,
> the data gets lost. at worst the issue domino's back up into the http
> servers and takes the app down as well. (each http request has to
> timeout, which takes significantly longer, spiking the concurrent
> connections. this has screwed me multiple times. maybe you google guys
> mark an endpoint as down system-wide so that subsequent requests never
> attempt the connection, but we were never that smart).
>
> thanks in advance, neil.
> >
>


-- 
Nick Johnson, Developer Programs Engineer, App Engine
Google Ireland Ltd. :: Registered in Dublin, Ireland, Registration Number:
368047

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply via email to