Hi Roy,

On Thu, Jan 27, 2011 at 02:51:37PM -0500, Roy Smith wrote:
> > Try to think about these cases :
> >  - what to do with reqids that are already present in requests
> 
> I don't see how that would happen.  Maybe if we did something like stack one
> haproxy behind another, but I don't see any reason we would do that.

Maybe you have a simple enough setup. I know places where you can pass through
between 5 and 6 haproxies along a whole chain, simply because application
components are chained and each stage includes a load balancing feature.

>  But, if it were to happen somehow, I think we would just leave it untouched
> (and log that we saw it).

That's just the most common thing to do for the inner instances, but the outer
one needs an easy way to strip it, otherwise external users can inject the ID
they want into your system.
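
To make the inner/outer distinction concrete, here is a rough sketch in
Python rather than haproxy code. The header name, the function name and the
"trusted_upstream" flag are all made up for illustration; nothing in this
thread has settled on them.

```python
# Hypothetical header name; no name has been agreed on in this thread.
REQID_HEADER = "X-Request-ID"


def apply_reqid_policy(headers, trusted_upstream, new_id):
    """Return (headers, request_id) after applying the strip/keep policy.

    headers:          dict of request headers (exact-case lookup, kept simple)
    trusted_upstream: True for an inner instance sitting behind another proxy
    new_id:           the ID this instance would assign on its own
    """
    out = dict(headers)
    incoming = out.get(REQID_HEADER)
    if incoming is not None and trusted_upstream:
        # Inner instance: keep the ID assigned further up the chain.
        return out, incoming
    # Outer instance, or no ID present: overwrite whatever the client sent
    # so that external users cannot inject an ID of their choice.
    out[REQID_HEADER] = new_id
    return out, new_id
```

The important point is only that the choice is per instance and must be easy
to configure, not that it should look like this.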

>  The goal is to have a unique identifier so that every process that's
> involved in responding to a request can log the id, allowing us to correlate
> logs. If the incoming request already has such an id, there's no reason to
> change it.

Yes there is, see above ;-)

> > - what to do with reqids in responses
> >    => compare them with the request's, block if they do not match
> >    => delete them or not depending on where you're responding
> 
> I'm not sure I understand what you're asking.

Once you deploy unique IDs, it's common to seek better application
integration and have deeper application components return the ID they
received in the responses. That way, the outer component can compare
the ID it added with the ID it received in the response and ensure that
there was no session crossing in the whole chain, as it unfortunately
happens from time to time with buggy applications or components (mainly
in threaded environments).
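
The check itself is trivial. A minimal sketch in Python, assuming the
application echoes the ID back in a response header (the header name and the
three result strings are hypothetical):

```python
def check_response_reqid(sent_id, response_headers, header="X-Request-ID"):
    """Compare the ID added to the request with the one echoed in the response.

    Returns "ok", "missing" (the application did not echo an ID) or
    "mismatch" (possible session crossing somewhere in the chain).
    """
    echoed = response_headers.get(header)
    if echoed is None:
        return "missing"
    return "ok" if echoed == sent_id else "mismatch"
```

What to do on "mismatch" (block the response, just log it) and whether to
delete the header before responding are the policy questions raised above.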

> >  - what to log
> >    => do we always need to log a full reqid or can we sometimes just
> >       log one part of it
> 
> What do you mean by logging "one part" of an ID?  The ID is just a unique 
> tag.  I don't understand how it can be divided into parts.

An ID can only be unique in a limited space * time. When you have multiple
processes running on the same machine, a per-process counter is not enough
anymore so you need to discriminate on the process too, otherwise you end
up generating multiple identical "unique IDs". Then you add some machines
and you repeat the same process. Then you take into account the risk of
rollover of the values and you have to add a timestamp.

After about 10 years of feedback using unique IDs, I can say that some
features are definitely important :

  - having a timestamp allows you to easily sort your events and correlate
    them by time. It also tells you where to look

  - having some origin information (whether it's the instance which received
    the first event or the source address itself) helps a lot to correlate logs
    when some are missing. Logs are *always* missing when you want to
    correlate large amounts. You discover that one FS got full or that one
    syslog server was being restarted, or simply that you're dropping a few
    of them on the wire or in system queues, etc... When you can identify
    *where* the ID was created, you can reconstitute the missing parts of
    the chain (assuming you're not missing too many, of course).

  - having some source information generally helps quickly search for other
    occurrences of a similar suspicious event at places where it's hard to
    log source information. However, it's far from being a requirement, as
    there are always alternatives. It's just that it helps a lot.

  - having the ability to certify with good enough confidence that you're
    not misinterpreting the IDs and that it's not possible that a different
    event caused it. That's very important when you're bringing your logs
    to authorities. You don't want to make someone go to jail for someone
    else.

The minimum requirement I can identify for an haproxy-based ID to be unique
would include :
  - host ID (can be hostname)
  - system PID
  - timestamp
  - counter

The counter must be large enough so as not to roll over within a single
timestamp value. The host ID must be modulated by containers/zones/VMs/etc
if any are present. That's why it's often easier to split it again in two
parts, one being an environment ID or instance ID, which can be configured,
and another one being the system's host name which can optionally be
configured.
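
As an illustration only, the four fields plus the optional instance ID could
be assembled like this. It is a Python sketch, not a format proposal: the
class name, the separators and the fixed 32-bit hex widths are arbitrary
choices of mine.

```python
import os
import socket
import time


class ReqIdGenerator:
    """Builds IDs of the form [instance@]host:pid:timestamp:counter."""

    def __init__(self, instance="", host=None, pid=None):
        self.instance = instance  # configurable environment/instance ID
        self.host = host if host is not None else socket.gethostname()
        self.pid = pid if pid is not None else os.getpid()
        self.counter = 0

    def next_id(self, now=None):
        # Unix time in whole seconds; "now" can be forced for testing.
        ts = int(time.time() if now is None else now)
        # 32-bit counter, wraps to 0 after 0xFFFFFFFF.
        self.counter = (self.counter + 1) & 0xFFFFFFFF
        origin = f"{self.instance}@{self.host}" if self.instance else self.host
        return f"{origin}:{self.pid:08x}:{ts:08x}:{self.counter:08x}"
```

With instance "edge1", host "lb01", pid 1234 and a timestamp of 0x4D41CB89,
the first ID comes out as "edge1@lb01:000004d2:4d41cb89:00000001". Keeping
the fields separable is also what makes it possible to log only one part of
the ID when the rest is already known from context.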

Systems I have been using involved source and destination too for
convenience, but that's not absolutely needed, and they don't reduce
the minimum size of the counter by much.

Given that I have already managed to make a single instance process slightly
more than 2 million requests per second (pipelined, and extremely short), it
means that a 21-bit counter could be made to wrap around in one second in the
context of an attack. Reasoning with future possibilities, we can easily see
that 24 bits per second are not too much to support what could be done in a
few years.

Some organizations need to keep logs for 3, 5 and sometimes 10 years (I'm not
aware of more than that). 10 years is 315M seconds or 29 bits. So we need
29+24 bits split between timestamp and counter. Using two 32 bit entities,
one with the unix time and one with a 32 bit counter is handy and makes sense.
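
The arithmetic behind those numbers can be checked quickly. A small Python
sketch; the constant names are mine and the 10-year figure ignores leap days:

```python
def bits_needed(n_values):
    """Smallest number of bits able to represent n_values distinct values."""
    return max(1, (n_values - 1).bit_length())


PEAK_REQS_PER_SEC = 2_000_000          # roughly what one process reached
RETENTION_SECONDS = 10 * 365 * 86400   # 315,360,000 seconds in 10 years
FUTURE_COUNTER_BITS = 24               # headroom for the coming years

counter_bits_today = bits_needed(PEAK_REQS_PER_SEC)   # 21
timestamp_bits = bits_needed(RETENTION_SECONDS)       # 29
total_bits = timestamp_bits + FUTURE_COUNTER_BITS     # 53

# Two 32-bit fields (unix time + counter) leave room to spare.
assert total_bits <= 64
```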

The system pid has to cover both usages with nbproc > 1 (which could be done
with a relative ID) and independent parallel tasks, which really require the
system-global pid to discriminate them. While most systems are/were using
16-bit pids, things are evolving and 32-bit pids have been available for quite
some time now. Since this part rarely changes, it might make sense to have the
ability to configure its length. Also, in a few years we'll probably
support threads and will want to make a distinction between multiple threads
of the same process (though it's also possible that having multiple threads
share the same reqID generator could be OK if there aren't too many cores).
Maybe we could use a system-global thread ID instead of the process ID too.

Even with that, we're already at 3*32 bits + system ID, so as you see, it's
not just a simple counter even if the simple counter can fit some uses. And
I'd rather have people spot other possible discriminators before we code than
after. If we identify too many variables maybe we'll have to make the format
user-configurable.

Regards,
Willy

