Bob La Quey wrote:
Assuming you (a) have a customer who knows what they want,

Of course you do not have a customer who knows what they want.
That is one reason why definition is hard and requires a lot
of face to face.

Oh, just to clarify the sort of thing I meant when I said the "customer" might not know what they want, and if they do they might not be able to tell you:

In my current job, the original spec went something like this. (Mildly vague because it's still a couple weeks before we're live and public.)

We're going to start out in 3 cities, and maybe eventually 300 cities or so, but maybe not. In each city, we're going to have somewhere between 5 and 150 high-speed data streams, and an equal number of low-speed data streams. The data streams come in over a custom PCI card, which isn't built yet, but we're pretty sure it'll have drivers available for at least one of the operating systems we know run on the used hardware we'll be buying once all the deals are put together. The high-speed data streams need to be processed by a piece of software from a German company we're still negotiating with, so we don't know if it's a library, a server, or what. The low-speed data streams each have a format which must be reverse-engineered from samples - i.e., you must guess what it is. In addition, it'll change at times, and sometimes the people providing the stream will simply change it for their amusement, inserting random commentary in the middle of the data.

The library that processes the streams comes from a company full of physicists, so they don't understand that a SIGSEGV is almost always the fault of the C++ code they wrote; instead, they'll insist it must be because we have multiple Linux kernels installed on the disk, even tho the software merely does complex math on data files. There's no logging in their server, so they can't track down their crashes, so they simply cure the problem by ignoring all signals, conveniently making it impossible to cleanly shut down their server once it starts. And it takes several minutes to come back up (which of course one couldn't tell before getting the software and writing enough of the system to fill it up). So we'll want to run two of each server, and insert records into each, except that their client library, while asking you to identify which server you want to use each time, ignores it every time except the first time. They have no version control, so if you find a bug in a month-old version you've tuned your software to work well with, they cannot give you the same version with the bug fixed, and you have to take the latest version, which often has a completely different set of client interfaces. They take a variety of arguments on the command line, but fault out if any don't match what they recommended, so any change in what you want to do with their server requires a round-trip engineering change, recompilation in Germany, upgrading to the new version of the client APIs, etc. They are, unfortunately, the only company in the world that does what they do.
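For what it's worth, the "run two of each server and insert into both" workaround boils down to a simple fan-out. This is only a sketch with invented names (`dual_insert`, the client callables); the real vendor library's interface is nothing so tidy, and in practice each server needs its own dedicated process because of the "only the first server selection counts" bug:

```python
def dual_insert(record, clients):
    """Insert `record` into every redundant server, tolerating failures.

    `clients` is a list of callables, one per redundant server (hypothetical
    stand-ins for the vendor's client library). Returns how many inserts
    succeeded; raises only if *every* server rejected the record.
    """
    ok = 0
    errors = []
    for client in clients:
        try:
            client(record)
            ok += 1
        except Exception as exc:  # the real library fails in creative ways
            errors.append(exc)
    if ok == 0:
        raise RuntimeError(f"all inserts failed: {errors}")
    return ok
```

The point of tolerating partial failure is that one of the two servers is usually mid-restart for several minutes at a time.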

Then the customers connect in over one network owned by S, delivering a large bolus of data each time. The company providing that network (S) has insufficient engineering resources, so they simply ignore bug reports, refuse to make changes you've already paid for, and in general do things like require invalid XML as the interface into their system and use SOAP libraries that are specifically coded to talk only to the same SOAP library, rather than actually following the standard. S describes the data they will provide for us from each customer. They lie: only after the system is up and running enough to accept their data, when we realize they're providing all zeros in those fields, do they admit they're unable to provide any of the metadata they promised, forcing us to replicate the work for each customer in each city (3, or 300) instead of doing it only in the city where the customer is. They are, unfortunately, the only company in the world that does what they do.

The boss wants to support 10 customer requests per second, which you can fit over a T1, so you design the system around that, until three months later the boss wants to support 400 customer requests per second because he made a deal with someone to get the funding you need to build it in the first place. This requires moving parts of the system next to the NOC of the "S network" from the previous paragraph.
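The back-of-envelope math here is simple enough to write down (using the T1 line rate of 1.544 Mbps; real payload and protocol overhead would shave the budget further). A 40x jump in request rate cuts the per-request byte budget from roughly 19 KB to under 500 bytes, which is why the hardware had to move:

```python
T1_BPS = 1_544_000  # T1 line rate in bits per second (payload is slightly less)

def bytes_per_request(requests_per_second, link_bps=T1_BPS):
    """Rough per-request byte budget on a link, ignoring protocol overhead."""
    return link_bps / 8 / requests_per_second

# 10 req/s on a T1 leaves ~19,300 bytes per request;
# 400 req/s leaves ~482 bytes per request.
```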

The customer requests get matched against the data streams from the cities, which might require sending the request out to the city or might require piping all the city streams back to the customer-serving computer, depending on (1) how much the German software compresses data, (2) how many customers are calling in, and (3) how many cities have your custom PCI cards in them.

We also have a data stream from company M which, altho it's XML, company M is unable to document: they can't say what any of the attributes or tags actually mean, beyond what's implied by the names of the tags. XML in this case is simply used as a mechanism to avoid having to document how to read the data. The library to access it is opaque. It also requires providing a buffer size, but M doesn't document any upper limit on the size of the buffer. The client library never releases the memory it uses, so you have to run for a bit, exit, and restart. About five times a day, one of the records will be corrupt, unparsable XML. The records include timestamps but no timezone data. Occasionally, the connection to company M will get randomly deconfigured, and your connection to M will be off the air until they get around to resetting it, not that this was part of M's documentation of course. Company M is, unfortunately, the only company in the world that does what they do.
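Defensively skipping the handful of daily corrupt records is about the only sane response. A rough sketch, assuming the records arrive as individual strings and using Python's stock XML parser as a stand-in for M's opaque library:

```python
import xml.etree.ElementTree as ET

def parse_records(raw_records):
    """Parse a batch of XML records, skipping any that are corrupt.

    A few records a day arrive as unparsable XML, so a hard failure on one
    record must never take down the whole feed. Returns the parsed records
    and a count of the ones we had to drop (which you'd log in real life).
    """
    good, bad = [], 0
    for raw in raw_records:
        try:
            good.append(ET.fromstring(raw))
        except ET.ParseError:
            bad += 1  # skip it and move on; the stream never stops
    return good, bad
```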

We also get a *third* type of low-speed data stream from various places in various cities. Those streams are also fairly willy-nilly, in that while the data is the best available, none of it is actually authoritative. For example, on one particular stream, between 6AM and 9AM weekdays, the data should be ignored and company M's data used instead, except when the data starts with a Z. Others will be like that too, but we don't know which until we get the streams in and see. The data is always timely, unless it's not, but you can't tell that. And the actual content of the data differs from that provided by M, so you can't really match it up in any good way. Usually. Oh, and each of these streams comes from a different partner with specific deals and advertising models and so on, so they'll want to be able to customize things throughout the system when it's their stream we're using, but we don't know which kinds of things. We'll let you know what we need to do after we sign the contracts promising to do it. By the way, make sure the software that's accepting the connections every few seconds for these low-speed data streams never breaks, because it *takes out your partner's servers* if they can't connect to you after they've configured that connection, and your partner goes off the air until you come back online. Not that you know this until a power supply fails in such a way as to trip the breaker on the rack's UPS and power down all your servers. Remember to include that in your spec.
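The 6AM-to-9AM rule above, written down, looks like this. The function name and the exact rule shape are invented; the real point is that each stream ends up with its own hand-discovered variant of this, none of which appear in any spec:

```python
from datetime import datetime

def use_stream_record(record, ts):
    """Hypothetical per-stream override rule.

    On weekdays between 6AM and 9AM, ignore this stream in favor of
    company M's data, unless the record starts with 'Z'. Rules like this
    are discovered per stream after the data starts flowing, not specified.
    """
    is_weekday = ts.weekday() < 5      # Monday..Friday
    is_morning = 6 <= ts.hour < 9
    if is_weekday and is_morning and not record.startswith("Z"):
        return False  # fall back to company M's data instead
    return True
```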

We match up the real-time customer data coming in from S's network to the real-time data coming in on the high-speed data streams, correlate with the somewhat-real-time low-speed data streams, and generate a web page with the results. The contents of the web page have to be approved by all of the other customers of S, in advance, with whatever changes they want, while live and available to the public, before S will allow us to use S's network, in spite of the fact that S's network has nothing to do with the Internet, doesn't carry internet traffic, and S's other customers don't carry internet traffic.

Whether the matching of data occurs in the individual cities or in the central servers depends on whether it's less bandwidth to ship the customer data to the cities or the city data to the customer server. Of course, since we don't know how much data the German software generates nor how many customers we'll be processing within a couple orders of magnitude, we'll have to decide that a few months from now. Make sure you order the right number of T1's, which also take a month or two to install. We'll let you know where we're putting the servers after we sign the leases, at which point we'll be able to tell you how many high-speed data streams are available at that location.
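The decision itself is one comparison; the joke is that none of the inputs are known within a couple orders of magnitude. A sketch, with invented names, of the tradeoff as stated:

```python
def process_in_city(city_stream_bps, request_bps, compression_ratio):
    """Decide whether to match data in the city or at the central servers.

    Matching in the city means shipping customer requests out to it;
    matching centrally means piping the city's (compressed) streams back.
    Returns True if matching in the city moves fewer bits over the WAN.
    All three inputs here are guesses, which is exactly the problem.
    """
    stream_haul_bps = city_stream_bps / compression_ratio  # cost of piping streams in
    return request_bps < stream_haul_bps                    # cheaper to send requests out?
```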

By the way, the matching is also heuristic. About 40% of the time, you'll get 10 different matches, all with zero confidence. About 10% of the time, you'll get multiple 100%-confidence matches of conflicting information. Make sure to figure out, in advance of outsourcing, what algorithm will give you good results in spite of such output, which you didn't know you'd get, and which the Germans were surprised you got, and more surprised you figured out how to work around.
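One plausible shape for the workaround (and only that; the actual algorithm we used isn't in this message) is to treat both pathological cases as "no answer" rather than guessing:

```python
def filter_matches(matches):
    """Pick a usable answer from heuristic matcher output, or give up.

    `matches` is a list of (answer, confidence) pairs. The two bad cases
    from the matcher: many matches all at zero confidence (return None),
    and multiple conflicting matches all at 100% confidence (also None).
    Otherwise return the highest-confidence answer. A sketch, not the
    production algorithm.
    """
    confident = [(a, c) for a, c in matches if c > 0]
    if not confident:
        return None  # the ~40% case: all-zero-confidence noise
    certain = {a for a, c in confident if c >= 1.0}
    if len(certain) > 1:
        return None  # the ~10% case: conflicting "certain" answers
    return max(confident, key=lambda m: m[1])[0]
```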

Oh, by the way, the web-services database you're contractually obligated to fall back to when you don't get any matches at all? It's mostly answers your customers are sure not to be interested in, so that'll give ten bogus matches of zero confidence also, except that about 30% of the time the right match is in the list one or more times. So by the way, three weeks before launch, we decide we need to build our own database to match against too. In spite of not having any actual authoritative data to build it with. Do take care of that, won't you? Make sure you have at least two weeks of data in it when we launch. Note that we don't have any machines to actually run this database on, as they're all repurposed to things like the backup servers for when the German software crashes and we have to use the hot spare while the primary comes back up. All that's left are four machines which don't see the custom cards, two with disks too small to hold an hour worth of data, and two with huge disks but (for yet-unknown reasons) will only run one disk at about 5% of the speed it's capable of, in spite of buying a new controller card and new disks.

Then we also offer links to "affiliates" on those web pages, in order that we get paid for all this. Some of the links are obtained by downloading and processing a 6-gig CSV data file which may or may not have changed; it gets updated once a week, but the provider contractually requires you to download it daily, via FTP, from an FTP server which serves only those files which do *not* appear in directory listings, and refuses those which *do* appear, ensuring you cannot look at the timestamp or size of the file to see whether it has changed. (And if you try to download the file that *does* appear in the directory listing, they lock out your account, for reasons still unknown.) Since you have to download the entire file every time, remember to spend the 3 hours of compute time (after the 5 hours of downloading) to figure out which rows have changed, without actually impacting the processing of the system, a system which hasn't been built yet and whose performance you therefore don't know. (Note, incidentally, that 6G is a fairly small lump of data for this system. I not uncommonly wind up with individual files I process that take >30 seconds to delete.)
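The "which rows changed" step is the classic hash-diff trick: keep a set of per-row hashes from last time and stream the new file past it, so memory stays flat no matter how big the file gets. A sketch under those assumptions (invented function name; real code would also persist the hash set and handle the CSV quoting mess):

```python
import hashlib

def changed_rows(old_hashes, new_rows):
    """Find rows in the new download that weren't in the previous one.

    `old_hashes` is the set of per-row SHA-1 hex digests from last time;
    `new_rows` is an iterable of row strings (stream it from disk so a
    6-gig file never has to fit in memory). Returns the changed/new rows
    plus the new hash set to save for next time. Deleted rows would show
    up as hashes in `old_hashes` missing from the returned set.
    """
    new_hashes = set()
    changed = []
    for row in new_rows:
        digest = hashlib.sha1(row.encode()).hexdigest()
        new_hashes.add(digest)
        if digest not in old_hashes:
            changed.append(row)
    return changed, new_hashes
```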

Other affiliates are equally f'ed in other undocumented ways, but each of them is the only company in the world that provides the data they do. By the way, which affiliates we present to the customer depends on the browser and ISP the customer is using, and sometimes the physical location of the customer. We don't have the data that does that matching, but we might be able to get it from someone sometimes, and perhaps volunteers will keep it up to date for us. Or we can guess that the customer is probably in the same city as the data he most recently matched against, maybe, if we can tell what that is. Unless he never talked to us before.

We use multiple mechanisms to reach out to customers, and the mechanism that works is dependent on the service provider of the customer. We don't always know the service provider (as some customers we talk to before they ever try to contact us), and sometimes when they connect in, our provider will lie about what service provider the customer is using if they recently changed service providers. We also don't know which mechanism works on which service provider, because the service providers actively try to keep you from reaching your customers, and the various mechanisms try to work around the blocks in various ways.

Make sure, by the way, that you spec out for outsourcing exactly how you implement the processing for each affiliate, given that the system will be live with customers before the contracts with the affiliates are signed, meaning the affiliates won't disclose their super-secret APIs.

Oh, and S also provides data, indexed with yet another non-authoritative string. The strings used for indexing are in a variety of character sets, but S doesn't bother to note which records use which character sets, so pretty much anything but ASCII is f'ed anyway. Hope nothing comes in that corrupts your database. Assuming you can get updates from them at all, which randomly stop working until you work your way up the management ladder to where they finally admit that, since you prepaid them a large lump sum, maybe they should assign a programmer to figuring out why their services randomly stop working for a day or two at a time. Make sure you remember to spec that your updates have to work in spite of that.
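About the best you can do with undeclared character sets is guess and record the guess. A minimal sketch (the function name is invented, and a real system would try more candidate encodings than these two): try UTF-8, and fall back to Latin-1, which decodes any byte sequence without error but may mislabel the text, so tag the result for downstream code:

```python
def decode_index_string(raw: bytes):
    """Decode an index string of unknown character set.

    Tries strict UTF-8 first; on failure falls back to Latin-1, which
    never raises but may produce the wrong characters. Returns the text
    plus the name of the charset we guessed, so the guess is auditable.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("latin-1"), "latin-1"
```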

And S, upon hearing you've finished coding to the APIs they provided a couple months ago, decide that the person who provided the APIs should have provided different APIs, and insist you rework the message flow through the entire system, because they're contractually obligated to know what happened with the customer asking for access to their database and be able to audit and prove the customer did or did not get certain messages from our system. Which they didn't mention until that part of the system was already written, tested, and in production.

By the way, we don't know how long it takes for the German software to do any processing. Nor do we know how timely M's data is. But we need to provide the answer to the user within a minute, or it's not a viable business model. Does that make a difference? How many computers will we need? Because we need to get rack space and power two months in advance.

Now, do me a favor, and write down a complete and accurate spec for how to build that system, so I can outsource it. Oh, and let me know how long it'll take before you start coding. And how many computers, because we'll get the funding to actually sign contracts with all these entities once we know how much it'll cost to build. We'll schedule our nationwide advertising blitz to coincide with your estimates. Thanks!

Note that I'm not exaggerating any of this.

Note that this is also not a cut at my bosses or anything like that. It's just a complicated system being put together with a variety of constraints that are difficult to work around. The bosses are great, and very understanding, and doing an outstanding job with their part of the responsibilities for making this a success.

*** *** ***

When you have a nice self-contained build-from-scratch system, it's not too hard to come up with a spec, then implement to it. That's what I do all the time, in that situation. One might even believe outsourcing in such a situation might be worthwhile, if you could figure out which outsourcing companies were reliable enough to actually implement what you spec'ed.

When you have a full-blown system whose value is in putting together a bunch of pieces that nobody ever used that way before, it's more than a bit harder.

--
  Darren New / San Diego, CA, USA (PST)
    It's not feature creep if you put it
    at the end and adjust the release date.

--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg
