These are issues I (and many others) have grappled with for many years. I 
have strong opinions that deftly straddle both sides. So - I can't be wrong! 
To address the points Mikhail raised, I'll use the context of predicting 
sales from data.

1) "we assume that our data reflect adequately business issues (customer 
behavior) "

The question here is what "adequately" means, and what counts as "customer 
behavior". Defining these precisely is essential to developing an accurate, 
useful prediction system. Understanding what is "adequate" is tough. For the 
client, it initially means "better than what I do now." Later, it evolves 
into something like "error of 5-10%." For sales prediction, that is an 
impossible standard, so in the end the client will be unhappy!
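The post doesn't say how that error figure would be measured, so as an 
illustration only, here is one common way to quantify it - mean absolute 
percentage error (MAPE) - with made-up sales numbers:

```python
# Hypothetical sketch: quantifying an "error of 5-10%" target as mean
# absolute percentage error (MAPE). The metric choice and the numbers
# below are assumptions for illustration, not from the original post.
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - p) / a
                       for a, p in zip(actual, predicted)) / len(actual)

actual = [120, 95, 140, 110]     # made-up weekly unit sales
predicted = [130, 90, 150, 100]  # made-up forecasts
print(round(mape(actual, predicted), 1))  # ~7.5, inside the 5-10% band
```

Even this toy forecast, off by only 5-10 units a week, sits in the middle 
of that band - which hints at how hard a hard 5% ceiling would be to hold.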

Several problems cause this.

First, the customers are not homogeneous. Different groups respond 
differently to the same stimuli, and the groupings of similarly behaving 
customers you can develop for one product are not the same as for another. 
For example, knowing how a customer responds to a Coke promotion doesn't 
necessarily tell you how he/she will respond to a Tide promotion.

Second, you don't always have the most important data you need. Normally 
for sales, you will have price and volume data for the item of interest and 
its competitors (identifying the competitors is another problem ...). But 
many data pieces that have major effects on sales (or stock prices, 
inventory levels, etc.) are not what I call "observable" in the data the 
client can give you. This "unobservable" data can include a major sale on 
the item by the WalMart across the street from a store, a major snowstorm 
that keeps people out of the stores, errors in the shelf price tag, 
stockouts in the distribution chain, local population changes due to 
holidays, etc. Sometimes this "unobservable" data can be obtained, but it 
takes a lot of work and is very expensive.

Third, even though you may have what you think is lots of data (typical 
retail data sets hold tens of billions of transactions), it isn't enough! 
By the time you develop a model you think has all the important 
variables/features (e.g., price, time of day, day of week, day of month, 
month of year, prices of major competing items in the store, etc.), and 
develop a reasonable number of values for each that lead to different 
behavior, you find you have a very large multidimensional matrix in which 
many of the elements have only a few (0-10) observations. Theoretically, 
you need 20+ observations per element to get statistically valid results.

Fourth, the data you get is often "dirty", with e.g. price errors, 
unidentified replacement products, and so on. We have found that anywhere 
from 30-80% of the time required for an analysis/model-development task 
goes to understanding and cleaning the data the client provides.
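The sparsity point above is easy to check with back-of-the-envelope 
arithmetic. All the counts in this sketch are invented for illustration 
(the post gives only "tens of billions of transactions" and "20+ 
observations per element"):

```python
# Rough sketch of the "sparse multidimensional matrix" argument: even
# billions of transactions spread thin across a cross-product of feature
# values. Every count below is an illustrative assumption, not a real
# retail figure.
price_points      = 20   # distinct price levels for the item
hours_of_day      = 16   # store operating hours
days_of_week      = 7
days_of_month     = 31
months            = 12
competitor_prices = 20   # discretized competitor price levels

cells = (price_points * hours_of_day * days_of_week *
         days_of_month * months * competitor_prices)

transactions = 10_000_000_000   # "tens of billions", lower end
items = 50_000                  # assumed number of SKUs sharing them

per_item = transactions / items   # transactions available per item
per_cell = per_item / cells       # average observations per matrix cell
print(f"{cells:,} cells, ~{per_cell:.3f} observations per cell")
```

Under these assumptions the matrix has over 16 million cells but only a 
small fraction of one observation per cell on average - nowhere near the 
20+ needed for statistically valid estimates.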

There are of course other problems, but the ones above tend to be the most 
significant.
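To make the fourth point concrete, here is a minimal sketch of one kind of 
cleaning check - flagging suspicious prices before modeling. The records, 
field names, and threshold are all invented for illustration:

```python
# Hedged sketch of one dirty-data check: flagging likely price errors
# (e.g., a misplaced decimal point) before modeling. The data and the
# 3x-from-median threshold are illustrative assumptions.
records = [
    {"sku": "A1", "price": 2.49},
    {"sku": "A1", "price": 2.59},
    {"sku": "A1", "price": 0.25},   # likely a decimal-point entry error
    {"sku": "A1", "price": 2.39},
]

prices = sorted(r["price"] for r in records)
median = prices[len(prices) // 2]

# Flag prices more than 3x away from the median in either direction.
suspect = [r for r in records
           if not (median / 3 <= r["price"] <= median * 3)]
print(suspect)  # the 0.25 record gets flagged for human review
```

A real cleaning pass involves dozens of such checks, plus judgment calls 
on replacement products and the like - which is where that 30-80% of the 
project time goes.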

2) "we update (patch) our data-collecting software very often."

I don't understand why this is a problem. Normally, data-collection 
software for business (e.g., point-of-sale cash register data) is pretty 
robust. I assume he means that as new types of data (e.g., new 
variables/features) are discovered or developed, and as dirty data is 
cleaned, the models you develop will change. This should be done. The 
process we use to develop statistical BI models is:

a) clean the data,
b) examine it to understand it as much as possible and identify important 
features/variables,
c) talk to experts to develop "domain knowledge",
d) develop desired performance specifications with the client,
e) develop and test a model,
f) figure out why the results are so bad,
g) modify algorithms, add or subtract data types,
h) repeat until the results are "good enough", the money runs out, the 
client gets antsy, etc.
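Steps e) through h) amount to an iterate-until-good-enough loop. A minimal 
sketch, where fit_model, evaluate, and revise are hypothetical placeholders 
for whatever modeling and diagnostics a real project would use:

```python
# Sketch of the iterate-until-"good enough" loop in steps e)-h).
# fit_model, evaluate, and revise are toy placeholders (assumptions),
# not a real modeling stack.

def fit_model(data):
    # placeholder: the "model" is just the mean of the observations
    return sum(data) / len(data)

def evaluate(model, data):
    # placeholder error: mean absolute deviation, relative to the model
    return sum(abs(x - model) for x in data) / len(data) / model

def revise(data, model):
    # placeholder step g): drop the worst outlier (dirty data) and refit
    worst = max(data, key=lambda x: abs(x - model))
    data = [x for x in data if x != worst]
    return data, fit_model(data)

def develop_model(data, target_error=0.10, budget=5):
    model = fit_model(data)                  # step e): develop a model
    for _ in range(budget):                  # budget = money/patience
        error = evaluate(model, data)        # step e): test it
        if error <= target_error:            # "good enough" - stop
            return model, error
        data, model = revise(data, model)    # steps f)-g): diagnose, fix
    return model, evaluate(model, data)      # step h): money ran out

model, error = develop_model([100, 105, 95, 98, 400])  # 400 is dirty
print(round(error, 3))
```

The stopping conditions - error target, budget, patience - are exactly the 
competing pressures step h) describes.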

I think that changing your data structures and models is usually an 
important and necessary part of developing a model that will meet your 
client's accuracy requirements.

Nuff said.

Jack Stafurik

>
> Message: 1
> Date: Sat, 03 Mar 2007 11:23:20 -0500
> From: "Phil Henshaw" <[EMAIL PROTECTED]>
> Subject: Re: [FRIAM] Subtle problem with BI
> To: "'The Friday Morning Applied Complexity Coffee Group'"
> <[email protected]>
> Message-ID: <[EMAIL PROTECTED]>
> Content-Type: text/plain; charset="iso-8859-1"
>
> I don't quite understand the details, but it sounds like a kind of 'ah ha'
> observation of both natural systems in operation and the self-reference
> dilemma of theory.   My rule is try to never change the definition of
> your measures.  It's sort of like maintaining software compatibility.
> If you arbitrarily change the structure of the data you collect you
> can't compare the old and new system structures they reflect, nor how your
> old and new questions relate to each other.   It's such a huge
> temptation to change your measures to fit your constantly evolving
> questions, but basically..., don't do it.  :)
>
>
>
> Phil Henshaw
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 680 Ft. Washington Ave
> NY NY 10040
> tel: 212-795-4844
> e-mail: [EMAIL PROTECTED]
> explorations: www.synapse9.com <http://www.synapse9.com/>
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
> Behalf Of Mikhail Gorelkin
> Sent: Tuesday, February 27, 2007 5:06 PM
> To: FRIAM
> Subject: [FRIAM] Subtle problem with BI
>
>
>
> Hello all,
>
>
>
> It seems there is a subtle problem with BI (data mining, data
> visualization, etc.). Usually we assume that our data reflect adequately
> business issues (customer behavior), and in the same time we update
> (patch) our data-collecting software very often, which reflects the very
> fact of its (more or less) inadequacy! So, our data also have such
> inadequacy! but we never try to estimate it 1) to improve our software;
> 2) to make our business decision more accurate. It looks like both our
> data-collecting software and BI are linked together forming a business
> (and cybernetic!) model.
>
>
>
> Any comments?
>
>
>
> Mikhail
>
>


============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
lectures, archives, unsubscribe, maps at http://www.friam.org
