No, it isn't always a good idea, but it is often a good idea for some kinds of input.
More specifically, if the input is the sort of thing generated by a process with normally distributed values, then normalizing the way you did is probably bad; it would be better to standardize the input (shift to zero mean and unit variance by translation and scaling) or just leave it alone. If the input doesn't have that kind of error process, then you need to transform it into something that does.

Count data, for example, doesn't have the right kind of distribution: a direct L2 or L1 comparison of raw counts mostly just tells you which sample had more trials rather than what is really different. Dividing each count by the sum of the counts (aka L1 normalization) gives you estimates of multinomial probabilities, which are approximately normally distributed, so you will be fine there. Other length-dependent data sources might require L2 normalization instead. Paradoxically, L2 normalization rather than L1 is very commonly used for term counts from documents, and it isn't clear which will actually work better. Frankly, I would rather move to a more advanced analysis than worry about that difference. (A quick sketch of the three options is appended below the quoted message.)

On Mon, Jun 1, 2009 at 6:12 AM, Shashikant Kore <[email protected]> wrote:

> From this issue, it seems the input vectors should be L1/L2
> normalized. Is it a good idea to always normalize the input document
> vectors? If yes, can we make appropriate changes to JIRA 126 (create
> document vectors from text)?

--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
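[Appended sketch: a minimal, self-contained Java illustration of the three options discussed above. This is not Mahout's API; the class and method names are hypothetical and the code is just meant to make the distinctions concrete. The main method shows how two count vectors with the same proportions but different totals become identical under L1 normalization, while their raw values mostly reflect the number of trials.]

import java.util.Arrays;

public class NormalizationSketch {

    // Standardize: shift to zero mean and scale to unit variance.
    // (Assumes the vector is not constant, i.e. sd > 0.)
    static double[] standardize(double[] v) {
        double mean = 0;
        for (double x : v) mean += x;
        mean /= v.length;
        double var = 0;
        for (double x : v) var += (x - mean) * (x - mean);
        double sd = Math.sqrt(var / v.length);
        double[] r = new double[v.length];
        for (int i = 0; i < v.length; i++) r[i] = (v[i] - mean) / sd;
        return r;
    }

    // L1 normalize: divide by the sum of absolute values. For count
    // data this turns raw counts into estimated multinomial probabilities.
    static double[] l1Normalize(double[] v) {
        double sum = 0;
        for (double x : v) sum += Math.abs(x);
        double[] r = new double[v.length];
        for (int i = 0; i < v.length; i++) r[i] = v[i] / sum;
        return r;
    }

    // L2 normalize: divide by the Euclidean length; the common choice
    // for term-count vectors from documents.
    static double[] l2Normalize(double[] v) {
        double norm = 0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        double[] r = new double[v.length];
        for (int i = 0; i < v.length; i++) r[i] = v[i] / norm;
        return r;
    }

    public static void main(String[] args) {
        // Same proportions, 10x the trials: raw L1/L2 distance between
        // a and b mostly measures the difference in totals, but the
        // L1-normalized versions come out identical.
        double[] a = {2, 4, 6};
        double[] b = {20, 40, 60};
        System.out.println(Arrays.toString(l1Normalize(a)));  // [0.1667, 0.3333, 0.5]
        System.out.println(Arrays.toString(l1Normalize(b)));  // [0.1667, 0.3333, 0.5]
        System.out.println(Arrays.toString(l2Normalize(a)));
        System.out.println(Arrays.toString(standardize(a)));
    }
}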
