WRT my blog post:

It seems the problem is that the distribution for lengthNorm() starts at 1 and moves down from there. Returning a constant 1.0f would work, but then HUGE documents would not be normalized at all and so would distort the results.
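
To make the starting point concrete, here is a minimal sketch of the two options being compared. It assumes the stock norm is 1/sqrt(numTokens), which is my understanding of what Lucene's DefaultSimilarity computes; the class and method names are made up for illustration and are not part of Lucene.

```java
// Hypothetical helper class for illustration only; not part of Lucene.
class BaselineNorms {

    // Assumed stock behavior: 1/sqrt(numTokens), so 1.0 for a single token
    // and falling from there as the field gets longer.
    static float defaultNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    // The constant alternative: every field gets the same norm,
    // no matter how long it is.
    static float constantNorm(int numTokens) {
        return 1.0f;
    }

    public static void main(String[] args) {
        int[] lengths = {1, 4, 100, 10000};
        for (int n : lengths) {
            System.out.println(n + " - default " + defaultNorm(n)
                    + ", constant " + constantNorm(n));
        }
    }
}
```

With the assumed default, the shortest fields always get the largest norm; with the constant, a 10000-token field is weighted the same as a one-token field.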

What would you think of using this implementation for lengthNorm:

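    public float lengthNorm( String fieldName, int numTokens ) {

        final int THRESHOLD = 50;
        int nt = numTokens;

        // Short fields: shift by one so a single token does not yield 1 - 1 = 0.
        if ( numTokens <= THRESHOLD )
            ++nt;
        // Long fields: measure the length relative to the threshold.
        if ( numTokens > THRESHOLD )
            nt -= THRESHOLD;

        float v = (float)(1.0 / Math.sqrt(nt));

        // Short fields: invert so the norm rises with length instead of falling.
        if ( numTokens <= THRESHOLD )
            v = 1 - v;

        return v;
    }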

This starts the distribution low, rises to about 0.86 when 50 terms are in the document, jumps to 1.0 at 51 terms, then asymptotically moves to zero from there on out based on sqrt.
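
As a quick check on that shape, here is a small standalone sketch. The class name and the peak() helper are hypothetical; proposedNorm() just repeats the lengthNorm() arithmetic above as two explicit branches, with the same THRESHOLD of 50.

```java
// Hypothetical harness for illustration; proposedNorm() mirrors the
// lengthNorm() arithmetic from the proposal, written as two branches.
class ProposedNormShape {

    static final int THRESHOLD = 50;

    static float proposedNorm(int numTokens) {
        if (numTokens <= THRESHOLD) {
            // Rising branch: 1 - 1/sqrt(n + 1).
            return 1 - (float) (1.0 / Math.sqrt(numTokens + 1));
        }
        // Falling branch: 1/sqrt(n - THRESHOLD).
        return (float) (1.0 / Math.sqrt(numTokens - THRESHOLD));
    }

    // Returns the token count in [1, maxTokens] with the largest norm.
    static int peak(int maxTokens) {
        int best = 1;
        for (int n = 2; n <= maxTokens; n++) {
            if (proposedNorm(n) > proposedNorm(best)) {
                best = n;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println("peak at " + peak(149) + " tokens");
        System.out.println("norm at threshold: " + proposedNorm(THRESHOLD));
        System.out.println("norm just past threshold: " + proposedNorm(THRESHOLD + 1));
    }
}
```

Writing the two branches out separately makes it easier to see that the rising side and the falling side do not meet at the threshold, so the curve has a step between 50 and 51 tokens.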

For example, values from 1 to 149 would yield (I'd graph this out but I'm too lazy):

1 - 0.29289323
2 - 0.42264974
3 - 0.5
4 - 0.5527864
5 - 0.5917517
6 - 0.6220355
7 - 0.6464466
8 - 0.6666666
9 - 0.6837722
10 - 0.69848865
11 - 0.7113249
12 - 0.72264993
13 - 0.73273873
14 - 0.74180114
15 - 0.75
16 - 0.7574644
17 - 0.7642977
18 - 0.7705843
19 - 0.7763932
20 - 0.7817821
21 - 0.7867993
22 - 0.7914856
23 - 0.79587585
24 - 0.8
25 - 0.80388385
26 - 0.8075499
27 - 0.81101775
28 - 0.81430465
29 - 0.81742585
30 - 0.8203947
31 - 0.8232233
32 - 0.82592237
33 - 0.8285014
34 - 0.83096915
35 - 0.8333333
36 - 0.83560103
37 - 0.83777857
38 - 0.8398719
39 - 0.8418861
40 - 0.84382623
41 - 0.8456966
42 - 0.8475014
43 - 0.84924436
44 - 0.8509288
45 - 0.852558
46 - 0.85413504
47 - 0.85566247
48 - 0.85714287
49 - 0.8585786
50 - 0.859972
51 - 1.0
52 - 0.70710677
53 - 0.57735026
54 - 0.5
55 - 0.4472136
56 - 0.4082483
57 - 0.37796447
58 - 0.35355338
59 - 0.33333334
60 - 0.31622776
61 - 0.30151135
62 - 0.28867513
63 - 0.2773501
64 - 0.26726124
65 - 0.2581989
66 - 0.25
67 - 0.24253562
68 - 0.23570226
69 - 0.22941573
70 - 0.2236068
71 - 0.2182179
72 - 0.21320072
73 - 0.2085144
74 - 0.20412415
75 - 0.2
76 - 0.19611613
77 - 0.19245009
78 - 0.18898223
79 - 0.18569534
80 - 0.18257418
81 - 0.1796053
82 - 0.17677669
83 - 0.17407766
84 - 0.17149858
85 - 0.16903085
86 - 0.16666667
87 - 0.16439898
88 - 0.16222142
89 - 0.16012815
90 - 0.15811388
91 - 0.15617377
92 - 0.15430336
93 - 0.15249857
94 - 0.15075567
95 - 0.1490712
96 - 0.14744195
97 - 0.145865
98 - 0.14433756
99 - 0.14285715
100 - 0.14142136
101 - 0.14002801
102 - 0.13867505
103 - 0.13736056
104 - 0.13608277
105 - 0.13483997
106 - 0.13363062
107 - 0.13245323
108 - 0.13130644
109 - 0.13018891
110 - 0.12909944
111 - 0.12803689
112 - 0.12700012
113 - 0.12598816
114 - 0.125
115 - 0.12403473
116 - 0.12309149
117 - 0.12216944
118 - 0.12126781
119 - 0.120385855
120 - 0.11952286
121 - 0.11867817
122 - 0.11785113
123 - 0.11704115
124 - 0.11624764
125 - 0.11547005
126 - 0.114707865
127 - 0.11396058
128 - 0.1132277
129 - 0.11250879
130 - 0.1118034
131 - 0.11111111
132 - 0.11043153
133 - 0.10976426
134 - 0.10910895
135 - 0.10846523
136 - 0.107832775
137 - 0.107211255
138 - 0.10660036
139 - 0.10599979
140 - 0.10540926
141 - 0.104828484
142 - 0.1042572
143 - 0.10369517
144 - 0.10314213
145 - 0.10259783
146 - 0.10206208
147 - 0.10153462
148 - 0.101015255
149 - 0.10050378


--

Kevin A. Burton, Location - San Francisco, CA
AIM/YIM - sfburtonator, Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412


