-----Original Message-----
From: Avi Gross <avigr...@verizon.net> 
Sent: Saturday, December 15, 2018 11:27 PM
To: 'Marc Lucke' <m...@marcsnet.com>
Subject: RE: clusters of numbers

Marc,

There are k-means implementations in Python, R and other places. Most uses 
would have two or more dimensions. You specify how many clusters to look for, 
and the algorithm starts from randomly chosen existing points, clusters things 
near those points, then near the centers of those clusters, and repeats until 
things stabilize.
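
As a rough sketch of that loop for 1-D data (not a tuned implementation): the 
function name kmeans_1d is made up for illustration, and it seeds with evenly 
spaced existing points rather than random ones so the result is repeatable. 
That seeding is an assumption of this sketch, not how the library versions 
behave.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on a list of numbers; returns (centers, clusters)."""
    data = sorted(values)
    # Assumption: seed with k evenly spaced existing points instead of
    # random ones, so that repeated runs give the same answer.
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # nothing moved, so things have stabilized
            break
        centers = new_centers
    return centers, clusters


# One of the toy examples from the original question.
centers, clusters = kmeans_1d([17, 22, 20, 45, 47, 51, 82, 84, 83], 3)
print(centers)
print(clusters)
```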

Your data is 1-D. Something simpler like a bar chart makes sense. But that may 
not show underlying patterns.

I am more familiar with doing graphics in R but you can see a tabular view of 
your data:

data
value:   1   2   3   5   6   7   8  10  11  12  14  15
count: 124 116  97  95  89  74  57  73  48  49  38  35

value:  16  17  19  20  21  23  24  25  26  29  35  43
count:  20  33  21  19  14   5   4   4   3   1   1   1
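
A comparable table can probably be had in Python with collections.Counter. 
This is only a sketch: the variable scores here holds just the first ten 
values of the posted list, so substitute the whole list to get the counts 
shown above.

```python
from collections import Counter

# First ten values of the posted list; replace with the full list.
scores = [1, 11, 1, 7, 5, 7, 2, 2, 2, 10]

counts = Counter(scores)          # maps each score to how often it occurs
for value in sorted(counts):      # print in ascending score order
    print(f"{value:>3} {counts[value]:>4}")
```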

There are clear gaps and a bar chart (which I cannot attach but could send in 
private email) does show clusters visibly.

But those may largely be an artifact of the missing info.

If you tell us more, we might be able to provide a better statistical answer. I 
assume you know how to get means and so on.

   vars    n mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 1021 7.82 6.01      6    7.12 5.93   1  43    42 1.04     1.23 0.19

Yes, the above is hard to read as I cannot use tables or a constant width font 
in this forum.
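
For the basic figures, the standard library's statistics module should be 
enough. The sketch below rebuilds the scores from the value/count table 
earlier in this message (the names freq and scores are just for illustration) 
and prints n, mean, sd and median for comparison with the summary above.

```python
import statistics

# value -> count, copied from the table earlier in this message
freq = {1: 124, 2: 116, 3: 97, 5: 95, 6: 89, 7: 74, 8: 57, 10: 73,
        11: 48, 12: 49, 14: 38, 15: 35, 16: 20, 17: 33, 19: 21, 20: 19,
        21: 14, 23: 5, 24: 4, 25: 4, 26: 3, 29: 1, 35: 1, 43: 1}

# Expand the table back into one entry per score.
scores = [value for value, n in freq.items() for _ in range(n)]

print(len(scores))                          # n
print(round(statistics.mean(scores), 2))    # mean
print(round(statistics.stdev(scores), 2))   # sample standard deviation
print(statistics.median(scores))            # median
```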

I ran a kmeans asking for 3 clusters. Centers:

1 16.512097
2  1.919881
3  7.433486

The three clusters had these scores in them (numbered to match the centers 
above):

Cluster 1: 12 14 15 16 17 19 20 21 23 24 25 26 29 35 43
Cluster 2: 1 2 3
Cluster 3: 5 6 7 8 10 11

If I run it asking for say 5 clusters:

Centers:

1  6.295238
2 11.432692
3  1.483333
4  3.000000
5 18.478261

And here are your five clusters, in the same order as the centers:

Cluster 1: 5 6 7 8
Cluster 2: 10 11 12 14
Cluster 3: 1 2
Cluster 4: 3
Cluster 5: 15 16 17 19 20 21 23 24 25 26 29 35 43

If you ran this for various numbers, you might see one that makes more sense to 
you.  Or, maybe not.

We could tell you what functions to use, but if you search using keywords like 
python (or another language) followed by k-means or kmeans you can find out 
what to install and use. In Python, you would need NumPy and probably SciPy, as 
well as scikit-learn (the sklearn module), which provides the KMeans class in 
sklearn.cluster. Note you can fine-tune the algorithm multiple ways or run it 
several times, as the results can depend on the initial guesses. And you may 
want to be able to make graphics showing the clusters, even though the data is 
1-D.
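
A minimal sketch of what that might look like with scikit-learn, assuming 
numpy and scikit-learn are installed. It uses one of the toy examples from the 
original question rather than the full list, and the choices n_clusters=3, 
n_init=10 and random_state=0 are arbitrary ones for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data from the original question; swap in the full score list.
scores = [17, 22, 20, 45, 47, 51, 82, 84, 83]
X = np.array(scores, dtype=float).reshape(-1, 1)  # KMeans expects a 2-D array

# n_init reruns the algorithm from several starting guesses and keeps the
# best result; random_state only makes the run repeatable.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

centers = sorted(model.cluster_centers_.ravel())
print(centers)
for label in range(3):
    # Scores assigned to this cluster label.
    print(label, sorted(X[model.labels_ == label].ravel()))
```

Cluster labels come out in arbitrary order, so the label numbers need not 
follow the size of the centers.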

Good luck.


-----Original Message-----
From: Python-list <python-list-bounces+avigross=verizon....@python.org> On 
Behalf Of Marc Lucke
Sent: Saturday, December 15, 2018 7:55 PM
To: python-list@python.org
Subject: clusters of numbers

hey guys,

I have a hobby project that sorts my email automatically for me & I want to 
improve it.  There's data science and statistical info that I'm missing, & I 
always enjoy reading about the pythonic way to do things too.

I have a list of percentage scores:

(1,11,1,7,5,7,2,2,2,10,10,1,2,2,1,7,2,1,7,5,3,8,2,6,3,2,7,2,12,3,1,2,19,3,5,1,1,7,8,8,1,5,6,7,3,14,6,1,6,7,6,15,6,3,7,2,6,23,2,7,1,21,21,8,8,3,2,20,1,3,12,3,1,2,10,16,16,15,6,5,3,2,2,11,1,14,6,3,7,1,5,3,3,14,3,7,3,5,8,3,6,17,1,1,7,3,1,2,6,1,7,7,12,6,6,2,1,6,3,6,2,1,5,1,8,10,2,6,1,7,3,5,7,7,5,7,2,5,1,19,19,1,12,5,10,2,19,1,3,19,6,1,5,11,2,1,2,5,2,5,8,2,2,2,5,3,1,21,2,3,7,10,1,8,1,3,17,17,1,5,3,10,14,1,2,14,14,1,15,6,3,2,17,17,1,1,1,2,2,3,3,2,2,7,7,2,1,2,8,2,20,3,2,3,12,7,6,5,12,2,3,11,3,1,1,8,16,10,1,6,6,6,11,1,6,5,2,5,11,1,2,10,6,14,6,3,3,5,2,6,17,15,1,2,2,17,5,3,3,5,8,1,6,3,14,3,2,1,7,2,8,11,5,14,3,19,1,3,7,3,3,8,8,6,1,3,1,14,14,10,3,2,1,12,2,3,1,2,2,6,6,7,10,10,12,24,1,21,21,5,11,12,12,2,1,19,8,6,2,1,1,19,10,6,2,15,15,7,10,14,12,14,5,11,7,12,2,1,14,10,7,10,3,17,25,10,5,5,3,12,5,2,14,5,8,1,11,5,29,2,7,20,12,14,1,10,6,17,16,6,7,11,12,3,1,23,11,10,11,5,10,6,2,17,15,20,5,10,1,17,3,7,15,5,11,6,19,14,15,7,1,2,17,8,15,10,26,6,1,2,10,6,14,12,6,1,16,6,12,10,10,14,1,6,1,6,6,12,6,6,1,2,5,10,8
 
,10,1,6,8,17,11,6,3,6,5,1,2,1,2,6,6,12,14,7,1,7,1,8,2,3,14,11,6,3,11,3,1,6,17,12,8,2,10,3,12,12,2,7,5,5,17,2,5,10,12,21,15,6,10,10,7,15,11,2,7,10,3,1,2,7,10,15,1,1,6,5,5,3,17,19,7,1,15,2,8,7,1,6,2,1,15,19,7,15,1,8,3,3,20,8,1,11,7,8,7,1,12,11,1,10,17,2,23,3,7,20,20,3,11,5,1,1,8,1,6,2,11,1,5,1,10,7,20,17,8,1,2,10,6,2,1,23,11,11,7,2,21,5,5,8,1,1,10,12,15,2,1,10,5,2,2,5,1,2,11,10,1,8,10,12,2,12,2,8,6,19,15,8,2,16,7,5,14,2,1,3,3,10,16,20,5,8,14,8,3,14,2,1,5,16,16,2,10,8,17,17,10,10,11,3,5,1,17,17,3,17,5,6,7,7,12,19,15,20,11,10,2,6,6,5,5,1,16,16,8,7,2,1,3,5,20,20,6,7,5,23,14,3,10,2,2,7,10,10,3,5,5,8,14,11,14,14,11,19,5,5,2,12,25,5,2,11,8,10,5,11,10,12,10,2,15,15,15,5,10,1,12,14,8,5,6,2,26,15,21,15,12,2,8,11,5,5,16,5,2,17,3,2,2,3,15,3,8,10,7,10,3,1,14,14,8,8,8,19,10,12,3,8,2,20,16,10,6,15,6,1,12,12,15,15,8,11,17,7,7,7,3,10,1,5,19,11,7,12,8,12,7,5,10,1,11,1,6,21,1,1,10,3,8,5,6,5,20,25,17,5,2,16,14,11,1,17,10,14,5,16,5,2,7,3,8,17,7,19,12,6,5,1,3,12,43,11,8,11,5,19,10,5,11,7,20,6,12,35,5,3,17
 
,10,2,12,6,5,21,24,15,5,10,3,15,1,12,6,3,17,3,2,3,5,5,14,11,8,1,8,10,5,25,8,7,2,6,3,11,1,11,7,3,10,7,12,10,8,6,1,1,17,3,1,1,2,19,6,10,2,2,7,5,16,3,2,11,10,7,10,21,3,5,2,21,3,14,6,7,2,24,3,17,3,21,8,5,11,17,5,6,10,5,20,1,12,2,3,20,6,11,12,14,6,6,1,14,15,12,15,6,20,7,7,19,3,7,5,16,12,6,7,2,10,3,2,11,8,6,6,5,1,11,1,15,21,14,6,3,2,2,5,6,1,3,5,3,6,20,1,15,12,2,3,3,7,1,16,5,24,10,7,1,12,16,8,26,16,15,10,19,11,6,6,5,6,5)

  & I'd like to know whether, & how, the numbers are clustered.  In an 
extreme & illustrative example, 1..10 would have zero clusters;
1,1,1,2,2,2,7,7,7 would have 3 clusters (around 1, 2 & 7);
17,22,20,45,47,51,82,84,83 would have 3 clusters (around 20, 47 & 83).  In my 
set, when I scan it, I intuitively figure there's lots of numbers close to 0 & 
a lot close to 20 (or thereabouts).

I saw info about k-clusters but I'm not sure if I'm going down the right path.  
I'm interested in k-clusters & will teach myself, but my priority is working 
out this problem.

Do you know the name of the algorithm I'm trying to use?  If so, are there 
python libraries like numpy that I can leverage?  I imagine that I could 
iterate from 0 to 100% using that as an artificial mean, discard values that 
are over a standard deviation away, and count the number of scores for that 
mean; then at the end of that I could set a threshold for which the artificial 
mean would be kept, something like this (no attempt at correct syntax):

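# scores is the list of percentage scores given above
means = {}
deviation = 5
threshold = int(0.25 * len(scores))
for i in range(101):              # candidate means, 0 to 100 percent
    count = 0
    for j in scores:
        # keep only scores within `deviation` of the candidate mean
        if abs(j - i) <= deviation:
            count += 1
    if count > threshold:         # enough scores nearby, so keep this mean
        means[i] = count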

That algorithm is entirely untested, but I think it could work; it's just that 
I don't want to reinvent the wheel.  Any ideas kindly appreciated.


-- 
https://mail.python.org/mailman/listinfo/python-list
