Hi Craig,

 

I’m really interested in those algorithms and will study them. But I would need
somebody to point me directly at a specific algorithm to look at. The main
problem with choosing the right one (which even my university teacher couldn’t
get past) is that you don’t know the number of clusters (a classification
problem) and you don’t know the number of objects in one cluster (which isn’t
such a big deal), but each cluster has a different number of objects – which is
a big issue because it effectively disqualifies all k-means-based algorithms
(correct me if I’m wrong).

 

We were thinking of some kind of Bayesian method or a likelihood-ratio
computation, but dismissed that after we found an implementation that ran for
~hours on ~5e5 records – I just don’t think such an approach would give better
results than a faster algorithm such as the simple linear one (checking each
row and updating/inserting it into the catalog).
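
For reference, the “simple linear” approach I mean looks roughly like this
(a sketch only – the tables obs(ra, dec) and catalog(id, ra, dec, n) are made
up for illustration, a Q3C index on catalog(ra, dec) is assumed, and RA
wrap-around is ignored):

    DO $$
    DECLARE
      r   RECORD;
      cid bigint;
    BEGIN
      FOR r IN SELECT ra, dec FROM obs LOOP
        -- find the nearest existing identifier within 1 arcsec (1/3600 deg)
        SELECT c.id INTO cid
          FROM catalog c
         WHERE q3c_join(r.ra, r.dec, c.ra, c.dec, 1.0/3600)
         ORDER BY q3c_dist(r.ra, r.dec, c.ra, c.dec)
         LIMIT 1;
        IF cid IS NULL THEN
          -- no identifier close enough: insert a new one
          INSERT INTO catalog(ra, dec, n) VALUES (r.ra, r.dec, 1);
        ELSE
          -- update the identifier's running mean position
          UPDATE catalog
             SET ra  = (ra * n + r.ra) / (n + 1),
                 dec = (dec * n + r.dec) / (n + 1),
                 n   = n + 1
           WHERE id = cid;
        END IF;
      END LOOP;
    END $$;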

 

I actually left the algorithm running on a smaller set of data (~80 million
rows) and on 8 cores (about 4 of them in use at a time) it ran for 48 hours.
That’s not that bad assuming it has to run only once per dataset (at DB
population time) – later additions will then be processed quickly. The result
is about 1.2 million identifiers, and 50 thousand of them are closer to each
other than 1 arcsec (collisions) – but that is only a <5% error. I think if we
worked a bit on this algorithm we could make it faster and reduce that error
percentage to maybe 1%, which would be acceptable? What do you think?
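
Counting such collisions boils down to a self-join roughly like this (sketch
only, table and column names made up – not necessarily how we computed it):

    -- identifiers that lie within 1 arcsec of some other identifier
    SELECT count(DISTINCT a.id)
      FROM catalog a
      JOIN catalog b
        ON a.id <> b.id
       AND q3c_join(a.ra, a.dec, b.ra, b.dec, 1.0/3600);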

 

Thank you very much for your effort.

 

Kind Regards,

 

Jiri Nadvornik

 

From: Craig James [mailto:cja...@emolecules.com] 
Sent: Sunday, July 27, 2014 5:35 PM
To: Jiří Nádvorník
Cc: Vitalii Tymchyshyn; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Cursor + upsert (astronomical data)

 

Jiri,

 

If you haven't looked at clustering algorithms yet, you might want to do so. 
Your problem is a special case of clustering, where you have a large number of 
small clusters.  A good place to start is the overview on Wikipedia: 
http://en.wikipedia.org/wiki/Cluster_analysis

 

A lot of people have worked extensively on this problem, and you might find a 
good solution, or at least some ideas to guide your own algorithm. In my field 
(chemistry), researchers often need to cluster 10^6 to 10^7 chemical compounds, 
and a great deal of research has gone into efficient ways to do so.

 

Craig

 

 

On Sun, Jul 27, 2014 at 7:35 AM, Jiří Nádvorník <nadvornik...@gmail.com> wrote:

Hi Vitalii, thank you for your reply.

 

The problem you suggest can, in the most pathological case, have these
observations lying on one line. As you described it, B would be in the middle.
So A and C are not within 1 arcsec of each other, but they must both be within
1 arcsec of their common average coordinates. If the distances between A, B and
C are 1 arcsec each, the right solution is to pick B as the reference
identifier and assign A and C to it.

 

We already tried the approach you suggest by applying a grid based on the Q3C
indexing of the database. We were not simply rounding the coordinates, but
using the center of the Q3C “square” in which the observation falls. The result
was poor, however – 22% of the identifiers were closer to each other than 1
arcsec. That means that when you crossmatch the original observations against
them, you don’t know which one to use and you get duplicates. The reason for
this is that nearly all of the observations are from the SMC (a high density of
observations), which means you get more than 2 “rounded” positions in a row and
don’t know which ones to join together (to compute average coordinates from).
If that is not clear enough I can draw it on an image for you.

Maybe the simple round-up would give better results, because the Q3C squares
are not all the same size and you can only scale them by a factor of 2 (a
2-times smaller or larger square). We used a square with a side of about 0.76
arcsec, which approximately covers the 1 arcsec radius circle.
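
To be concrete, the plain rounding variant would be something along these lines
(sketch only; the observations table is made up, 1/2 arcsec = 1/7200 degree,
and the fact that an RA step corresponds to a smaller angle on the sky at high
declination is ignored):

    SELECT round(ra  * 7200) AS ra_cell,
           round(dec * 7200) AS dec_cell,
           avg(ra)           AS mean_ra,
           avg(dec)          AS mean_dec,
           count(*)          AS n_obs
      FROM observations
     GROUP BY 1, 2;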

 

Oh, and one more important thing. The difficulty of our data is not that it is
3e8 rows, but that in the highest-density region about 1000 images overlap.
That kills you when you try to self-join the observations to find neighbours
for each of them – the complexity is quadratic in the overlap on an image (e.g.
10000 observations on one image with another 999 images overlapping it means on
the order of 10000 * 1000^2 comparisons).
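
The self-join I mean is essentially this (sketch, made-up names); in a field
where ~1000 images overlap, almost every observation there matches hundreds of
others:

    SELECT a.obs_id, b.obs_id
      FROM observations a
      JOIN observations b
        ON a.obs_id < b.obs_id
       AND q3c_join(a.ra, a.dec, b.ra, b.dec, 1.0/3600);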

 

Best regards,

 

Jiri Nadvornik

 

From: tiv...@gmail.com [mailto:tiv...@gmail.com] On Behalf Of Vitalii Tymchyshyn
Sent: Sunday, July 27, 2014 8:06 AM
To: Jiří Nádvorník
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Cursor + upsert (astronomical data)

 

I am not sure I understand the problem fully, e.g. what to do if there are
observations A, B and C with A-to-B and B-to-C less than the threshold and
A-to-C over the threshold, but anyway.

Could you first apply a kind of grid to your observations? What I mean is to
round your coordinates to, say, 1/2 arcsec on each axis and group the results.
I think you will have most observations grouped this way and can then use your
regular algorithm to combine the results.

Best regards, Vitalii Tymchyshyn





 

-- 

---------------------------------
Craig A. James

Chief Technology Officer

eMolecules, Inc.

---------------------------------
