Well, that's why I said to apply the regular algorithm to deduplicate after this step. Basically, what I expect is a first pass with GROUP BY that requires no joins and produces a "dirty" set of identifiers.
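Purely as an illustration of what I mean, here is a minimal sketch of such a join-free first pass. The table and column names (observations, ra, decl) are made up, the ~0.76 arcsec cell size is the one you mention below, and RA wrap-around plus the cos(dec) scaling are ignored to keep it short:

    -- Join-free first pass: snap each observation to a ~0.76 arcsec grid
    -- and use the average position of each cell as a "dirty" identifier.
    -- Names are illustrative; adjust to your schema.
    CREATE TABLE dirty_identifiers AS
    SELECT round(ra   * 3600 / 0.76) AS cell_ra,   -- grid cell index along RA
           round(decl * 3600 / 0.76) AS cell_decl, -- grid cell index along DEC
           avg(ra)   AS ident_ra,                  -- averaged cell position
           avg(decl) AS ident_decl,
           count(*)  AS n_obs
    FROM observations
    GROUP BY 1, 2;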
This first pass should do two things: 1) provide a working set of dirty identifiers with a far lower cardinality than the original observation set; 2) leave most of the identifiers usable as-is, so that only a small fraction needs an additional merge (a rough sketch of that check is at the end of this message). 22% is actually a very good number: it means only about 1/5 of the identifiers have to be analyzed for merging.

Best regards,
Vitalii Tymchyshyn

On Jul 27, 2014 10:35, "Jiří Nádvorník" <nadvornik...@gmail.com> wrote:

> Hi Vitalii, thank you for your reply.
>
> In the most pathological case of the problem you describe, the observations
> lie on one line. As you suggested, B would be in the middle: A and C are not
> within 1 arcsec of each other, but they must each be within 1 arcsec of their
> common average coordinates. If the distances between A, B, and C are 1 arcsec
> each, the right solution is to pick B as the reference identifier and assign
> A and C to it.
>
> We already tried the approach you suggest, applying a grid based on the Q3C
> indexing of the database. We were not just rounding the coordinates, but
> using the center of the Q3C "square" in which the observation took place.
> The result was poor, however: 22% of the identifiers were closer to each
> other than 1 arcsec. That means that when you crossmatch the original
> observations against them, you don't know which identifier to use and you
> get duplicates. The reason is that nearly all of the observations are from
> the SMC (a very high density of observations), so you get more than two
> "rounded" positions in a row and don't know which ones to join together
> (i.e., which ones to compute average coordinates from). If this is not clear
> enough, I can draw it in an image for you.
>
> Maybe simple rounding would give better results, because the Q3C squares are
> not all the same size and you can only scale them by a factor of 2 (a square
> 2 times smaller or larger). We used a square with a side of about 0.76
> arcsec, which approximately covers the 1 arcsec radius circle.
>
> Oh, and one more important thing. The difficulty of our data is not that it
> is 3e8 rows, but that in the highest-density region about 1000 images
> overlap. That kills you when you try to self-join the observations to find
> the neighbours of each of them: the complexity is quadratic in the overlap
> on an image (e.g. 10000 observations on one image with another 999 images
> overlapping it means 10000 * 1000^2).
>
> Best regards,
>
> Jiri Nadvornik
>
> From: tiv...@gmail.com [mailto:tiv...@gmail.com] On Behalf Of Vitalii Tymchyshyn
> Sent: Sunday, July 27, 2014 8:06 AM
> To: Jiří Nádvorník
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Cursor + upsert (astronomical data)
>
> I am not sure I understand the problem fully, e.g. what to do if there are
> observations A, B and C with A-to-B and B-to-C below the threshold but
> A-to-C above the threshold, but anyway.
>
> Could you first apply a kind of grid to your observations? What I mean is to
> round your coords to, say, 1/2 arcsec on each axis and group the results. I
> think you will have most observations grouped this way, and you can then use
> your regular algorithm to combine the results.
>
> Best regards, Vitalii Tymchyshyn
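P.S. Here is also a rough sketch of the check I mean in point 2, reusing the made-up schema from the sketch above. It relies on the Q3C extension you already use; q3c_join takes its radius in degrees and benefits from an index on q3c_ang2ipix of the identifier coordinates:

    -- Second pass over the much smaller identifier set: return only the
    -- identifiers that have another identifier within 1 arcsec (your ~22%).
    -- Only these cells need the full merging algorithm.
    SELECT DISTINCT a.cell_ra, a.cell_decl
    FROM dirty_identifiers a
    JOIN dirty_identifiers b
      ON (a.cell_ra, a.cell_decl) <> (b.cell_ra, b.cell_decl)
     AND q3c_join(a.ident_ra, a.ident_decl,
                  b.ident_ra, b.ident_decl,
                  1.0 / 3600);  -- 1 arcsec expressed in degrees

Everything this query does not return can be taken as a final identifier directly; only the returned cells go through the regular merging algorithm.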