Well, that's why I said to apply the regular algorithm to deduplicate after this step. Basically, what I expect is a first pass with GROUP BY that requires no joins and produces a "dirty" set of identifiers.
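Purely as an illustration of what I mean, here is a minimal sketch of such a join-free first pass. The table and column names (observations, ra, decl) are made up, the ~0.76 arcsec cell size is the one you mention below, and RA wrap-around plus the cos(dec) scaling are ignored to keep it short:

    -- Join-free first pass: snap each observation to a ~0.76 arcsec grid
    -- and use the average position of each cell as a "dirty" identifier.
    -- Names are illustrative; adjust to your schema.
    CREATE TABLE dirty_identifiers AS
    SELECT round(ra   * 3600 / 0.76) AS cell_ra,   -- grid cell index along RA
           round(decl * 3600 / 0.76) AS cell_decl, -- grid cell index along DEC
           avg(ra)   AS ident_ra,                  -- averaged cell position
           avg(decl) AS ident_decl,
           count(*)  AS n_obs
    FROM observations
    GROUP BY 1, 2;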
This first pass should do two things: 1) provide a working set of dirty identifiers with a far lower cardinality than the original observation set; 2) leave most of the identifiers usable as-is, so that only a small fraction needs an additional merge (a rough sketch of that check is at the end of this message). 22% is actually a very good number: it means only about 1/5 of the identifiers have to be analyzed for merging.

Best regards,
Vitalii Tymchyshyn

On Jul 27, 2014 10:35, "Jiří Nádvorník" <nadvornik...@gmail.com> wrote:

> Hi Vitalii, thank you for your reply.
>
> In the most pathological case of the problem you describe, the observations
> lie on one line. As you suggested, B would be in the middle: A and C are not
> within 1 arcsec of each other, but they must each be within 1 arcsec of their
> common average coordinates. If the distances between A, B, and C are 1 arcsec
> each, the right solution is to pick B as the reference identifier and assign
> A and C to it.
>
> We already tried the approach you suggest, applying a grid based on the Q3C
> indexing of the database. We were not just rounding the coordinates, but
> using the center of the Q3C "square" in which the observation took place.
> The result was poor, however: 22% of the identifiers were closer to each
> other than 1 arcsec. That means that when you crossmatch the original
> observations against them, you don't know which identifier to use and you
> get duplicates. The reason is that nearly all of the observations are from
> the SMC (a very high density of observations), so you get more than two
> "rounded" positions in a row and don't know which ones to join together
> (i.e., which ones to compute average coordinates from). If this is not clear
> enough, I can draw it in an image for you.
>
> Maybe simple rounding would give better results, because the Q3C squares are
> not all the same size and you can only scale them by a factor of 2 (a square
> 2 times smaller or larger). We used a square with a side of about 0.76
> arcsec, which approximately covers the 1 arcsec radius circle.
>
> Oh, and one more important thing. The difficulty of our data is not that it
> is 3e8 rows, but that in the highest-density region about 1000 images
> overlap. That kills you when you try to self-join the observations to find
> the neighbours of each of them: the complexity is quadratic in the overlap
> on an image (e.g. 10000 observations on one image with another 999 images
> overlapping it means 10000 * 1000^2).
>
> Best regards,
>
> Jiri Nadvornik
>
> From: tiv...@gmail.com [mailto:tiv...@gmail.com] On Behalf Of Vitalii Tymchyshyn
> Sent: Sunday, July 27, 2014 8:06 AM
> To: Jiří Nádvorník
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Cursor + upsert (astronomical data)
>
> I am not sure I understand the problem fully, e.g. what to do if there are
> observations A, B and C with A-to-B and B-to-C below the threshold but
> A-to-C above the threshold, but anyway.
>
> Could you first apply a kind of grid to your observations? What I mean is to
> round your coords to, say, 1/2 arcsec on each axis and group the results. I
> think you will have most observations grouped this way, and you can then use
> your regular algorithm to combine the results.
>
> Best regards, Vitalii Tymchyshyn
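P.S. Here is also a rough sketch of the check I mean in point 2, reusing the made-up schema from the sketch above. It relies on the Q3C extension you already use; q3c_join takes its radius in degrees and benefits from an index on q3c_ang2ipix of the identifier coordinates:

    -- Second pass over the much smaller identifier set: return only the
    -- identifiers that have another identifier within 1 arcsec (your ~22%).
    -- Only these cells need the full merging algorithm.
    SELECT DISTINCT a.cell_ra, a.cell_decl
    FROM dirty_identifiers a
    JOIN dirty_identifiers b
      ON (a.cell_ra, a.cell_decl) <> (b.cell_ra, b.cell_decl)
     AND q3c_join(a.ident_ra, a.ident_decl,
                  b.ident_ra, b.ident_decl,
                  1.0 / 3600);  -- 1 arcsec expressed in degrees

Everything this query does not return can be taken as a final identifier directly; only the returned cells go through the regular merging algorithm.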