Re: [DISCUSS] CRS-Constrained Indexing for Geometry in AsterixDB

Mike Carey Sun, 07 Sep 2025 08:02:26 -0700

Sounds like a valuable project! We should look carefully at the detailed syntax for providing the CRS info when the time comes, but I like the direction. Also, we should figure out if/how this may interact with our new heterogeneous indexes (as the goal for them is to mostly replace the older form of indexes that require type knowledge).


On 9/4/25 12:11 AM, Suryaa Charan Shivakumar wrote:

Hello all,

I’d like to start a discussion on how we might support CRS-aware indexing
for *geometry types* in AsterixDB as part of our geospatial work.

*Current Situation*

- Today we only have the geometry type (AGEOMETRY), with no CRS
constraint.
- This makes it flexible, but correctness and performance issues arise
when mixed-CRS data is indexed and eventually queried.

One idea is to introduce CRS-constrained geometry types, so that a field can
explicitly declare its CRS:

```
CREATE TYPE LocationType AS {
    id: int,                           -- primary key
    coordinates: geometry(EPSG:4326),  -- CRS-constrained geometry
    mixed_coordinates: geometry,       -- Unconstrained
    address: string
};

CREATE DATASET Locations(LocationType) PRIMARY KEY id;

CREATE INDEX geo_idx ON Locations(coordinates) TYPE RTREE;  -- CRS auto-inferred
```

This separation allows users to decide:

- *Constrained geometry*: safe for indexing and cross-column operations.
- *Unconstrained geometry*: flexible for heterogeneous CRS use cases.

*Benefits*

- *Clarity*: Schema explicitly encodes CRS.
- *Optionality*: Mixed CRS still possible where needed.
- *Efficiency*: Query planner can reason about CRS at compile time (no
runtime lookups).
- *Safety*: Cross-column CRS compatibility can be validated early.
- *Distributed-friendly*: CRS info can travel with type metadata;
workers validate independently.
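
To make the compile-time point concrete, here is a minimal Python sketch of the kind of check a planner could run if CRS were part of the type metadata. The names `GeometryField` and `check_compatible`, and the three outcomes, are hypothetical and only illustrate the idea; they are not AsterixDB APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GeometryField:
    """A geometry field as it might appear in type metadata (hypothetical)."""
    name: str
    srid: Optional[int] = None  # None models the unconstrained `geometry`


def check_compatible(a: GeometryField, b: GeometryField) -> str:
    """Classify a cross-column spatial operation using schema information only."""
    if a.srid is None or b.srid is None:
        return "runtime-check"  # CRS is only known per record
    if a.srid == b.srid:
        return "ok"             # safe to plan without per-record lookups
    return "reject"             # mismatch detected before any data is read


coordinates = GeometryField("coordinates", 4326)
mixed = GeometryField("mixed_coordinates")
projected = GeometryField("projected", 3857)  # hypothetical extra field

print(check_compatible(coordinates, coordinates))
print(check_compatible(coordinates, projected))
print(check_compatible(coordinates, mixed))
```

Only the unconstrained case falls back to per-record work, which is where the "no runtime lookups" benefit would come from.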

Flow diagram:
https://www.mermaidchart.com/play?utm_source=mermaid_live_editor&utm_medium=toggle#pako:eNp9Uttu2kAQ_ZWRpUjpAz_Qh1YBQ5OUWwCpqkwetusx3sa7a-2uRSjk3zsenMWoVfxmzxmf2xwTaXNMPidFZfeyFC7AJt0aoOfumIogYC3RCKfs17etOQ9ubmDohJEl-g4Jg8EXGGbbZI57yGnLY4ABjFZrkNb44IQymEM41LhNnvtLj7Q0flU-KLPrbVa4E_JwWYjELQHL-mHdS6v5PPHNr50TddnO23H2D65jBRgy74h4UyxIFpPAXoUSdmg1BneAW1L-KSoFGPFOmj0Yj5SQdVBZkbPeiEkZMz6mFj071yLIEnKUlXBknj61EZ7B4xZ8-on-BJPsTkqsAwiTgw_WYdTxfA2f2xN8I90r_I2SVVCwxhfWabi1dVDWiKove8KS7rORQxEQfC2CEhUok-NrBN0z6IG80VealehUODsonNXgqWd9sfnA8O_Zgvi0-oMOGt8ZVqawIAJVrmtVUbBKY7eHJu-VGAv_qMkpnwCX-X98lPTIkqbZqET5AqrgXkB5EgSWCiZtETtl7Ox82o1RHN5VNbOY9bzNujGw3sAmBh0slFbbHRoy38t6dml0kS2dlUid81W9s1xdy5x1LN5fF_y6_LipJYOeuqaE942m5HsmIvKJkats0oSG7knx2XrQjQ_dWfKfeztUUPL2F7cRRQM

This approach departs from the loose typing preferred in the world of
semi-structured data. It does, however, give predictable performance, valid
query results, and a complete index. Other ideas include the following:

- Users should be warned about *lossy transformations* if we enforce
converting everything to a single projection (e.g., 4326).
- Another option might be to support *multiple indexes* on the same
dataset, each tied to a different CRS (with a fixed practical limit, say
10), extending R-Tree physically to handle multiple CRS domains.
- Treat CRS almost like a *type domain*, similar to heterogeneous indexing,
rather than a hard constraint at the type level.
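
As a rough illustration of the multiple-indexes idea, the Python sketch below routes each geometry to a per-CRS structure and caps the number of CRS domains. It is a toy under stated assumptions: `PerCrsIndexes` is a hypothetical name, plain lists of bounding boxes stand in for real R-Trees, and the limit of 10 is just the example figure mentioned above.

```python
MAX_CRS_PER_DATASET = 10  # illustrative cap, taken from the example figure above


class PerCrsIndexes:
    """Toy stand-in: one list of bounding boxes per SRID instead of real R-Trees."""

    def __init__(self, limit=MAX_CRS_PER_DATASET):
        self.limit = limit
        self.by_srid = {}

    def insert(self, srid, mbr):
        """Add a box (xmin, ymin, xmax, ymax) to the index for its CRS."""
        if srid not in self.by_srid:
            if len(self.by_srid) >= self.limit:
                raise ValueError(f"too many CRS domains (limit {self.limit})")
            self.by_srid[srid] = []
        self.by_srid[srid].append(mbr)

    def candidates(self, srid, query):
        """Return stored boxes in the same CRS that intersect the query box."""
        qx0, qy0, qx1, qy1 = query
        return [m for m in self.by_srid.get(srid, [])
                if m[0] <= qx1 and qx0 <= m[2] and m[1] <= qy1 and qy0 <= m[3]]


idx = PerCrsIndexes()
idx.insert(4326, (-118.5, 33.9, -118.1, 34.2))
idx.insert(3857, (-13191000.0, 4015000.0, -13147000.0, 4055000.0))
print(idx.candidates(4326, (-118.3, 34.0, -118.2, 34.1)))
```

A query in one CRS only ever sees entries from the same domain, so comparisons stay meaningful, at the cost of maintaining one structure per CRS.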

*Questions for the community:*

- Should we enforce CRS constraints strictly at the type level, or
consider index-level CRS flexibility (multiple CRSs per index)?
- How should we handle schema evolution and legacy datasets without CRS
metadata?
- Are there better approaches to balance correctness, performance, and
flexibility?

Looking forward to everyone’s thoughts.

Additional context below:

*What is a CRS?*

A *Coordinate Reference System (CRS)* defines how numbers in a geometry
(like (x,y) pairs) map to real-world locations.

- For example, in EPSG:4326 (WGS84), coordinates are in *longitude/latitude
degrees*.
- In EPSG:3857 (Web Mercator), the same numbers represent *meters on a
projected plane*.

Without CRS information, two geometries may use different measurement
systems, and a “distance” or “intersection” operation between them would be
meaningless.
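
To show how different the numbers are, the Python sketch below projects one longitude/latitude pair into Web Mercator using the standard spherical forward formula. The function name and the sample point are illustrative choices; a real system would go through a CRS library rather than hand-written math.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, used by spherical Web Mercator


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Forward spherical Mercator: degrees in, meters out."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


lon, lat = -118.25, 34.05  # sample point, in degrees
x, y = lonlat_to_web_mercator(lon, lat)
print(f"EPSG:4326: ({lon}, {lat}) degrees")
print(f"EPSG:3857: ({x:.1f}, {y:.1f}) meters")
```

The same location is a pair of values around a hundred in one CRS and a pair of values in the millions in the other, so raw coordinates from the two systems cannot be compared directly.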

*How AsterixDB Handles CRS Today (as per current patchset in review and
APE)*

- We have a single type geometry with *no schema-level CRS constraint*.
- Each geometry object in WKB (Well-Known Binary) carries only a *reference
ID* (e.g., 4326), not the full definition.
- CRS definitions themselves (EPSG codes, PROJ strings, etc.) are loaded
into memory (Apache Derby) with an API call.
- When a user calls ST_Transform, we look up the definition using the
stored ID and perform transformations with *Apache SIS*.
- This design means CRS enforcement is *runtime-only*: validation or
transformation happens at query execution, not at schema declaration.
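
The runtime-only flow can be sketched in Python as follows. This assumes, purely for illustration, that the reference ID is stored the way PostGIS EWKB stores it (an SRID flag in the geometry type word followed by a 4-byte SRID); the actual layout in the patchset may differ. `CRS_REGISTRY` is a hypothetical stand-in for the loaded CRS definitions.

```python
import struct

EWKB_SRID_FLAG = 0x20000000  # PostGIS EWKB convention; an assumption here

# Hypothetical in-memory registry standing in for the loaded CRS definitions.
CRS_REGISTRY = {
    4326: "WGS 84 (degrees)",
    3857: "WGS 84 / Pseudo-Mercator (meters)",
}


def make_point_ewkb(srid, x, y):
    """Encode a little-endian EWKB point: order byte, type word, SRID, x, y."""
    return struct.pack("<BIIdd", 1, 1 | EWKB_SRID_FLAG, srid, x, y)


def read_srid(blob):
    """Return the SRID embedded in the header, or None if the flag is not set."""
    endian = "<" if blob[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(endian + "I", blob, 1)
    if not geom_type & EWKB_SRID_FLAG:
        return None
    (srid,) = struct.unpack_from(endian + "I", blob, 5)
    return srid


def resolve_crs(blob):
    """Runtime lookup: a bad or missing ID is only noticed when a record is read."""
    srid = read_srid(blob)
    if srid not in CRS_REGISTRY:
        raise LookupError(f"unknown or missing SRID: {srid}")
    return CRS_REGISTRY[srid]


record = make_point_ewkb(4326, -118.25, 34.05)
print(read_srid(record), "->", resolve_crs(record))
```

Nothing in the schema prevents a record with a different or unknown ID from being stored, which is why problems only surface during query execution.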

*Why Indexes Must Be in the Same CRS*

Spatial indexes (R-Trees) compare
bounding boxes of geometries. If geometries use different CRSs:

- One geometry might be in *degrees*, another in *meters*, another in
*feet*.
- Mixing them breaks the index, since it assumes all values are in a
single coordinate system.

*Analogy:* It’s like cataloging items in a warehouse. If half the
measurements are in inches and half in centimeters, the index will be
inconsistent: a “short” object in inches could look bigger than a “long”
object in centimeters.
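
The effect on bounding-box comparisons can be reproduced with a few lines of Python. The sketch below is illustrative only: the boxes are made-up sample values, the projection is the spherical Web Mercator formula, and `mbr_intersects` is a plain rectangle overlap test rather than an actual R-Tree.

```python
import math

R = 6378137.0  # WGS84 semi-major axis


def to_3857(lon, lat):
    """Spherical Web Mercator forward projection (degrees to meters)."""
    return (R * math.radians(lon),
            R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2)))


def mbr_to_3857(mbr):
    """Project a lon/lat box by its corners; valid because x depends only on
    longitude and y only on latitude, and both mappings are monotonic."""
    x0, y0 = to_3857(mbr[0], mbr[1])
    x1, y1 = to_3857(mbr[2], mbr[3])
    return (x0, y0, x1, y1)


def mbr_intersects(a, b):
    """Overlap test for boxes given as (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


stored_4326 = (-118.5, 33.9, -118.1, 34.2)  # sample stored geometry box, degrees
query_4326 = (-118.3, 34.0, -118.2, 34.1)   # sample query box inside it, degrees
stored_3857 = mbr_to_3857(stored_4326)      # same region, but in meters

print(mbr_intersects(stored_4326, query_4326))               # both in degrees
print(mbr_intersects(stored_3857, query_4326))               # meters vs degrees
print(mbr_intersects(stored_3857, mbr_to_3857(query_4326)))  # both in meters
```

The query box lies inside the stored region, yet the mixed comparison misses it, so an index holding entries from more than one CRS would silently drop valid results.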

Thank you,
Suryaa
