[ 
https://issues.apache.org/jira/browse/IMPALA-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-13851:
-------------------------------------
    Description: 
It is a common (and currently the most efficient) way to store point data as 
double x/double y pairs instead of a single geometry (BINARY) column.

For predicates where these points must intersect a complex geometry it can be a 
useful optimization to prefilter x/y columns directly with the bounding rect of 
the complex geometry in addittiton to running the st_ predicate. An example:

{code}
WHERE st_contains(<const_geometry>, st_point(x,y)) 
{code}
can be rewritten as:
{code}
WHERE st_contains(<const_geometry>, st_point(x,y))
AND x>=st_xmin(<const_geometry>) AND y>=st_ymin(<const_geometry>)
AND x<=st_xmax(<const_geometry>) AND y<=st_ymax(<const_geometry>) 
{code}

This has two benefits:
1. the planner will move the >= <= predicates before st_contains which allows 
avoiding the expensive per row st_contains() call for points that failed the 
bounding box check  
2. >=/<= predicates can be pushed down in more cases, for example for Parquet 
min/max stat filtering or Iceberg min/max stat filtering - if the files/pages 
have limited bounding boxes then this can save IO.

An expression rewrite rule can be added that does this automatically.
An issue that blocks this (and also makes manual rewrites harder) is 
IMPALA-10349 - currently constant folding only works for STRING/BINARY with 
ascii characters, and the result of function like st_polygon is very likely to 
contain non-ascii character, so x>=st_xmin(<const_geometry>)  can't be 
converted to x >= <const_double> at compile time.

  was:
It is a common (and currently the most efficient) way to store point data as 
double x/double y pairs instead of a single geometry (BINARY) column.

For predicates where these points must intersect a complex geometry it can be a 
useful optimization to prefilter x/y columns directly with the bounding rect of 
the complex geometry in addittiton to running the st_ predicate. An example:

{code}
WHERE st_contains(<const_geometry>, st_point(x,y)) 
{code}
can be rewritten as:
{code}
WHERE st_contains(<const_geometry>, st_point(x,y))
AND x>=st_xmin(<const_geometry>) AND y>=st_ymin(<const_geometry>)
AND x<=st_xmax(<const_geometry>)  AND  y<=st_ymax(<const_geometry>) 
{code}

This has two benefits:
1. the planner will move the >= <= predicates before st_contains which allows 
avoiding the expensive per row st_contains() call for points that failed the 
bounding box check  
2. >=/<= predicates can be pushed down in more cases, for example for Parquet 
min/max stat filtering or Iceberg min/max stat filtering - if the files/pages 
have limited bounding boxes then this can save IO.

An expression rewrite rule can be added that does this automatically.
An issue that blocks this (and also makes manual rewrites harder) is 
IMPALA-10349 - currently constant folding only works for STRING/BINARY with 
ascii characters, and the result of function like st_polygon is very likely to 
contain non-ascii character, so x>=st_xmin(<const_geometry>)  can't be 
converted to x >= <const_double> at compile time.


> Add geospatial expression rewrites for lat/lon coded points
> -----------------------------------------------------------
>
>                 Key: IMPALA-13851
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13851
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Csaba Ringhofer
>            Priority: Major
>              Labels: geospatial
>
> It is a common (and currently the most efficient) way to store point data as 
> double x/double y pairs instead of a single geometry (BINARY) column.
> For predicates where these points must intersect a complex geometry it can be 
> a useful optimization to prefilter x/y columns directly with the bounding 
> rect of the complex geometry in addittiton to running the st_ predicate. An 
> example:
> {code}
> WHERE st_contains(<const_geometry>, st_point(x,y)) 
> {code}
> can be rewritten as:
> {code}
> WHERE st_contains(<const_geometry>, st_point(x,y))
> AND x>=st_xmin(<const_geometry>) AND y>=st_ymin(<const_geometry>)
> AND x<=st_xmax(<const_geometry>) AND y<=st_ymax(<const_geometry>) 
> {code}
> This has two benefits:
> 1. the planner will move the >= <= predicates before st_contains which allows 
> avoiding the expensive per row st_contains() call for points that failed the 
> bounding box check  
> 2. >=/<= predicates can be pushed down in more cases, for example for Parquet 
> min/max stat filtering or Iceberg min/max stat filtering - if the files/pages 
> have limited bounding boxes then this can save IO.
> An expression rewrite rule can be added that does this automatically.
> An issue that blocks this (and also makes manual rewrites harder) is 
> IMPALA-10349 - currently constant folding only works for STRING/BINARY with 
> ascii characters, and the result of function like st_polygon is very likely 
> to contain non-ascii character, so x>=st_xmin(<const_geometry>)  can't be 
> converted to x >= <const_double> at compile time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to