[
https://issues.apache.org/jira/browse/IMPALA-14570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Csaba Ringhofer resolved IMPALA-14570.
--------------------------------------
Resolution: Fixed
> Speed up simple geospatial queries
> ----------------------------------
>
> Key: IMPALA-14570
> URL: https://issues.apache.org/jira/browse/IMPALA-14570
> Project: IMPALA
> Issue Type: Epic
> Components: Backend, Frontend
> Reporter: Csaba Ringhofer
> Assignee: Csaba Ringhofer
> Priority: Critical
>
> IMPALA-11745 added many geospatial functions by importing them from Hive.
> The current solution is very slow even for simple queries for several reasons:
> 1. the functions are in Java and each call has native->JVM overhead
> 2. most functions need to serialize/deserialize the BINARY format on each call
> 3. the internal BINARY format "ESRI shape" is not what files usually contain,
> so a per-row conversion is needed before any filter can be applied
> 4. predicates like st_contains are always computed per row; no bounding
> rectangle check is done at the file level, although Parquet/GeoParquet/Iceberg
> offer ways to do this
> There are no geospatial benchmarks for Impala yet, but my assumption is that
> many queries have selective scans and the most performance-critical part is
> the initial filtering of rows, so the focus is on queries like:
> SELECT * FROM t WHERE st_intersects(geom_col, <const_geom>)
> Geospatial joins are also very inefficient at the moment (nested loop join),
> but they are out of scope for this epic.
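
Point 4 of the description (a file-level bounding rectangle check) can be illustrated with a minimal sketch. This is not Impala code and not necessarily how the epic was implemented. The BBox and FileStats types, the prune_files function, and the file names are all hypothetical, and the sketch assumes each file's metadata exposes the bounding rectangle of the geometries it contains. A file whose rectangle is disjoint from the rectangle of the constant geometry cannot contain a row that satisfies st_intersects or st_contains, so it can be skipped without decoding any rows.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class BBox:
    """Axis-aligned bounding rectangle (hypothetical helper type)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "BBox") -> bool:
        # Two rectangles are disjoint exactly when one lies entirely
        # to the left of, right of, below, or above the other.
        return not (
            self.xmax < other.xmin
            or other.xmax < self.xmin
            or self.ymax < other.ymin
            or other.ymax < self.ymin
        )


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file metadata: the path plus the bounding
    rectangle of every geometry stored in that file."""
    path: str
    bbox: BBox


def prune_files(files: Iterable[FileStats], query_bbox: BBox) -> List[str]:
    """Return the paths of files that may contain matching rows.

    Files whose rectangle is disjoint from query_bbox are dropped;
    the remaining files still need the exact per-row predicate.
    """
    return [f.path for f in files if f.bbox.intersects(query_bbox)]


# Made-up example data: three files and the rectangle of <const_geom>.
files = [
    FileStats("part-0.parquet", BBox(0.0, 0.0, 10.0, 10.0)),
    FileStats("part-1.parquet", BBox(20.0, 20.0, 30.0, 30.0)),
    FileStats("part-2.parquet", BBox(5.0, 5.0, 25.0, 25.0)),
]
query = BBox(8.0, 8.0, 12.0, 12.0)
survivors = prune_files(files, query)
print(survivors)
```

With this example data, part-1.parquet is skipped because its rectangle starts at x = 20, beyond the query rectangle's xmax of 12. The check is only a conservative pre-filter: surviving files can still contain rows that fail the exact predicate.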