Thanks for the detailed response 🙂

I think Ryan's point in the referenced issue is important - having a set of 
transforms would be important in order to have consistent support across 
engines.

Partition transforms would indeed have to do most of the heavy lifting in order 
to simplify the query plans. The table partitions should at the very least carry 
the bounding box of the geospatial data they contain, but having a bounding box 
for every geospatial value could also make sense from a performance perspective.
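
To make the bounding-box idea concrete, here is a rough sketch in Python. It is not taken from the PR or from Iceberg; the BBox class, partition_bbox, prune, and the sample coordinates are all hypothetical names and values chosen for illustration. It folds per-value boxes into one box per partition and then skips partitions that cannot overlap a query window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Axis-aligned bounding box (hypothetical helper, not an Iceberg type)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def union(self, other: "BBox") -> "BBox":
        # Smallest box that covers both inputs.
        return BBox(min(self.xmin, other.xmin), min(self.ymin, other.ymin),
                    max(self.xmax, other.xmax), max(self.ymax, other.ymax))

    def intersects(self, other: "BBox") -> bool:
        # Two boxes overlap when they overlap on both axes.
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)


def partition_bbox(value_bboxes):
    """Fold the per-value boxes into one box for the whole partition."""
    boxes = iter(value_bboxes)
    result = next(boxes)
    for box in boxes:
        result = result.union(box)
    return result


def prune(partitions, query):
    """Keep only the partitions whose box overlaps the query window."""
    return [name for name, box in partitions.items() if box.intersects(query)]


# Assumed sample data: two partitions with made-up lon/lat boxes.
partitions = {
    "oslo": partition_bbox([BBox(10.6, 59.8, 10.7, 59.9),
                            BBox(10.7, 59.9, 10.9, 60.0)]),
    "bergen": partition_bbox([BBox(5.2, 60.3, 5.4, 60.5)]),
}
print(prune(partitions, BBox(10.0, 59.0, 11.0, 60.5)))  # ['oslo']
```

Keeping a box per value as well would allow the same intersects check to be applied a second time at row level, after the partition-level pruning.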

Thomas Li Fredriksen
Lead Solution Architect

p +47 452 21 055

–––––

www.hubocean.earth
________________________________
From: Walaa Eldin Moustafa <wa.moust...@gmail.com>
Sent: Thursday, October 27, 2022 19:46
To: dev@iceberg.apache.org <dev@iceberg.apache.org>
Subject: Re: Geospatial/geometry support

Types, as in "POINT", etc? No, the point was to just express them as complex 
types to avoid adding them to Iceberg spec and the engines (because even if 
they were added to Iceberg spec, engines will likely not have them as first 
class citizens anyways), i.e., their POINT/geometry semantics are invisible to 
Iceberg, and are just interpretable by the application.

On Thu, Oct 27, 2022 at 10:08 AM Ryan Blue 
<b...@tabular.io> wrote:
Walaa,

How are those types defined? Would we need to have definitions in the Iceberg 
spec?

Ryan


On Thu, Oct 27, 2022 at 9:47 AM Walaa Eldin Moustafa 
<wa.moust...@gmail.com> wrote:
Thanks Ryan! To expand a bit more:

For representation, I was thinking that geometry types could be expressed as 
complex types (e.g., POINTs as Structs), so they are compatible with all 
engines without having to introduce user-defined types in both Iceberg and 
compute engines.

For the partitioning:
(1) Custom partition functions could directly operate on complex types (e.g., 
structs representing POINTs). In this case the partitioning function is like: 
geometry_hash(struct_col); or
(2) Partitioning spec could be extended to allow "generated columns" to be 
sources of partition functions, so a "generated" WKB column can be the 
intermediate representation between complex geometry types and partition 
functions that accept primitive types. In this case, the partitioning function 
is like hashBytes(wkb(struct_col)).
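
The two options above can be sketched roughly as follows in Python. None of this is an existing Iceberg API: geometry_hash, wkb, and hash_bytes only mirror the names used in this message, the grid-cell scheme and the bucket count are assumptions, and CRC32 is merely a stand-in for whatever hash a real transform would use.

```python
import math
import struct
import zlib

NUM_BUCKETS = 16  # assumed bucket count, for illustration only


def geometry_hash(point, cell_deg=1.0, num_buckets=NUM_BUCKETS):
    """Option 1: bucket the struct directly by snapping it to a grid cell."""
    col = math.floor((point["x"] + 180.0) / cell_deg)
    row = math.floor((point["y"] + 90.0) / cell_deg)
    cells_per_row = math.ceil(360.0 / cell_deg)
    return (row * cells_per_row + col) % num_buckets


def wkb(point):
    """Option 2, step 1: encode the struct as a little-endian WKB Point."""
    # Byte order 1 = little endian, geometry type 1 = Point, then x and y doubles.
    return struct.pack("<BIdd", 1, 1, point["x"], point["y"])


def hash_bytes(data, num_buckets=NUM_BUCKETS):
    """Option 2, step 2: bucket a primitive binary value (CRC32 as a stand-in)."""
    return zlib.crc32(data) % num_buckets


# A POINT expressed as a plain struct, here a dict with assumed field names.
point = {"x": 10.75, "y": 59.91}
print(geometry_hash(point), hash_bytes(wkb(point)))
```

One difference worth noting: the grid-based option keeps nearby points in the same bucket, while hashing the WKB bytes spreads them out, so only the first preserves spatial locality for pruning.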

Thanks,
Walaa.

On Thu, Oct 27, 2022 at 8:46 AM Ryan Blue 
<b...@tabular.io> wrote:
Thomas, thanks for taking the time to put this together!

I've always wanted geospatial support in the format, but thought that
it would be best to have an expert design and build it with us so we
don't get it wrong.

I think Walaa is right about the approach. We want to use partition
transforms to do the heavy lifting of finding the right files for a
query. That means that we'd need some clear but generic definition of
geospatial objects in the data, along with more specific attributes.
At a high level, I think that's probably done by storing each object
using a standard envelope definition (bbox?) that we can use in
partition transforms, and then a WKB column for the actual object.
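
As a rough illustration of that layout, the Python sketch below stores each object as an envelope plus a WKB payload and lets a transform look only at the envelope. GeoRow, linestring_row, and envelope_cell are hypothetical names, and the lower-left-corner grid cell is just one assumed choice of transform.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoRow:
    """One stored object: envelope fields plus the actual geometry as WKB."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    wkb: bytes


def linestring_row(coords):
    """Build a row for a LineString given as a list of (x, y) tuples."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    # WKB LineString: byte order (1 = little endian), type 2, point count,
    # followed by one pair of doubles per point.
    payload = struct.pack("<BII", 1, 2, len(coords))
    for x, y in coords:
        payload += struct.pack("<dd", x, y)
    return GeoRow(min(xs), min(ys), max(xs), max(ys), payload)


def envelope_cell(row, cell_deg=10.0):
    """Assumed transform: grid cell of the envelope's lower-left corner."""
    return (int(row.xmin // cell_deg), int(row.ymin // cell_deg))


row = linestring_row([(5.3, 60.4), (10.7, 59.9)])
print(envelope_cell(row), len(row.wkb))  # (0, 5) 41
```

The transform never parses the WKB bytes, which is the point of keeping the envelope separate. An object whose envelope spans several cells would need extra handling that this sketch leaves out.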

What do you think?

Ryan

On Thu, Oct 27, 2022 at 4:03 AM Walaa Eldin Moustafa
<wa.moust...@gmail.com> wrote:
>
> Hi Thomas,
>
> It sounds like what you are trying to achieve is to provide a custom partition 
> function? There is some discussion here: 
> https://github.com/apache/iceberg/issues/1482.
> I guess supporting geometry through this framework makes more sense since it 
> does not require extending the Iceberg type system, yet is general enough to 
> support other applications.
>
> Thanks,
> Walaa.
>
> On Thu, Oct 27, 2022 at 12:33 AM Thomas Fredriksen 
> <thomas.fredriksen@oceandata.earth> wrote:
>>
>> Hello everyone,
>>
>> I am working with big geospatial data and trying to handle very large tables 
>> in object storage. Iceberg appears to be the ideal solution but unfortunately 
>> does not appear to support geometry columns.
>>
>> The way that iceberg is structured, it appears to be a good fit with the 
>> GeoParquet-standard 
>> (https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md),
>> so I created a pull request where I attempt to add this support: 
>> https://github.com/apache/iceberg/pull/6062
>>
>> The PR deviates from GeoParquet in the CRS-field of the column metadata. 
>> GeoParquet requires the CRS to be defined as a PROJJSON JSON object, while 
>> the PR simply asks the user to specify an EPSG ID, where EPSG:4326 (WGS84 - 
>> latitude/longitude) is considered default.
>>
>> I would love feedback on the PR and welcome the discussion on whether 
>> geospatial/geometry belongs in the iceberg standard.
>>
>> Thomas Li Fredriksen
>> Lead Solution Architect
>>
>> p +47 452 21 055
>>
>>
>> –––––
>>
>> www.hubocean.earth




--
Ryan Blue
Tabular


