Hi Martin, I've tested your proposal (reading binary and UDF getValue) and it works fine. I've actually converted the code to Scala easily. Now it's a matter of building/optimizing around it (spatial join, aggregate points per geotiff).
Best, On Fri, 20 Jan 2023 at 13:47, Martin Andersson <[email protected]> wrote: > Yes, there are lots of things to consider when processing large blobs in > Spark. What I have come to learn: > - Do the spatial join (points and the geotiff extent) with as few columns > as possible. Ideally an id only for the geotiff. After that join you can > join back the geotiff using the id. > - Aggregate the points to an array of points per geotiff. Your getValue > udf should take an array of points and return an array of values. That way > each geotiff is only loaded once. > - Parquet in Spark is not very good at handling large blobs. If reading > parquet with geotiffs is slow you can repartition() with a very large > number to force smaller row groups when writing or use Avro instead. > https://www.uber.com/en-SE/blog/hdfs-file-format-apache-spark/ > > Good luck! > > Br, > Martin Andersson > > > Den fre 20 jan. 2023 kl 13:08 skrev Pedro Mano Fernandes < > [email protected]>: > >> Thanks Martin, it sounds promising. I'll actually give it a try before >> going with geotiff conversions. >> >> I'm foreseeing some concerns, though: >> >> - I'm afraid it won't be optimal for a big geotiff - I may have to >> split the geotiff into smaller geotiffs >> - I wonder how the spatial partitioning optimization will behave in >> such approach - I may have to load smaller geotiffs and use their geometry >> to join (my coordinates against envelope boundaries) before calculating >> the >> getValue for my coordinates >> >> Best, >> >> On Fri, 20 Jan 2023 at 08:49, Martin Andersson < >> [email protected]> wrote: >> >>> I would read the geotiff files as binary: >>> https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html >>> >>> Then you can define a udf to extract values directly from the geotiffs. >>> If you're on python you can use raster.io to do that. >>> >>> In java it would look some thing like this: >>> >>> Integer getValue(byte[] geotiff, double x, double y) >>> throws IOException, TransformException { >>> try (ByteArrayInputStream inputStream = new >>> ByteArrayInputStream(geotiff)) { >>> GeoTiffReader geoTiffReader = new GeoTiffReader(inputStream); >>> GridCoverage2D grid = geoTiffReader.read(null); >>> Raster raster = grid.getRenderedImage().getData(); >>> GridGeometry2D gridGeometry = grid.getGridGeometry(); >>> >>> DirectPosition2D directPosition2D = new DirectPosition2D(x, y); >>> GridCoordinates2D gridCoordinates2D = >>> gridGeometry.worldToGrid(directPosition2D); >>> try { >>> int[] pixel = raster.getPixel(gridCoordinates2D.x, >>> gridCoordinates2D.y, new int[1]); >>> return pixel[0]; >>> } catch (ArrayIndexOutOfBoundsException exc) { >>> // point is outside the extentent >>> result.add(null); >>> } >>> } >>> } >>> >>> Br, >>> Martin Andersson >>> >>> Den ons 18 jan. 2023 kl 17:59 skrev Pedro Mano Fernandes < >>> [email protected]>: >>> >>>> Thanks for the update, guys. >>>> >>>> I'm not ready to contribute yet. >>>> >>>> In the meanwhile, the solution could be perhaps to convert GeoTiff to >>>> another format supported by Sedona. If anyone has had this use case before >>>> or has any idea, please share. >>>> >>>> Best, >>>> >>>> On Wed, 18 Jan 2023 at 09:47, Martin Andersson < >>>> [email protected]> wrote: >>>> >>>>> Hi, >>>>> >>>>> I think you are looking for something like this: >>>>> https://postgis.net/docs/RT_ST_Value.html >>>>> >>>>> The raster support in Sedona is very limited at the moment. The lack >>>>> of a proper raster type makes implementing st_value impossible. We had a >>>>> brief discussion about that recently. >>>>> https://lists.apache.org/thread/qdfcvxl6z5pb7m7ky5zsksyytyxqwv8c >>>>> >>>>> If you want to make a contribution and need some guidance, please let >>>>> me know! >>>>> >>>>> Br, >>>>> Martin Andersson >>>>> >>>>> Den ons 18 jan. 2023 kl 05:45 skrev Jia Yu <[email protected]>: >>>>> >>>>>> Hi Pedro, >>>>>> >>>>>> I got your point. Unfortunately, we don't have this function yet in >>>>>> Sedona. >>>>>> But we welcome anyone who want to contribute this to Sedona! >>>>>> >>>>>> Thanks, >>>>>> Jia >>>>>> >>>>>> On Tue, Jan 17, 2023 at 9:11 AM Pedro Mano Fernandes < >>>>>> [email protected]> >>>>>> wrote: >>>>>> >>>>>> > Hi all, >>>>>> > >>>>>> > Any clue? Or any documentation I can refer to? >>>>>> > >>>>>> > Here goes a dummy example to better explain myself: in QGIS I can >>>>>> click a >>>>>> > point (coordinates) of the geotiff and get the value in that point >>>>>> (in this >>>>>> > case 231 of Band 1). >>>>>> > >>>>>> > [image: image.png] >>>>>> > >>>>>> > Thanks, >>>>>> > >>>>>> > On Sun, 15 Jan 2023 at 16:17, Pedro Mano Fernandes < >>>>>> [email protected]> >>>>>> > wrote: >>>>>> > >>>>>> >> Hi Jia, >>>>>> >> >>>>>> >> Thanks for the fast response. >>>>>> >> >>>>>> >> With the regular spatial join I’ll get the array of data of the >>>>>> whole >>>>>> >> geotiff polygon. I was hoping to get the data element for specific >>>>>> >> coordinates inside that polygon. In other words: I guess the array >>>>>> of data >>>>>> >> corresponds to all the positions in the polygon, but I want to >>>>>> fetch >>>>>> >> specific positions. >>>>>> >> >>>>>> >> Thanks, >>>>>> >> >>>>>> >> On Sun, 15 Jan 2023 at 01:09, Jia Yu <[email protected]> wrote: >>>>>> >> >>>>>> >>> Hi Pedro, >>>>>> >>> >>>>>> >>> Once you use Sedona geotiff reader to read those geotiffs, you >>>>>> will get >>>>>> >>> a dataframe with the following schema: >>>>>> >>> >>>>>> >>> |-- image: struct (nullable = true) >>>>>> >>> | |-- origin: string (nullable = true) >>>>>> >>> | |-- Geometry: string (nullable = true) >>>>>> >>> | |-- height: integer (nullable = true) >>>>>> >>> | |-- width: integer (nullable = true) >>>>>> >>> | |-- nBands: integer (nullable = true) >>>>>> >>> | |-- data: array (nullable = true) >>>>>> >>> | | |-- element: double (containsNull = true) >>>>>> >>> >>>>>> >>> >>>>>> >>> You can use the following way to fetch the geometry column and >>>>>> perform >>>>>> >>> the spatial join; >>>>>> >>> >>>>>> >>> geotiffDF = geotiffDF.selectExpr("image.origin as >>>>>> >>> origin","ST_GeomFromWkt(image.geometry) as Geom", "image.height >>>>>> as height", >>>>>> >>> "image.width as width", "image.data as data", "image.nBands as >>>>>> bands") >>>>>> >>> geotiffDF.createOrReplaceTempView("GeotiffDataframe") >>>>>> >>> geotiffDF.show() >>>>>> >>> >>>>>> >>> More info can be found: >>>>>> >>> >>>>>> https://sedona.apache.org/1.3.1-incubating/api/sql/Raster-loader/#geotiff-dataframe-loader >>>>>> >>> >>>>>> >>> Thanks, >>>>>> >>> Jia >>>>>> >>> >>>>>> >>> On Sat, Jan 14, 2023 at 9:10 AM Pedro Mano Fernandes < >>>>>> >>> [email protected]> wrote: >>>>>> >>> >>>>>> >>>> Hi everyone! >>>>>> >>>> >>>>>> >>>> I'm trying to use elevation data in GeoTiff format. I understand >>>>>> how to >>>>>> >>>> load the dataset, as described in >>>>>> >>>> >>>>>> >>>> >>>>>> https://sedona.staged.apache.org/api/sql/Raster-loader/#geotiff-dataframe-loader >>>>>> >>>> . >>>>>> >>>> Now I'm wondering how to join this dataframe with another one >>>>>> that >>>>>> >>>> contains >>>>>> >>>> coordinates, in order to get the elevation data for those >>>>>> coordinates. >>>>>> >>>> >>>>>> >>>> Something along these lines: >>>>>> >>>> >>>>>> >>>> pointsDF >>>>>> >>>> .join(geotiffDF, ...) >>>>>> >>>> .select("lon", "lat", "geotiff_data") >>>>>> >>>> >>>>>> >>>> Are there any examples or documentation I can follow to >>>>>> accomplish this? >>>>>> >>>> >>>>>> >>>> Thanks, >>>>>> >>>> >>>>>> >>>> -- >>>>>> >>>> Pedro Mano Fernandes >>>>>> >>>> >>>>>> >>> -- >>>>>> >> Pedro Mano Fernandes >>>>>> >> >>>>>> > >>>>>> > >>>>>> > -- >>>>>> > Pedro Mano Fernandes >>>>>> > >>>>>> >>>>> >>>>> >>>>> -- >>>>> Hälsningar, >>>>> Martin >>>>> >>>> >>>> >>>> -- >>>> Pedro Mano Fernandes >>>> >>> >> >> -- >> Pedro Mano Fernandes >> > -- Pedro Mano Fernandes
