PDEUXA opened a new issue, #945:
URL: https://github.com/apache/sedona/issues/945
Hello Sedona!
## Expected behavior
The geometry column should serialise correctly when collecting data to Pandas or similar.
## Actual behavior
A `TypeError` is raised:
```
~/.venv/lib/python3.10/site-packages/pyspark/sql/dataframe.py in take(self, num)
    866         [Row(age=2, name='Alice'), Row(age=5, name='Bob')]
    867         """
--> 868         return self.limit(num).collect()
    869
    870     def tail(self, num: int) -> List[Row]:

~/.venv/lib/python3.10/site-packages/pyspark/sql/dataframe.py in collect(self)
    816         with SCCallSiteSync(self._sc):
    817             sock_info = self._jdf.collectToPython()
--> 818         return list(_load_from_socket(sock_info, BatchedSerializer(CPickleSerializer())))
    819
    820     def toLocalIterator(self, prefetchPartitions: bool = False) -> Iterator[Row]:

~/.venv/lib/python3.10/site-packages/pyspark/serializers.py in load_stream(self, stream)
    149         while True:
    150             try:
--> 151                 yield self._read_with_length(stream)
    152             except EOFError:
    153                 return

~/.venv/lib/python3.10/site-packages/pyspark/serializers.py in _read_with_length(self, stream)
    171         if len(obj) < length:
    172             raise EOFError
--> 173         return self.loads(obj)
    174
    175     def dumps(self, obj):

~/.venv/lib/python3.10/site-packages/pyspark/serializers.py in loads(self, obj, encoding)
    469
    470     def loads(self, obj, encoding="bytes"):
--> 471         return cloudpickle.loads(obj, encoding=encoding)
    472
    473

~/.venv/lib/python3.10/site-packages/pyspark/sql/types.py in <lambda>(*a)
   1727 # This is used to unpickle a Row from JVM
   1728 def _create_row_inbound_converter(dataType: DataType) -> Callable:
-> 1729     return lambda *a: dataType.fromInternal(a)
   1730
   1731

~/.venv/lib/python3.10/site-packages/pyspark/sql/types.py in fromInternal(self, obj)
    819         if self._needSerializeAnyField:
    820             # Only calling fromInternal function for fields that need conversion
--> 821             values = [
    822                 f.fromInternal(v) if c else v
    823                 for f, v, c in zip(self.fields, obj, self._needConversion)

~/.venv/lib/python3.10/site-packages/pyspark/sql/types.py in <listcomp>(.0)
    820             # Only calling fromInternal function for fields that need conversion
    821             values = [
--> 822                 f.fromInternal(v) if c else v
    823                 for f, v, c in zip(self.fields, obj, self._needConversion)
    824             ]

~/.venv/lib/python3.10/site-packages/pyspark/sql/types.py in fromInternal(self, obj)
    592
    593     def fromInternal(self, obj: T) -> T:
--> 594         return self.dataType.fromInternal(obj)
    595
    596     def typeName(self) -> str:  # type: ignore[override]

~/.venv/lib/python3.10/site-packages/pyspark/sql/types.py in fromInternal(self, obj)
    879         v = self._cachedSqlType().fromInternal(obj)
    880         if v is not None:
--> 881             return self.deserialize(v)
    882
    883     def serialize(self, obj: Any) -> Any:

~/.venv/lib/python3.10/site-packages/sedona/sql/types.py in deserialize(self, datum)
     31
     32     def deserialize(self, datum):
---> 33         geom, offset = geometry_serde.deserialize(datum)
     34         return geom
     35

~/.venv/lib/python3.10/site-packages/sedona/utils/geometry_serde.py in deserialize(buf)
     59     if buf is None:
     60         return None
---> 61     return geomserde_speedup.deserialize(buf)
     62
     63 speedup_enabled = True

TypeError: a bytes-like object is required, not 'list'
```
## Steps to reproduce the problem
I have geometries as WKB hex strings such as
`"0101000020DB0B00004D3F10ED049B4E411318961117634741"`, in a dataframe with this schema:
```
root
|-- geom: string (nullable = true)
```
I am applying the following function:
`df.withColumn("geom2", geom_from_wkb(F.unhex(F.col("geom"))))`
When I show the dataframe with `df.select("geom", "geom2").show(2, False)`:
```
+---------------------------------------------+--------------------------------------------------+
|geom2                                        |geom                                              |
+---------------------------------------------+--------------------------------------------------+
|POINT (4011529.8520583273 3065390.1373930066)|0101000020DB0B00004D3F10ED049B4E411318961117634741|
|POINT (4009430.840070244 3009943.8371693227) |0101000020DB0B0000F86B876BEB964E41475D28EBCBF64641|
+---------------------------------------------+--------------------------------------------------+
```
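As a side check that the sample data itself is well-formed: the hex strings are EWKB with an embedded SRID, and can be decoded with the Python standard library alone. This is a minimal sketch written for this issue (`parse_ewkb_point` is a hypothetical helper, not part of Sedona), assuming the exact layout seen above: little-endian byte order and a 2-D point with the SRID flag set.

```python
import struct

def parse_ewkb_point(hex_str: str):
    """Decode a little-endian EWKB hex string for a 2-D point with SRID.

    Layout: 1 byte byte-order marker, 4 bytes geometry type,
    4 bytes SRID, then two 8-byte doubles (x, y).
    """
    buf = bytes.fromhex(hex_str)
    assert buf[0] == 1  # 1 = little-endian
    (geom_type,) = struct.unpack_from("<I", buf, 1)
    assert geom_type == 0x20000001  # point type with SRID flag
    (srid,) = struct.unpack_from("<I", buf, 5)
    x, y = struct.unpack_from("<2d", buf, 9)
    return srid, x, y

srid, x, y = parse_ewkb_point(
    "0101000020DB0B00004D3F10ED049B4E411318961117634741"
)
print(srid, x, y)  # 3035, then coordinates matching the POINT shown above
```

This confirms the first row decodes to the same coordinates that `show()` prints, so the failure is in collecting the already-converted geometry back to Python, not in the input data.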
When I call `take()`, `collect()`, or `toPandas()`, the error above is thrown.
This worked on the previous Sedona version, 1.2.1.
## Settings
Sedona version = 1.4.1
Apache Spark version = 3.3.2
Apache Flink version = N/A
API type = Python
Scala version = 2.12
JRE version = 1.8
Python version = 3.10.9
Environment = Standalone