PDEUXA opened a new issue, #945:
URL: https://github.com/apache/sedona/issues/945
Hello Sedona!
## Expected behavior
The geometry column should serialise correctly when collecting data to Pandas or similar.
## Actual behavior
A `TypeError` is raised:
```
~/.venv/lib/python3.10/site-packages/pyspark/sql/dataframe.py in take(self, num)
    866         [Row(age=2, name='Alice'), Row(age=5, name='Bob')]
    867         """
--> 868         return self.limit(num).collect()
    869
    870     def tail(self, num: int) -> List[Row]:

~/.venv/lib/python3.10/site-packages/pyspark/sql/dataframe.py in collect(self)
    816         with SCCallSiteSync(self._sc):
    817             sock_info = self._jdf.collectToPython()
--> 818         return list(_load_from_socket(sock_info, BatchedSerializer(CPickleSerializer())))
    819
    820     def toLocalIterator(self, prefetchPartitions: bool = False) -> Iterator[Row]:

~/.venv/lib/python3.10/site-packages/pyspark/serializers.py in load_stream(self, stream)
    149         while True:
    150             try:
--> 151                 yield self._read_with_length(stream)
    152             except EOFError:
    153                 return

~/.venv/lib/python3.10/site-packages/pyspark/serializers.py in _read_with_length(self, stream)
    171         if len(obj) < length:
    172             raise EOFError
--> 173         return self.loads(obj)
    174
    175     def dumps(self, obj):

~/.venv/lib/python3.10/site-packages/pyspark/serializers.py in loads(self, obj, encoding)
    469
    470     def loads(self, obj, encoding="bytes"):
--> 471         return cloudpickle.loads(obj, encoding=encoding)
    472
    473

~/.venv/lib/python3.10/site-packages/pyspark/sql/types.py in <lambda>(*a)
   1727 # This is used to unpickle a Row from JVM
   1728 def _create_row_inbound_converter(dataType: DataType) -> Callable:
-> 1729     return lambda *a: dataType.fromInternal(a)
   1730
   1731

~/.venv/lib/python3.10/site-packages/pyspark/sql/types.py in fromInternal(self, obj)
    819         if self._needSerializeAnyField:
    820             # Only calling fromInternal function for fields that need conversion
--> 821             values = [
    822                 f.fromInternal(v) if c else v
    823                 for f, v, c in zip(self.fields, obj, self._needConversion)

~/.venv/lib/python3.10/site-packages/pyspark/sql/types.py in <listcomp>(.0)
    820             # Only calling fromInternal function for fields that need conversion
    821             values = [
--> 822                 f.fromInternal(v) if c else v
    823                 for f, v, c in zip(self.fields, obj, self._needConversion)
    824             ]

~/.venv/lib/python3.10/site-packages/pyspark/sql/types.py in fromInternal(self, obj)
    592
    593     def fromInternal(self, obj: T) -> T:
--> 594         return self.dataType.fromInternal(obj)
    595
    596     def typeName(self) -> str:  # type: ignore[override]

~/.venv/lib/python3.10/site-packages/pyspark/sql/types.py in fromInternal(self, obj)
    879         v = self._cachedSqlType().fromInternal(obj)
    880         if v is not None:
--> 881             return self.deserialize(v)
    882
    883     def serialize(self, obj: Any) -> Any:

~/.venv/lib/python3.10/site-packages/sedona/sql/types.py in deserialize(self, datum)
     31
     32     def deserialize(self, datum):
---> 33         geom, offset = geometry_serde.deserialize(datum)
     34         return geom
     35

~/.venv/lib/python3.10/site-packages/sedona/utils/geometry_serde.py in deserialize(buf)
     59     if buf is None:
     60         return None
---> 61     return geomserde_speedup.deserialize(buf)
     62
     63 speedup_enabled = True

TypeError: a bytes-like object is required, not 'list'
```
## Steps to reproduce the problem
I have geometries as WKB hex strings such as
`"0101000020DB0B00004D3F10ED049B4E411318961117634741"`, in a dataframe with this schema:
```
root
|-- geom: string (nullable = true)
```
I am applying the following function:
`df.withColumn("geom2", geom_from_wkb(F.unhex(F.col("geom"))))`
When I show the dataframe with `df.select("geom", "geom2").show(2, False)`:
```
+---------------------------------------------+--------------------------------------------------+
|geom2                                        |geom                                              |
+---------------------------------------------+--------------------------------------------------+
|POINT (4011529.8520583273 3065390.1373930066)|0101000020DB0B00004D3F10ED049B4E411318961117634741|
|POINT (4009430.840070244 3009943.8371693227) |0101000020DB0B0000F86B876BEB964E41475D28EBCBF64641|
+---------------------------------------------+--------------------------------------------------+
```
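As a side check that the sample data itself is well-formed: the hex strings are EWKB with an embedded SRID, and can be decoded with the Python standard library alone. This is a minimal sketch written for this issue (`parse_ewkb_point` is a hypothetical helper, not part of Sedona), assuming the exact layout seen above: little-endian byte order and a 2-D point with the SRID flag set.

```python
import struct

def parse_ewkb_point(hex_str: str):
    """Decode a little-endian EWKB hex string for a 2-D point with SRID.

    Layout: 1 byte byte-order marker, 4 bytes geometry type,
    4 bytes SRID, then two 8-byte doubles (x, y).
    """
    buf = bytes.fromhex(hex_str)
    assert buf[0] == 1  # 1 = little-endian
    (geom_type,) = struct.unpack_from("<I", buf, 1)
    assert geom_type == 0x20000001  # point type with SRID flag
    (srid,) = struct.unpack_from("<I", buf, 5)
    x, y = struct.unpack_from("<2d", buf, 9)
    return srid, x, y

srid, x, y = parse_ewkb_point(
    "0101000020DB0B00004D3F10ED049B4E411318961117634741"
)
print(srid, x, y)  # 3035, then coordinates matching the POINT shown above
```

This confirms the first row decodes to the same coordinates that `show()` prints, so the failure is in collecting the already-converted geometry back to Python, not in the input data.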
When I call `take()`, `collect()`, or `toPandas()`, the error above is thrown.
This worked on the previous Sedona version, 1.2.1.
## Settings
Sedona version = 1.4.1
Apache Spark version = 3.3.2
Apache Flink version = N/A
API type = Python
Scala version = 2.12
JRE version = 1.8
Python version = 3.10.9
Environment = Standalone