robertnagy1 opened a new issue, #876:
URL: https://github.com/apache/sedona/issues/876

   ## Expected behavior
   I have read a shapefile in as an RDD (approximately 4 million rows and about 5 columns).
   I am trying to raise the Kryo buffer maximum, but no matter how high I set it, I still get a Kryo buffer error when calling rdd.countWithoutDuplicates(). This makes me curious: does config("spark.kryoserializer.buffer.max","50g") have any effect?
   
   spark = SparkSession.\
       builder.\
       master("local[*]").\
       appName("Sedona App").\
       config("spark.serializer", KryoSerializer.getName).\
       config("spark.kryo.registrator", SedonaKryoRegistrator.getName).\
       config("spark.kryoserializer.buffer","50g").\
       config("spark.kryoserializer.buffer.max","50g").\
       config('spark.executor.memory', "2g").\
       config("spark.driver.memory", "3g").\
       config("spark.jars.packages", "org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.4.0,org.datasyslab:geotools-wrapper:1.4.0-28.2").\
       getOrCreate()
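
One way to investigate whether the setting takes effect is to read the value back and compare it against Spark's documented constraints. The sketch below is only an illustration, not part of Sedona or Spark: the helper names `size_to_mib` and `check_kryo_buffer_max` are hypothetical, the size parser is a simplified approximation of Spark's size-string handling, and the 2048m upper bound and 64m default are taken from the Spark configuration documentation for `spark.kryoserializer.buffer.max`.

```python
import re

# Multipliers to MiB for a subset of Spark-style size suffixes (simplified).
_UNIT_TO_MIB = {
    "k": 1 / 1024, "kb": 1 / 1024,
    "m": 1, "mb": 1,
    "g": 1024, "gb": 1024,
    "t": 1024 ** 2, "tb": 1024 ** 2,
}

# Per the Spark docs, spark.kryoserializer.buffer.max must be less than 2048m.
KRYO_MAX_LIMIT_MIB = 2048


def size_to_mib(value, default_unit="m"):
    """Convert a size string such as '50g' or '512m' to MiB (hypothetical helper)."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]*)\s*", value.lower())
    if match is None:
        raise ValueError(f"unrecognised size string: {value!r}")
    number, unit = match.groups()
    unit = unit or default_unit
    if unit not in _UNIT_TO_MIB:
        raise ValueError(f"unrecognised size unit: {unit!r}")
    return int(number) * _UNIT_TO_MIB[unit]


def check_kryo_buffer_max(conf, key="spark.kryoserializer.buffer.max"):
    """Describe the state of the Kryo max-buffer setting in a dict of conf values."""
    raw = conf.get(key)
    if raw is None:
        return f"{key} is not set; Spark falls back to its default (64m)"
    if size_to_mib(raw) >= KRYO_MAX_LIMIT_MIB:
        return f"{key}={raw} is not below the 2048m limit"
    return f"{key}={raw} is within the limit"


if __name__ == "__main__":
    print(check_kryo_buffer_max({"spark.kryoserializer.buffer.max": "50g"}))
    print(check_kryo_buffer_max({"spark.kryoserializer.buffer.max": "512m"}))
    print(check_kryo_buffer_max({}))
```

With a live session, the effective values could be passed in as `check_kryo_buffer_max(dict(spark.sparkContext.getConf().getAll()))`. If the notebook environment has already started a session, `getOrCreate()` may return that existing session, in which case serializer settings passed to the builder may not be applied; reading the value back this way would show whether that is happening.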
   
   ## Actual behavior
   
   No matter how high spark.kryoserializer.buffer.max is set, the error is the same:
   
   Py4JJavaError: An error occurred while calling o4019.countWithoutDuplicates.
   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 27.0 failed 4 times, most recent failure: Lost task 0.3 in stage 27.0 (TID 35) (vm-78d56181 executor 2): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 6, required: 8
   
   ## Steps to reproduce the problem
   Download a large shapefile, read it in as an RDD, create a Spark session as described above, and try to run countWithoutDuplicates() on the RDD.
   
   
   ## Settings
   
   Sedona version = 1.4.1
   
   Apache Spark version = 3.3.1.5.2-92314920
   
   
   API type = Python
   
   Scala version = 2.12
   
   
   
   Python version = 3.10
   
   Environment = Azure Synapse spark pool

