[jira] [Created] (SEDONA-225) Cannot count dataframes loaded from GeoParquet files

Kristin Cowalcijk (Jira) Thu, 29 Dec 2022 00:01:08 -0800

Kristin Cowalcijk created SEDONA-225:
----------------------------------------


             Summary: Cannot count dataframes loaded from GeoParquet files
                 Key: SEDONA-225
                 URL: https://issues.apache.org/jira/browse/SEDONA-225
             Project: Apache Sedona
          Issue Type: Bug
    Affects Versions: 1.3.0, 1.3.1
            Reporter: Kristin Cowalcijk


{{spark.read.format("geoparquet").load("/path/to/geoparquet").count()}} raises 
a {{java.lang.ClassCastException}} because Spark expects the scan to return 
columnar batches, while {{GeoParquetFileFormat}} does not implement vectorized 
reads and falls back to row-based reading with parquet-mr:

{code:scala}
spark.read.format("geoparquet").load("/path/to/example1.parquet").count()
22/12/29 15:58:12 WARN GeoParquetFileFormat: GeoParquet currently does not support vectorized reader. Falling back to parquet-mr
22/12/29 15:58:12 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.UnsafeRow cannot be cast to org.apache.spark.sql.vectorized.ColumnarBatch
        at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.next(DataSourceScanExec.scala:560)
        at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.next(DataSourceScanExec.scala:549)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

{code}
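
The mismatch reported above can be illustrated with a small, Spark-free 
sketch. All names in it ({{Row}}, {{Batch}}, {{ToyFormat}}, {{supportsBatch}}) 
are hypothetical stand-ins rather than Spark or Sedona APIs; it only models a 
source that advertises batch output while its reader produces single rows, 
and it is not a description of how {{FileSourceScanExec}} is actually 
implemented:

{code:scala}
object BatchMismatchSketch {
  // Hypothetical stand-ins for Spark's UnsafeRow and ColumnarBatch.
  final case class Row(value: Int)
  final case class Batch(rows: Seq[Row])

  // A toy "file format": it advertises batch support, but its reader
  // falls back to producing one Row at a time.
  object ToyFormat {
    val supportsBatch: Boolean = true
    def read(): Iterator[Any] = Iterator(Row(1), Row(2), Row(3))
  }

  // A toy consumer that trusts supportsBatch and casts accordingly.
  def count(): Long =
    if (ToyFormat.supportsBatch)
      // Fails here: each element is a Row, not a Batch.
      ToyFormat.read().map(_.asInstanceOf[Batch].rows.size.toLong).sum
    else
      ToyFormat.read().size.toLong

  def main(args: Array[String]): Unit = {
    try count()
    catch {
      case e: ClassCastException =>
        println(s"count failed: ${e.getClass.getName}")
    }
  }
}
{code}

In this toy model the failure disappears once {{supportsBatch}} agrees with 
what {{read()}} returns, which is the kind of consistency the log and the 
stack trace suggest is missing here.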



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
