Will Bowditch created SEDONA-183:
------------------------------------
Summary: Python version check fails when Sedona JAR is distributed by YARN
Key: SEDONA-183
URL: https://issues.apache.org/jira/browse/SEDONA-183
Project: Apache Sedona
Issue Type: Bug
Affects Versions: 1.2.1
Reporter: Will Bowditch
When running Sedona on YARN, the SQL API works, but the RDD API fails. For
example, this works:
```py
import pyspark
from pyspark.sql import SparkSession, Row
import pyspark.sql.functions as F
from sedona.register import SedonaRegistrator
from sedona.utils import KryoSerializer, SedonaKryoRegistrator
from sedona.utils.adapter import Adapter
spark = (
    SparkSession.builder.master("yarn")
    .config('spark.yarn.dist.files', '/home/hadoop/container_built.pex')
    .config('spark.pyspark.python', './container_built.pex')
    .config("spark.serializer", KryoSerializer.getName)
    .config("spark.kryo.registrator", SedonaKryoRegistrator.getName)
    .config("spark.jars.packages", ",".join([
        "org.apache.hadoop:hadoop-aws:3.2.2",
        "com.amazonaws:aws-java-sdk-bundle:1.11.375",
        "org.apache.sedona:sedona-python-adapter-3.0_2.12:1.2.1-incubating",
        "org.datasyslab:geotools-wrapper:1.1.0-25.2",
    ]))
    .appName('demo_sedona')
    .getOrCreate()
)
SedonaRegistrator.registerAll(spark)
SAMPLE_DATA = [
    Row(
        geometry="POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))",
        lineage_id="1",
    ),
    Row(
        geometry="POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))",
        lineage_id="2",
    ),
]
df = spark.createDataFrame(SAMPLE_DATA)
df = df.withColumn('geometry', F.expr('ST_GeomFromWKT(geometry)'))
df.show()
```
But this fails:
```py
>>> uk_rdd = Adapter.toSpatialRdd(df, "geometry")
>>> uk_rdd.analyze()
>>> uk_rdd.fieldNames
File ~/.pex/installed_wheels/971abd58450fc3192bdbcd5bc770d0c0eb950f7ad312fe639f906b4419cbaa3a/apache_sedona-1.2.1-py3-none-any.whl/sedona/core/jvm/config.py:52, in since.<locals>.wrapper.<locals>.applier(*args, **kwargs)
     47 if not is_greater_or_equal_version(sedona_version, version):
     48     logging.warning(
     49         f"This function is not available for {sedona_version}, "
     50         f"please use version higher than {version}"
     51     )
---> 52     raise AttributeError(f"Not available before {version} sedona version")
     53 result = function(*args, **kwargs)
     54 return result

AttributeError: Not available before 1.0.0 sedona version
```
I think the underlying reason is that the version-check logic on the Python
side doesn't handle YARN-distributed JARs and therefore returns `None` instead
of the version:
```py
import sedona.core.jvm.config
sedona.core.jvm.config.SedonaMeta().version is None
```
`SedonaMeta` first attempts to read the version from:
```py
spark.conf._jconf.get("spark.jars")
```
which raises on YARN, but includes the Sedona JARs when Spark is running in
`local[*]` mode:
```py
>>> spark.conf._jconf.get("spark.jars")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py",
line 1322, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py", line
328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o75.get.
: java.util.NoSuchElementException: spark.jars
at
org.apache.spark.sql.errors.QueryExecutionErrors$.noSuchElementExceptionError(QueryExecutionErrors.scala:1494)
at
org.apache.spark.sql.internal.SQLConf.$anonfun$getConfString$3(SQLConf.scala:4841)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.internal.SQLConf.getConfString(SQLConf.scala:4841)
at org.apache.spark.sql.RuntimeConfig.get(RuntimeConfig.scala:72)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
```
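Since PySpark's `spark.conf.get` accepts a default value, a defensive lookup could avoid this exception entirely. As a rough sketch (the `jars_from_conf` helper and the key order are illustrative, not Sedona's actual code):

```py
def jars_from_conf(conf_lookup):
    """Return the configured JAR list, trying several conf keys in turn.

    ``conf_lookup`` is any callable mapping a conf key to its value and
    raising when the key is unset -- e.g. ``spark.conf.get`` without a
    default, or a plain dict's ``__getitem__`` for testing.
    """
    for key in ("spark.jars", "spark.yarn.dist.jars"):
        try:
            value = conf_lookup(key)
        except Exception:
            continue  # key unset in this deploy mode; try the next one
        if value:
            return value.split(",")
    return []

# On YARN, spark.jars is unset but spark.yarn.dist.jars holds the JARs:
conf = {"spark.yarn.dist.jars": "file:///a/sedona-core.jar,file:///a/jts.jar"}
print(jars_from_conf(conf.__getitem__))
```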
When that fails, the check falls back to listing the JARs in `SPARK_HOME`;
however, Sedona isn't there, because YARN distributes the `spark.jars.packages`
artifacts separately.
```py
>>> os.listdir(os.path.join(os.environ["SPARK_HOME"], "jars"))
['httpclient-4.5.9.jar', 'HikariCP-2.5.1.jar', 'commons-text-1.6.jar', '*',
'JLargeArrays-1.5.jar', 'httpcore-4.4.11.jar', 'JTransforms-3.1.jar',
'compress-lzf-1.0.3.jar', 'RoaringBitmap-0.9.0.jar',
'jackson-core-asl-1.9.13.jar', 'ST4-4.0.4.jar', 'core-1.1.2.jar',
'activation-1.1.1.jar', 'commons-logging-1.1.3.jar',
'aggdesigner-algorithm-6.0.jar', 'curator-client-2.13.0.jar',
'aircompressor-0.21.jar', 'curator-framework-2.13.0.jar',
'algebra_2.12-2.0.1.jar', 'ivy-2.5.0.jar', 'all-1.1.2.pom',
'gmetric4j-1.0.10.jar', 'annotations-17.0.0.jar', 'gson-2.2.4.jar',
'antlr-runtime-3.5.2.jar', 'guava-14.0.1.jar', 'antlr4-runtime-4.8.jar',
'commons-math3-3.4.1.jar', 'aopalliance-repackaged-2.6.1.jar',
'jackson-core-2.12.3.jar', 'arpack-2.2.1.jar',
'hadoop-client-api-3.2.1-amzn-7.jar', 'arpack_combined_all-0.1.jar',
'hive-beeline-2.3.9-amzn-2.jar', 'arrow-format-2.0.0.jar',
'hive-exec-2.3.9-amzn-2-core.jar', 'arrow-memory-core-2.0.0.jar',
'hadoop-client-runtime-3.2.1-amzn-7.jar', 'arrow-memory-netty-2.0.0.jar',
'hive-cli-2.3.9-amzn-2.jar', 'arrow-vector-2.0.0.jar', 'commons-net-3.1.jar',
'jpam-1.1.jar', 'audience-annotations-0.5.0.jar',
'hive-common-2.3.9-amzn-2.jar', 'automaton-1.11-8.jar',
'jackson-databind-2.12.3.jar', 'avro-1.10.2.jar', 'jakarta.inject-2.6.1.jar',
'avro-ipc-1.10.2.jar', 'hive-metastore-2.3.9-amzn-2.jar',
'avro-mapred-1.10.2.jar', 'jakarta.ws.rs-api-2.1.6.jar', 'blas-2.2.1.jar',
'hive-jdbc-2.3.9-amzn-2.jar', 'bonecp-0.8.0.RELEASE.jar',
'hive-llap-common-2.3.9-amzn-2.jar', 'breeze-macros_2.12-1.2.jar',
'javax.jdo-3.2.0-m3.jar', 'breeze_2.12-1.2.jar', 'hive-service-rpc-3.1.2.jar',
'cats-kernel_2.12-2.1.1.jar', 'hive-serde-2.3.9-amzn-2.jar',
'chill-java-0.10.0.jar', 'hk2-api-2.6.1.jar', 'chill_2.12-0.10.0.jar',
'janino-3.0.16.jar', 'commons-cli-1.2.jar', 'hive-shims-0.23-2.3.9-amzn-2.jar',
'commons-codec-1.15.jar', 'commons-pool-1.5.4.jar',
'commons-collections-3.2.2.jar', 'hive-shims-2.3.9-amzn-2.jar',
'commons-compiler-3.0.16.jar', 'hive-shims-common-2.3.9-amzn-2.jar',
'commons-compress-1.21.jar', 'hive-storage-api-2.7.2.jar',
'commons-crypto-1.1.0.jar', 'hk2-locator-2.6.1.jar', 'commons-dbcp-1.4.jar',
'hk2-utils-2.6.1.jar', 'commons-io-2.8.0.jar',
'htrace-core4-4.1.0-incubating.jar', 'commons-lang-2.6.jar',
'istack-commons-runtime-3.0.8.jar', 'commons-lang3-3.12.0.jar',
'curator-recipes-2.13.0.jar', 'javassist-3.25.0-GA.jar', 'derby-10.14.2.0.jar',
'javax.servlet-api-3.1.0.jar', 'disruptor-3.3.7.jar',
'flatbuffers-java-1.9.0.jar', 'javolution-5.5.1.jar',
'dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar',
'jaxb-runtime-2.3.2.jar', 'generex-1.0.2.jar', 'jaxb-api-2.2.11.jar',
'netty-transport-native-epoll-4.1.74.Final-linux-aarch_64.jar',
'hadoop-yarn-server-web-proxy-3.2.1-amzn-7.jar', 'metrics-graphite-4.2.0.jar',
'xz-1.8.jar', 'hive-shims-scheduler-2.3.9-amzn-2.jar', 'metrics-jmx-4.2.0.jar',
'hive-vector-code-gen-2.3.9-amzn-2.jar', 'native_ref-java-1.1.jar',
'jackson-annotations-2.12.3.jar', 'native_system-java-1.1.jar',
'jackson-dataformat-yaml-2.12.3.jar',
'netlib-native_ref-linux-i686-1.1-natives.jar',
'jackson-datatype-jsr310-2.11.2.jar',
'netlib-native_ref-osx-x86_64-1.1-natives.jar',
'jackson-mapper-asl-1.9.13.jar', 'metrics-json-4.2.0.jar',
'jackson-module-scala_2.12-2.12.3.jar',
'netlib-native_ref-win-i686-1.1-natives.jar',
'jakarta.annotation-api-1.3.5.jar',
'netlib-native_ref-win-x86_64-1.1-natives.jar',
'jakarta.servlet-api-4.0.3.jar', 'netty-all-4.1.74.Final.jar',
'jakarta.validation-api-2.0.2.jar', 'netty-buffer-4.1.74.Final.jar',
'jakarta.xml.bind-api-2.3.2.jar', 'objenesis-2.6.jar',
'jcl-over-slf4j-1.7.30.jar', 'protobuf-java-2.5.0.jar', 'jdo-api-3.0.1.jar',
'okhttp-3.12.12.jar', 'jersey-client-2.34.jar', 'okio-1.14.0.jar',
'jersey-common-2.34.jar', 'netty-codec-4.1.74.Final.jar',
'jersey-container-servlet-2.34.jar', 'metrics-jvm-4.2.0.jar',
'jersey-container-servlet-core-2.34.jar', 'pyrolite-4.30.jar',
'jersey-hk2-2.34.jar', 'opencsv-2.3.jar', 'jersey-server-2.34.jar',
'netty-common-4.1.74.Final.jar', 'jetty-rewrite-9.3.27.v20190418.jar',
'py4j-0.10.9.3.jar', 'jline-2.14.6.jar', 'rocksdbjni-6.20.3.jar',
'jniloader-1.1.jar', 'orc-core-1.6.12.jar', 'joda-time-2.10.10.jar',
'remotetea-oncrpc-1.1.2.jar', 'jodd-core-3.5.2.jar',
'scala-compiler-2.12.15.jar', 'json-1.8.jar', 'netty-handler-4.1.74.Final.jar',
'json4s-ast_2.12-3.7.0-M11.jar', 'netty-resolver-4.1.74.Final.jar',
'json4s-core_2.12-3.7.0-M11.jar', 'netty-tcnative-classes-2.0.48.Final.jar',
'json4s-jackson_2.12-3.7.0-M11.jar', 'netty-transport-4.1.74.Final.jar',
'json4s-scalap_2.12-3.7.0-M11.jar', 'scala-library-2.12.15.jar',
'jsr305-3.0.0.jar', 'shims-0.9.0.jar', 'jta-1.1.jar',
'orc-mapreduce-1.6.12.jar', 'jul-to-slf4j-1.7.30.jar', 'orc-shims-1.6.12.jar',
'kryo-shaded-4.0.2.jar', 'scala-reflect-2.12.15.jar', 'lapack-2.2.1.jar',
'oro-2.0.8.jar', 'leveldbjni-all-1.8.jar', 'scala-xml_2.12-1.2.0.jar',
'libfb303-0.9.3.jar', 'osgi-resource-locator-1.0.3.jar',
'libthrift-0.12.0.jar', 'shapeless_2.12-2.3.3.jar', 'log4j-1.2.17.jar',
'parquet-format-structures-1.12.2-amzn-0.jar',
'logging-interceptor-3.12.12.jar', 'slf4j-api-1.7.30.jar',
'lz4-java-1.7.1.jar', 'paranamer-2.8.jar', 'macro-compat_2.12-1.1.1.jar',
'parquet-column-1.12.2-amzn-0.jar', 'metrics-core-4.2.0.jar',
'minlog-1.3.0.jar', 'snakeyaml-1.27.jar',
'netlib-native_ref-linux-armhf-1.1-natives.jar',
'netlib-native_ref-linux-x86_64-1.1-natives.jar',
'netlib-native_system-linux-armhf-1.1-natives.jar',
'netlib-native_system-linux-i686-1.1-natives.jar',
'netlib-native_system-linux-x86_64-1.1-natives.jar',
'netlib-native_system-osx-x86_64-1.1-natives.jar',
'netlib-native_system-win-i686-1.1-natives.jar',
'netlib-native_system-win-x86_64-1.1-natives.jar',
'netty-transport-classes-epoll-4.1.74.Final.jar',
'netty-transport-classes-kqueue-4.1.74.Final.jar', 'emr-spark-goodies.jar',
'javax.inject-1.jar',
'netty-transport-native-epoll-4.1.74.Final-linux-x86_64.jar',
'spark-unsafe_2.12-3.2.1-amzn-0.jar', 'jsr305-3.0.2.jar',
'netty-transport-native-kqueue-4.1.74.Final-osx-aarch_64.jar',
'spark-yarn_2.12-3.2.1-amzn-0.jar',
'netty-transport-native-kqueue-4.1.74.Final-osx-x86_64.jar',
'spire-macros_2.12-0.17.0.jar',
'netty-transport-native-unix-common-4.1.74.Final.jar',
'parquet-common-1.12.2-amzn-0.jar', 'parquet-encoding-1.12.2-amzn-0.jar',
'zjsonpatch-0.3.0.jar', 'parquet-hadoop-1.12.2-amzn-0.jar',
'zookeeper-3.6.2.jar', 'parquet-jackson-1.12.2-amzn-0.jar',
'spire-util_2.12-0.17.0.jar', 'scala-collection-compat_2.12-2.1.1.jar',
'spire_2.12-0.17.0.jar', 'scala-parser-combinators_2.12-1.1.2.jar',
'findbugs-annotations-3.0.1.jar', 'slf4j-log4j12-1.7.30.jar',
'ion-java-1.0.2.jar', 'snappy-java-1.1.8.4.jar', 'stax-api-1.0.1.jar',
'spark-catalyst_2.12-3.2.1-amzn-0.jar', 'zookeeper-jute-3.6.2.jar',
'spark-core_2.12-3.2.1-amzn-0.jar', 'stream-2.9.6.jar',
'spark-ganglia-lgpl_2.12-3.2.1-amzn-0.jar', 'zstd-jni-1.5.0-4.jar',
'spark-graphx_2.12-3.2.1-amzn-0.jar', 'spire-platform_2.12-0.17.0.jar',
'spark-hive-thriftserver_2.12-3.2.1-amzn-0.jar',
'datanucleus-api-jdo-4.2.4.jar', 'spark-hive_2.12-3.2.1-amzn-0.jar',
'datanucleus-core-4.1.17.jar', 'spark-kvstore_2.12-3.2.1-amzn-0.jar',
'super-csv-2.2.0.jar', 'spark-launcher_2.12-3.2.1-amzn-0.jar',
'threeten-extra-1.5.0.jar', 'spark-mllib-local_2.12-3.2.1-amzn-0.jar',
'datanucleus-rdbms-4.1.19.jar', 'spark-mllib_2.12-3.2.1-amzn-0.jar',
'tink-1.6.0.jar', 'spark-network-common_2.12-3.2.1-amzn-0.jar',
'transaction-api-1.1.jar', 'spark-network-shuffle_2.12-3.2.1-amzn-0.jar',
'annotations-16.0.2.jar', 'spark-repl_2.12-3.2.1-amzn-0.jar',
'aopalliance-1.0.jar', 'spark-sketch_2.12-3.2.1-amzn-0.jar',
'bcprov-ext-jdk15on-1.66.jar', 'spark-sql_2.12-3.2.1-amzn-0.jar',
'univocity-parsers-2.9.1.jar', 'spark-streaming_2.12-3.2.1-amzn-0.jar',
'velocity-1.5.jar', 'spark-tags_2.12-3.2.1-amzn-0-tests.jar',
'emrfs-hadoop-assembly-2.52.0.jar', 'spark-tags_2.12-3.2.1-amzn-0.jar',
'xbean-asm9-shaded-4.20.jar', 'jmespath-java-1.12.170.jar',
'mariadb-connector-java.jar']
>>> os.environ['SPARK_HOME']
'/usr/lib/spark'
```
I think the fix is to also check `spark.yarn.dist.jars` for the Sedona packages, e.g.:
```py
>>> spark.conf._jconf.get("spark.yarn.dist.jars")
'file:///home/hadoop/.ivy2/jars/org.apache.sedona_sedona-python-adapter-3.0_2.12-1.2.1-incubating.jar,file:///home/hadoop/.ivy2/jars/org.datasyslab_geotools-wrapper-1.1.0-25.2.jar,file:///home/hadoop/.ivy2/jars/org.locationtech.jts_jts-core-1.18.2.jar,file:///home/hadoop/.ivy2/jars/org.wololo_jts2geojson-0.16.1.jar,file:///home/hadoop/.ivy2/jars/org.apache.sedona_sedona-core-3.0_2.12-1.2.1-incubating.jar,file:///home/hadoop/.ivy2/jars/org.apache.sedona_sedona-sql-3.0_2.12-1.2.1-incubating.jar,file:///home/hadoop/.ivy2/jars/org.scala-lang.modules_scala-collection-compat_2.12-2.5.0.jar'
```
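Sketched as a patch idea (a hypothetical helper, not the actual `SedonaMeta` code), the version could be parsed out of whichever JAR list is populated:

```py
import re

def sedona_version_from_jars(jars_csv):
    """Scan a comma-separated JAR list (e.g. the value of
    spark.yarn.dist.jars shown above) for a Sedona artifact and return
    its version string, or None if no Sedona JAR is present."""
    for jar in jars_csv.split(","):
        m = re.search(r"sedona[\w.-]*-(\d+\.\d+\.\d+[\w.-]*)\.jar$", jar)
        if m:
            return m.group(1)
    return None

print(sedona_version_from_jars(
    "file:///home/hadoop/.ivy2/jars/org.locationtech.jts_jts-core-1.18.2.jar,"
    "file:///home/hadoop/.ivy2/jars/"
    "org.apache.sedona_sedona-core-3.0_2.12-1.2.1-incubating.jar"
))
```

`SedonaMeta` could then try this over `spark.jars` first and `spark.yarn.dist.jars` as a fallback, so both local and YARN deployments resolve a version.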
--
This message was sent by Atlassian Jira
(v8.20.10#820010)