[ https://issues.apache.org/jira/browse/SPARK-31813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117440#comment-17117440 ]
Ondrej Kokes commented on SPARK-31813:
--------------------------------------

Tried a different Docker image - openjdk:8 - downloaded Spark 2.4.5, unpacked it, launched spark-shell and could reproduce the issue there. I can't quite think of a cleaner way to reproduce it.

> Cannot write snappy-compressed text files
> -----------------------------------------
>
>                 Key: SPARK-31813
>                 URL: https://issues.apache.org/jira/browse/SPARK-31813
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.4.5
>            Reporter: Ondrej Kokes
>            Priority: Minor
>
> After installing pyspark (pip install pyspark) on both macOS and Ubuntu (a
> clean Docker image with default-jre), Spark fails to write text-based files
> (CSV and JSON) with snappy compression. It can snappy-compress parquet and
> orc; gzipping CSVs also works.
> This is a clean PySpark installation and the snappy jars are in place:
> {{$ ls -1 /usr/local/lib/python3.7/site-packages/pyspark/jars/ | grep snappy}}
> {{snappy-0.2.jar}}
> {{snappy-java-1.1.7.3.jar}}
>
> Repro 1 (Scala):
> {{$ spark-shell}}
> {{spark.sql("select 1").write.option("compression", "snappy").mode("overwrite").parquet("tmp/foo")}}
> {{spark.sql("select 1").write.option("compression", "snappy").mode("overwrite").csv("tmp/foo")}}
> The first write (parquet) works; the second (CSV) does not.
>
> Repro 2 (PySpark):
> {{from pyspark.sql import SparkSession}}
> {{if __name__ == '__main__':}}
> {{    spark = SparkSession.builder.appName('snappy_testing').getOrCreate()}}
> {{    spark.sql('select 1').write.option('compression', 'snappy').mode('overwrite').parquet('tmp/works_fine')}}
> {{    spark.sql('select 1').write.option('compression', 'gzip').mode('overwrite').csv('tmp/also_works')}}
> {{    spark.sql('select 1').write.option('compression', 'snappy').mode('overwrite').csv('tmp/snappy_not_found')}}
>
> In either case I get the following traceback:
> java.lang.RuntimeException: native snappy library not available: this version
> of libhadoop was built without snappy support.
> at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:65)
> at org.apache.hadoop.io.compress.SnappyCodec.getCompressorType(SnappyCodec.java:134)
> at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:150)
> at org.apache.hadoop.io.compress.CompressionCodec$Util.createOutputStreamWithCodecPool(CompressionCodec.java:131)
> at org.apache.hadoop.io.compress.SnappyCodec.createOutputStream(SnappyCodec.java:100)
> at org.apache.spark.sql.execution.datasources.CodecStreams$$anonfun$createOutputStream$1.apply(CodecStreams.scala:84)
> at org.apache.spark.sql.execution.datasources.CodecStreams$$anonfun$createOutputStream$1.apply(CodecStreams.scala:84)
> at scala.Option.map(Option.scala:146)
> at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:84)
> at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
> at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CSVFileFormat.scala:177)
> at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:85)
> at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
> at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
> at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236)
> at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
> at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:123)
> at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
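
For what it's worth, the trace bottoms out in Hadoop's native-library check (SnappyCodec.checkNativeCodeLoaded), so the same error can be provoked from spark-shell without going through Spark's writers at all. A minimal sketch, assuming only the hadoop-common classes bundled with the Spark 2.4.5 distribution; note that NativeCodeLoader.buildSupportsSnappy() is itself a native method, so it is only called once libhadoop is known to be loaded:

import java.io.ByteArrayOutputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.compress.SnappyCodec
import org.apache.hadoop.util.NativeCodeLoader

// Is the native libhadoop visible to this JVM, and was it built with snappy support?
println(s"libhadoop loaded: ${NativeCodeLoader.isNativeCodeLoaded()}")
if (NativeCodeLoader.isNativeCodeLoaded()) {
  println(s"built with snappy: ${NativeCodeLoader.buildSupportsSnappy()}")
}

// Ask Hadoop's SnappyCodec for an output stream directly; this runs the same
// checkNativeCodeLoaded() guard seen at the top of the trace above and should
// raise the same "native snappy library not available" RuntimeException when
// libhadoop/libsnappy are missing.
val codec = new SnappyCodec()
codec.setConf(new Configuration())
codec.createOutputStream(new ByteArrayOutputStream())

If that snippet fails the same way, it points at the missing native library rather than at anything specific to the CSV/JSON writers; parquet and orc use their own pure-Java snappy bindings, which is presumably why they are unaffected.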