[ https://issues.apache.org/jira/browse/SPARK-31813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117440#comment-17117440 ]

Ondrej Kokes commented on SPARK-31813:
--------------------------------------

Tried a different Docker image (openjdk:8): downloaded Spark 2.4.5, unpacked it,
launched spark-shell, and could reproduce the issue there. I can't quite think
of a cleaner way to reproduce it.
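
For completeness, roughly what that amounts to (a sketch; the download URL and
package name are assumptions - any Spark 2.4.5 binary distribution should
behave the same):

{{$ docker run -it openjdk:8 bash}}
{{# curl -LO https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz}}
{{# tar xzf spark-2.4.5-bin-hadoop2.7.tgz}}
{{# ./spark-2.4.5-bin-hadoop2.7/bin/spark-shell}}
{{scala> spark.sql("select 1").write.option("compression", "snappy").mode("overwrite").csv("tmp/foo")}}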

> Cannot write snappy-compressed text files
> -----------------------------------------
>
>                 Key: SPARK-31813
>                 URL: https://issues.apache.org/jira/browse/SPARK-31813
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.4.5
>            Reporter: Ondrej Kokes
>            Priority: Minor
>
> After installing pyspark (pip install pyspark) on both macOS and Ubuntu (a
> clean Docker image with default-jre), Spark fails to write text-based files
> (CSV and JSON) with snappy compression. It can snappy-compress Parquet and
> ORC, and gzipping CSVs also works.
> This is a clean PySpark installation; the snappy jars are in place:
> {{$ ls -1 /usr/local/lib/python3.7/site-packages/pyspark/jars/ | grep snappy}}
> {{snappy-0.2.jar}}
> {{snappy-java-1.1.7.3.jar}}
> Repro 1 (Scala):
> {{$ spark-shell}}
> {{spark.sql("select 1").write.option("compression", "snappy").mode("overwrite").parquet("tmp/foo")}}
> {{spark.sql("select 1").write.option("compression", "snappy").mode("overwrite").csv("tmp/foo")}}
> The first (parquet) will work, the second one won't.
> Repro 2 (PySpark):
> {{from pyspark.sql import SparkSession}}
> {{if __name__ == '__main__':}}
> {{    spark = SparkSession.builder.appName('snappy_testing').getOrCreate()}}
> {{    spark.sql('select 1').write.option('compression', 'snappy').mode('overwrite').parquet('tmp/works_fine')}}
> {{    spark.sql('select 1').write.option('compression', 'gzip').mode('overwrite').csv('tmp/also_works')}}
> {{    spark.sql('select 1').write.option('compression', 'snappy').mode('overwrite').csv('tmp/snappy_not_found')}}
>
> In either case I get the following traceback:
> java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.
>   at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:65)
>   at org.apache.hadoop.io.compress.SnappyCodec.getCompressorType(SnappyCodec.java:134)
>   at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:150)
>   at org.apache.hadoop.io.compress.CompressionCodec$Util.createOutputStreamWithCodecPool(CompressionCodec.java:131)
>   at org.apache.hadoop.io.compress.SnappyCodec.createOutputStream(SnappyCodec.java:100)
>   at org.apache.spark.sql.execution.datasources.CodecStreams$$anonfun$createOutputStream$1.apply(CodecStreams.scala:84)
>   at org.apache.spark.sql.execution.datasources.CodecStreams$$anonfun$createOutputStream$1.apply(CodecStreams.scala:84)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:84)
>   at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
>   at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CSVFileFormat.scala:177)
>   at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:85)
>   at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
>   at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236)
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
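>
> For what it's worth, the snippet below shows one way to check from PySpark
> whether the JVM can use Hadoop's native snappy support at all. This is a
> sketch: it pokes at Hadoop's internal NativeCodeLoader class through py4j,
> which is not a public Spark API, and the explanation of why Parquet/ORC
> are unaffected is my reading of the stack trace above, not a confirmed
> diagnosis.
> {code:python}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.appName('native_snappy_check').getOrCreate()
>
> # Writing snappy-compressed CSV/JSON goes through Hadoop's SnappyCodec,
> # which needs a native libhadoop built with snappy support; Parquet and
> # ORC use the bundled snappy-java jar instead, which would explain why
> # they are unaffected.
> loader = spark._jvm.org.apache.hadoop.util.NativeCodeLoader
> print('native libhadoop loaded:', loader.isNativeCodeLoaded())
> try:
>     print('libhadoop built with snappy:', loader.buildSupportsSnappy())
> except Exception as exc:
>     # buildSupportsSnappy() is a native method; when libhadoop itself is
>     # missing it raises UnsatisfiedLinkError (surfaced through py4j)
>     print('native snappy check failed:', exc)
> {code}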


