[
https://issues.apache.org/jira/browse/SPARK-20185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953183#comment-15953183
]
Hyukjin Kwon commented on SPARK-20185:
--------------------------------------
{{codec}} or {{compression}} is an option for writing out, as documented, so it is not applied when reading.
The workaround does not seem difficult, and the behaviour looks reasonable
to me as well.
> csv decompressed incorrectly with extension other than 'gz'
> ------------------------------------------------------------
>
> Key: SPARK-20185
> URL: https://issues.apache.org/jira/browse/SPARK-20185
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
> Reporter: Ran Mingxuan
> Priority: Minor
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> With the code below:
> val start_time = System.currentTimeMillis()
> val gzFile = spark.read
>   .format("com.databricks.spark.csv")
>   .option("header", "false")
>   .option("inferSchema", "false")
>   .option("codec", "gzip")
>   .load("/foo/someCsvFile.gz.bak")
> gzFile.repartition(1).write.mode("overwrite").parquet("/foo/")
> I got an error even though I specified the codec:
> WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
> platform... using builtin-java classes where applicable
> 17/03/23 15:44:55 WARN ipc.Client: Exception encountered while connecting to
> the server :
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> Operation category READ is not supported in state standby. Visit
> https://s.apache.org/sbnn-error
> 17/03/23 15:44:58 ERROR executor.Executor: Exception in task 2.0 in stage
> 12.0 (TID 977)
> java.lang.NullPointerException
> at
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:109)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)
> I had to add the extension to GzipCodec to make my code run:
> import org.apache.hadoop.io.compress.GzipCodec
> class BakGzipCodec extends GzipCodec {
>   override def getDefaultExtension(): String = ".gz.bak"
> }
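> For Hadoop's CompressionCodecFactory to pick the custom codec up on read, the class also has to be registered via the io.compression.codecs Hadoop property. A sketch of one way to do that through the Spark conf, assuming the class above is compiled as the hypothetical com.example.BakGzipCodec and is on the driver/executor classpath:

```scala
import org.apache.spark.sql.SparkSession

// Register the custom codec alongside the stock ones; the package name
// com.example is an assumption for illustration.
val spark = SparkSession.builder()
  .config("spark.hadoop.io.compression.codecs",
          "org.apache.hadoop.io.compress.DefaultCodec," +
          "org.apache.hadoop.io.compress.GzipCodec," +
          "com.example.BakGzipCodec")
  .getOrCreate()
```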
> I suppose the file loader should determine the codec from the option first,
> and only then fall back to the file extension.
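> The behaviour reported above follows from Hadoop resolving the read codec purely by file-name suffix, so a ".gz.bak" file matches no registered codec and is read as plain (in fact gzip-compressed) bytes, which then breaks the CSV parser. A minimal plain-Scala sketch of that suffix lookup (a toy registry, not the real CompressionCodecFactory API):

```scala
// Toy model of extension-based codec resolution: each codec is known
// only by its default extension, so a path must end with one of these
// suffixes to be decompressed at all.
object CodecLookupSketch {
  // hypothetical registry: codec name -> default extension
  val codecs = Map("GzipCodec" -> ".gz", "BZip2Codec" -> ".bz2")

  // Return the first codec whose default extension the path ends with,
  // or None, in which case the file would be read as uncompressed text.
  def codecFor(path: String): Option[String] =
    codecs.collectFirst { case (name, ext) if path.endsWith(ext) => name }

  def main(args: Array[String]): Unit = {
    println(codecFor("/foo/someCsvFile.gz"))     // Some(GzipCodec)
    println(codecFor("/foo/someCsvFile.gz.bak")) // None -> raw bytes hit the CSV parser
  }
}
```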
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)