Ran Mingxuan created SPARK-20185:
------------------------------------
Summary: csv decompressed incorrectly with extension other than
'gz'
Key: SPARK-20185
URL: https://issues.apache.org/jira/browse/SPARK-20185
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0, 1.6.3
Reporter: Ran Mingxuan
With the code below:

val start_time = System.currentTimeMillis()
val gzFile = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .option("codec", "gzip")
  .load("/foo/someCsvFile.gz.bak")
gzFile.repartition(1).write.mode("overwrite").parquet("/foo/")
I got an error even though I explicitly specified the codec:
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
17/03/23 15:44:55 WARN ipc.Client: Exception encountered while connecting to
the server :
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
Operation category READ is not supported in state standby. Visit
https://s.apache.org/sbnn-error
17/03/23 15:44:58 ERROR executor.Executor: Exception in task 2.0 in stage 12.0
(TID 977)
java.lang.NullPointerException
at
org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:109)
at
org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
at
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
at
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)
I had to subclass GzipCodec with the matching extension to make my code run:
import org.apache.hadoop.io.compress.GzipCodec

class BakGzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".gz.bak"
}
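For the subclass above to take effect, Hadoop's CompressionCodecFactory presumably has to know about it, which can be done through the standard io.compression.codecs property (the package name com.example below is illustrative, not part of the original report):

```scala
// Register the custom codec so CompressionCodecFactory picks it up when
// matching file extensions. BakGzipCodec must be on the executor classpath;
// "com.example" is a placeholder package.
spark.sparkContext.hadoopConfiguration.set(
  "io.compression.codecs",
  "org.apache.hadoop.io.compress.GzipCodec,com.example.BakGzipCodec")
```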
I suppose the file loader should resolve the codec from the "codec" option
first, and fall back to the file extension only when no option is given.
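The proposed lookup order could be sketched like this (the names resolveCodec and inferFromExtension are illustrative, not the actual Spark internals):

```scala
// Prefer the user-supplied "codec" option; only when it is absent fall
// back to extension-based inference, which is the current behaviour.
def resolveCodec(options: Map[String, String], path: String): String =
  options.get("codec") match {
    case Some(codec) => codec                    // explicit option wins
    case None        => inferFromExtension(path) // e.g. ".gz" -> "gzip"
  }

// Hypothetical fallback mirroring today's extension matching.
def inferFromExtension(path: String): String =
  if (path.endsWith(".gz")) "gzip" else "none"
```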
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]