Ran Mingxuan created SPARK-20185:
------------------------------------

             Summary: csv decompressed incorrectly with extension other than 'gz'
                 Key: SPARK-20185
                 URL: https://issues.apache.org/jira/browse/SPARK-20185
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0, 1.6.3
            Reporter: Ran Mingxuan


With the code below:
val start_time = System.currentTimeMillis()
val gzFile = spark.read
    .format("com.databricks.spark.csv")
    .option("header", "false")
    .option("inferSchema", "false")
    .option("codec", "gzip")
    .load("/foo/someCsvFile.gz.bak")
gzFile.repartition(1).write.mode("overwrite").parquet("/foo/")

I get an error even though I specified the codec:

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/23 15:44:55 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
17/03/23 15:44:58 ERROR executor.Executor: Exception in task 2.0 in stage 12.0 (TID 977)
java.lang.NullPointerException
        at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:109)
        at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
        at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
        at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)

I have to subclass GzipCodec and override its default extension to make my code run:

import org.apache.hadoop.io.compress.GzipCodec
class BakGzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".gz.bak"
}
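
For a subclass like this to be picked up, Hadoop has to know about it. A minimal sketch of one way to register it, assuming `spark` is an active SparkSession, that BakGzipCodec is compiled into a jar on both the driver and executor classpath, and that the setting is applied before the read (none of this is stated in the report itself):

```scala
// Assumption: `spark` is an existing SparkSession and BakGzipCodec (defined above)
// is available on the driver and executor classpath.
// "io.compression.codecs" is Hadoop's comma-separated list of additional codec classes;
// Hadoop's CompressionCodecFactory matches files against each codec's default extension.
spark.sparkContext.hadoopConfiguration
  .set("io.compression.codecs", classOf[BakGzipCodec].getName)
```

With the codec registered, a path ending in ".gz.bak" should resolve to the gzip decompressor by its extension alone.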

I suppose the file loader should determine the file codec from the option first, 
and only then fall back to the extension.
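
The proposed lookup order can be sketched as follows. This is only an illustration of "option first, extension second", not Spark's actual code: the object name, the `resolveCodec` helper, and the extension table are all hypothetical.

```scala
// Hypothetical sketch of the proposed resolution order; none of these names are Spark APIs.
object CodecResolution {
  // Illustrative extension-to-codec table (an assumption, not Hadoop's real registry).
  private val byExtension: Map[String, String] = Map(
    ".gz"  -> "gzip",
    ".bz2" -> "bzip2"
  )

  // Prefer an explicit "codec" option; fall back to the file extension only if it is absent.
  def resolveCodec(path: String, options: Map[String, String]): Option[String] =
    options.get("codec").map(_.toLowerCase)
      .orElse(byExtension.collectFirst { case (ext, codec) if path.endsWith(ext) => codec })
}
```

Under this order, "/foo/someCsvFile.gz.bak" read with the option "codec" set to "gzip" resolves to gzip, while the same path without the option resolves to no codec at all, which matches the situation described in this report.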





