[
https://issues.apache.org/jira/browse/SPARK-20185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953183#comment-15953183
]
Hyukjin Kwon commented on SPARK-20185:
--------------------------------------
{{codec}} or {{compression}} is an option for writing out, as documented, so it is not applied when reading.
The workaround does not seem difficult, and the behaviour looks reasonable
to me as well.
> csv decompressed incorrectly with extension other than 'gz'
> ------------------------------------------------------------
>
> Key: SPARK-20185
> URL: https://issues.apache.org/jira/browse/SPARK-20185
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
> Reporter: Ran Mingxuan
> Priority: Minor
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> With the code below:
> val start_time = System.currentTimeMillis()
> val gzFile = spark.read
>   .format("com.databricks.spark.csv")
>   .option("header", "false")
>   .option("inferSchema", "false")
>   .option("codec", "gzip")
>   .load("/foo/someCsvFile.gz.bak")
> gzFile.repartition(1).write.mode("overwrite").parquet("/foo/")
> I got an error even though I specified the codec:
> WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
> platform... using builtin-java classes where applicable
> 17/03/23 15:44:55 WARN ipc.Client: Exception encountered while connecting to
> the server :
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> Operation category READ is not supported in state standby. Visit
> https://s.apache.org/sbnn-error
> 17/03/23 15:44:58 ERROR executor.Executor: Exception in task 2.0 in stage
> 12.0 (TID 977)
> java.lang.NullPointerException
> at
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:109)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)
> I had to add the extension to GzipCodec to make my code run:
> import org.apache.hadoop.io.compress.GzipCodec
> class BakGzipCodec extends GzipCodec {
>   override def getDefaultExtension(): String = ".gz.bak"
> }
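> For Hadoop's CompressionCodecFactory to pick the custom codec up on read, the class also has to be registered via the io.compression.codecs Hadoop property. A sketch of one way to do that through the Spark conf, assuming the class above is compiled as the hypothetical com.example.BakGzipCodec and is on the driver/executor classpath:

```scala
import org.apache.spark.sql.SparkSession

// Register the custom codec alongside the stock ones; the package name
// com.example is an assumption for illustration.
val spark = SparkSession.builder()
  .config("spark.hadoop.io.compression.codecs",
          "org.apache.hadoop.io.compress.DefaultCodec," +
          "org.apache.hadoop.io.compress.GzipCodec," +
          "com.example.BakGzipCodec")
  .getOrCreate()
```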
> I suppose the file loader should determine the codec from the option first,
> and only then fall back to the file extension.
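> The behaviour reported above follows from Hadoop resolving the read codec purely by file-name suffix, so a ".gz.bak" file matches no registered codec and is read as plain (in fact gzip-compressed) bytes, which then breaks the CSV parser. A minimal plain-Scala sketch of that suffix lookup (a toy registry, not the real CompressionCodecFactory API):

```scala
// Toy model of extension-based codec resolution: each codec is known
// only by its default extension, so a path must end with one of these
// suffixes to be decompressed at all.
object CodecLookupSketch {
  // hypothetical registry: codec name -> default extension
  val codecs = Map("GzipCodec" -> ".gz", "BZip2Codec" -> ".bz2")

  // Return the first codec whose default extension the path ends with,
  // or None, in which case the file would be read as uncompressed text.
  def codecFor(path: String): Option[String] =
    codecs.collectFirst { case (name, ext) if path.endsWith(ext) => name }

  def main(args: Array[String]): Unit = {
    println(codecFor("/foo/someCsvFile.gz"))     // Some(GzipCodec)
    println(codecFor("/foo/someCsvFile.gz.bak")) // None -> raw bytes hit the CSV parser
  }
}
```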
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)