[jira] [Comment Edited] (SPARK-19340) Opening a file in CSV format will result in an exception if the filename contains special characters

Reza Safi (JIRA) Wed, 25 Jan 2017 01:25:45 -0800

    [ https://issues.apache.org/jira/browse/SPARK-19340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837424#comment-15837424 ]


Reza Safi edited comment on SPARK-19340 at 1/25/17 9:24 AM:
------------------------------------------------------------

As I mentioned in an earlier comment, the exception occurs only if we open the 
file as CSV. If we open it as text, there is no exception and the data 
loads successfully.
We can also put a file named test{00-01}.txt on HDFS. If the file is in the 
local file system under, for example, /tmp/spark, use something like this:
sudo -u hdfs hadoop fs -put /tmp/spark/test%7B00-01%7D.txt /user/root
In place of the curly brackets, use their percent-encoded equivalents 
(%7B for { and %7D for }).
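
The escaped name in the command above can be derived by percent-encoding the 
braces. The following is only a minimal sketch: the object and method names 
are hypothetical, and java.net.URLEncoder is used as a convenient stand-in 
(it performs form encoding, so it would turn a space into "+", which does not 
matter for this particular file name).

```scala
import java.net.{URLDecoder, URLEncoder}

// Hypothetical helper, not part of Spark or Hadoop.
object BraceEncoding {
  // Percent-encode a file name as UTF-8. URLEncoder leaves letters, digits,
  // '.', '-', '*' and '_' unchanged and escapes '{' and '}'.
  def encodeName(name: String): String = URLEncoder.encode(name, "UTF-8")

  // Reverse the encoding to recover the original file name.
  def decodeName(encoded: String): String = URLDecoder.decode(encoded, "UTF-8")

  def main(args: Array[String]): Unit = {
    val encoded = encodeName("test{00-01}.txt")
    println(encoded)             // test%7B00-01%7D.txt
    println(decodeName(encoded)) // test{00-01}.txt
  }
}
```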


> Opening a file in CSV format will result in an exception if the filename 
> contains special characters
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19340
>                 URL: https://issues.apache.org/jira/browse/SPARK-19340
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.2.0
>            Reporter: Reza Safi
>            Priority: Minor
>
> If you try to open a file whose name looks like {noformat} "*{*}*.*" 
> {noformat} or {noformat} "*[*]*.*" {noformat} using the CSV format, you get 
> "org.apache.spark.sql.AnalysisException: Path does not exist", whether the 
> file is local or on HDFS.
> This bug can be reproduced on master and all other Spark 2 branches.
> To reproduce:
> # Create a file like "test{00-01}.txt" in a local directory (for example, 
> /Users/reza/test/test{00-01}.txt)
> # Run spark-shell
> # Execute this command:
> {noformat}
> val df=spark.read.option("header","false").csv("/Users/reza/test/*.txt")
> {noformat}
> You will see the following stack trace:
> {noformat}
> org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/reza/test/test\{00-01\}.txt;
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:367)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:360)
>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:360)
>   at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.readText(CSVFileFormat.scala:208)
>   at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:63)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:174)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:174)
>   at scala.Option.orElse(Option.scala:289)
>   at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:173)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:158)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:423)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:360)
>   ... 48 elided
> {noformat}
> If you put the file on HDFS (for example, under /user/root) and run the 
> following:
> {noformat}
> val df=spark.read.option("header", false).csv("/user/root/*.txt")
> {noformat}
>  
> You will get the following exception:
> {noformat}
> org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://hosturl/user/root/test\{00-01\}.txt matches 0 files
>   at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>   at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>   at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>   at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>   at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1297)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
>   at org.apache.spark.rdd.RDD.take(RDD.scala:1292)
>   at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1332)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
>   at org.apache.spark.rdd.RDD.first(RDD.scala:1331)
>   at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.findFirstLine(CSVFileFormat.scala:167)
>   at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:59)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:421)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:421)
>   at scala.Option.orElse(Option.scala:289)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:420)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:413)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:349)
>   ... 48 elided
> {noformat}
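
The "Input Pattern ... matches 0 files" message suggests that the already 
resolved file name is being interpreted as a glob pattern a second time. The 
sketch below illustrates why such a name would then fail to match itself. It 
rests on assumptions: java.nio's glob syntax is used as a stand-in (Hadoop's 
glob matcher is a separate implementation), and the object and method names 
are hypothetical.

```scala
import java.nio.file.{FileSystems, Paths}

// Hypothetical helper, not part of Spark or Hadoop.
object GlobBraces {
  // Returns true if fileName matches the glob pattern (java.nio glob syntax).
  def matchesGlob(pattern: String, fileName: String): Boolean =
    FileSystems.getDefault
      .getPathMatcher("glob:" + pattern)
      .matches(Paths.get(fileName))

  def main(args: Array[String]): Unit = {
    val name = "test{00-01}.txt"
    // Used as a pattern, the braces form a group, so the pattern describes
    // "test00-01.txt" rather than the literal file name.
    println(matchesGlob(name, name))
    println(matchesGlob(name, "test00-01.txt"))
    // Escaping the braces makes the pattern match the literal name.
    println(matchesGlob("test\\{00-01\\}.txt", name))
  }
}
```

With this stand-in matcher, the literal name matches only when the braces in 
the pattern are escaped.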



