[ 
https://issues.apache.org/jira/browse/SPARK-25199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16592704#comment-16592704
 ] 

Maxim Gekk commented on SPARK-25199:
------------------------------------

I wasn't able to reproduce the issue on the current master:
{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
      /_/

Using Python version 2.7.15 (default, Aug 22 2018 16:36:18)
>>> df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("tmp/csv/*.csv")
>>> df.printSchema()
root
 |-- a: integer (nullable = true)
 |-- b: integer (nullable = true)
{code}
This was with two CSV files, one of which is empty:
{code}
tree -h ./csv
./csv
├── [   8]  1.csv
└── [   0]  2.csv
{code}
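
To make the setup above easier to recreate, here is a minimal sketch that builds the same directory layout: one 8-byte file and one 0-byte file. The contents of 1.csv are not shown in the comment, so "a,b" plus a single integer row is an assumption chosen to match the 8-byte size and the inferred schema. The helper name make_csv_dir is hypothetical.

```python
import os
import tempfile


def make_csv_dir(root):
    """Create root/csv containing 1.csv (8 bytes) and 2.csv (0 bytes)."""
    csv_dir = os.path.join(root, "csv")
    os.makedirs(csv_dir)
    with open(os.path.join(csv_dir, "1.csv"), "wb") as f:
        # Assumed contents: header line plus one integer row, 8 bytes total.
        f.write(b"a,b\n1,2\n")
    # Completely empty file: no header and no rows.
    open(os.path.join(csv_dir, "2.csv"), "wb").close()
    return csv_dir


csv_dir = make_csv_dir(tempfile.mkdtemp())
sizes = {
    name: os.path.getsize(os.path.join(csv_dir, name))
    for name in sorted(os.listdir(csv_dir))
}
print(sizes)
```

Pointing the load() call from the session above at os.path.join(csv_dir, "*.csv") should then exercise the same two-file case.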

> InferSchema "all Strings" if one of many CSVs is empty
> ------------------------------------------------------
>
>                 Key: SPARK-25199
>                 URL: https://issues.apache.org/jira/browse/SPARK-25199
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.2.1
>         Environment: I discovered this on AWS Glue, which uses Spark 2.2.1
>            Reporter: Neil McGuigan
>            Priority: Minor
>              Labels: newbie
>
> Spark can load multiple CSV files in one read:
> df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/*.csv")
> However, if one of these files is empty (though it has a header), Spark will 
> set all column types to "String".
> Spark should skip a file for inference if it contains no (non-header) rows.
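
The behaviour requested in the description can be approximated on the caller's side. The following is only a sketch of a possible workaround, not how Spark handles this internally: it filters out files with no non-header rows before the read. The helper names has_data_rows and non_empty_csvs are hypothetical, and the approach assumes the files live on a local filesystem that Python's glob can see, which would not hold for S3 or HDFS paths.

```python
import csv
import glob


def has_data_rows(path, header=True):
    """Return True if the CSV at `path` has at least one non-header row."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        if header:
            next(reader, None)  # skip the header line if there is one
        # Blank lines come back as empty lists, which count as no data.
        return any(row for row in reader)


def non_empty_csvs(pattern):
    """Expand a local glob pattern and keep only files that contain data."""
    return [p for p in sorted(glob.glob(pattern)) if has_data_rows(p)]


# Usage with an existing SparkSession named `spark` (not created here):
#   paths = non_empty_csvs("/some/local/dir/*.csv")
#   df = spark.read.csv(paths, header=True, inferSchema=True)
```

spark.read.csv accepts a list of paths, so the filtered list can be passed directly. Schema inference then only sees files that contain data rows.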



