GitHub user sureshthalamati opened a pull request:

    https://github.com/apache/spark/pull/12904

    [SPARK-15125][SQL] New option for the CSV data source that lets users specify how to interpret empty quoted strings.

    ## What changes were proposed in this pull request?
    This patch adds a new boolean option, emptyAsNull, to the CSV data source so that users can specify whether empty quoted strings are interpreted as null or as an empty string. The default is to interpret them as null, which matches the current behavior.
    
    Example:
    Input data:
    year,make,model,comment,price
    2016,Chevy,Bolt,"",29000.00
    2015,Porsche,"",,
    
    emptyAsNull = true (default, current behavior):
    
    scala> val df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").option("nullValue", null).load("/tmp/test.csv")
    
    scala> df.filter("model is null").show
    +----+-------+-----+-------+-----+
    |year|   make|model|comment|price|
    +----+-------+-----+-------+-----+
    |2015|Porsche| null|   null| null|
    +----+-------+-----+-------+-----+
    
    emptyAsNull = false:
    
    scala> val df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").option("nullValue", null).option("emptyAsNull", "false").load("/tmp/test.csv")
    
    scala> df.filter("model is null").show
    +----+----+-----+-------+-----+
    |year|make|model|comment|price|
    +----+----+-----+-------+-----+
    +----+----+-----+-------+-----+
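
The rule illustrated by the two runs above can be sketched without Spark. This is a minimal model, not the patch's implementation: `Field`, `EmptyAsNullSketch`, and `interpret` are hypothetical names, and the sketch assumes that an unquoted empty field stays null regardless of the option, which the example above only shows for the default setting.

```scala
// Hypothetical model of one parsed CSV field: its text and whether it was quoted.
final case class Field(text: String, quoted: Boolean)

object EmptyAsNullSketch {
  // Returns None to stand for SQL NULL, Some(value) otherwise.
  def interpret(field: Field, emptyAsNull: Boolean): Option[String] =
    if (field.text.nonEmpty) Some(field.text)        // non-empty values pass through
    else if (field.quoted && !emptyAsNull) Some("")  // "" kept as an empty string
    else None                                        // assumption: every other empty field is null
}
```

Under this model, the quoted empty `model` value in the Porsche row maps to `None` when `emptyAsNull` is true and to `Some("")` when it is false, which is why the second `model is null` filter returns no rows.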
    
    ## How was this patch tested?
    
    Added new unit tests to the CSVSuite. 
    
    @falaki @rxin 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sureshthalamati/spark empstring_fix_spark-15125

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12904.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12904
    
----
commit e6207e73e547d4ff1b564ec2ffc8a10cd7c00b02
Author: sureshthalamati <[email protected]>
Date:   2016-05-04T18:35:59Z

    This patch adds a boolean option, emptyAsNull, to the CSV data source so that users can specify whether empty quoted strings are interpreted as null or as an empty string.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
