[GitHub] spark pull request #13984: [SPARK-16310][SPARKR] R na.string-like default fo...

felixcheung Mon, 04 Jul 2016 14:25:04 -0700

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13984#discussion_r69492220
  
    --- Diff: R/pkg/R/SQLContext.R ---
    @@ -744,6 +747,9 @@ read.df.default <- function(path = NULL, source = NULL, schema = NULL, ...) {
       if (is.null(source)) {
         source <- getDefaultSqlSource()
       }
    +  if (source == "csv" && is.null(options[["nullValue"]])) {
    --- End diff --
    
    AFAIK, R's read.table is the general form behind read.csv, read.csv2 and
    read.delim, and it only applies to delimited text files:
    https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
    
    Unlike in a delimited/csv file, an R NA is typically null in JSON, represented as
      "myString": null
    
    But there is no consistent approach from what I can see in R. There is no
    support for JSON in base R. There are jsonlite, RJSONIO, and rjson, and the
    relevant argument could be `na` or `.na` (but again it typically defaults to
    "null").
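
To illustrate the difference described above, here is a minimal Python sketch using only the standard library. The sample data and the helper name `read_csv_with_na` are made up for illustration and are not taken from Spark or R. In CSV a missing value has to be spelled as some string token, so the reader must be told which token means NA; JSON has a native null and needs no such token.

```python
import csv
import io
import json


def read_csv_with_na(text, na_strings=("NA",)):
    """Parse CSV text, mapping any field equal to an NA token to None.

    Hypothetical helper: it mimics the idea behind R's na.strings,
    not its full behaviour.
    """
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: (None if value in na_strings else value) for key, value in row.items()}
        for row in reader
    ]


# Made-up sample data: the same two records in both formats.
csv_text = "id,myString\n1,hello\n2,NA\n"
json_text = '[{"id": "1", "myString": "hello"}, {"id": "2", "myString": null}]'

csv_rows = read_csv_with_na(csv_text)
json_rows = json.loads(json_text)

# Both sources yield the same missing value, but only the CSV reader
# had to be told which string stands for it.
print(csv_rows == json_rows)
```

Without an NA token (for example `na_strings=()`), the CSV field would come back as the literal string "NA", which is the case a default for the csv source is meant to address.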
    
    I think it would be interesting to support custom null/NA mapping for
    other text data sources.
    
    From what I can see, nullValue is only supported in Spark for the csv data
    source.
    
https://github.com/apache/spark/blob/0ad6ce7e54b1d8f5946dde652fa5341d15059158/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L373
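
The condition in the diff can be read as: only fill in a default when the source is csv and the caller has not set nullValue. A rough Python sketch of that logic follows. The function name `with_default_null_value` and the default token "NA" are assumptions for illustration; the diff shows only the condition, not the body of the branch.

```python
def with_default_null_value(source, options, default="NA"):
    """Return a copy of options with nullValue defaulted for csv only.

    Hypothetical helper; "NA" as the default token is an assumption.
    """
    merged = dict(options)
    # Leave an explicit nullValue alone, and do not touch other sources
    # such as json, which have their own representation of null.
    if source == "csv" and merged.get("nullValue") is None:
        merged["nullValue"] = default
    return merged


print(with_default_null_value("csv", {}))
print(with_default_null_value("csv", {"nullValue": ""}))
print(with_default_null_value("json", {}))
```

Restricting the default to csv matches the observation that nullValue is only honoured by that data source.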
    
    
    _____________________________
    From: Shivaram Venkataraman
    Sent: Monday, July 4, 2016 12:33 PM
    Subject: Re: [apache/spark] [SPARK-16310][SPARKR] R na.string-like default for csv source (#13984)
    To: apache/spark
    Cc: Felix Cheung, Author
    
    
    
    In R/pkg/R/SQLContext.R <https://github.com/apache/spark/pull/13984#discussion_r69487213>:
    
    > @@ -744,6 +747,9 @@ read.df.default <- function(path = NULL, source = NULL, schema = NULL, ...) {
    >    if (is.null(source)) {
    >      source <- getDefaultSqlSource()
    >    }
    > +  if (source == "csv" && is.null(options[["nullValue"]])) {
    
    I think na.strings works for read.table and not just for read.csv in R? Is
    the concern that NA is not a good default for other formats like JSON?
    



