[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...

HyukjinKwon Wed, 08 Mar 2017 01:28:05 -0800

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17178#discussion_r104876261
  
    --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
    @@ -1342,28 +1342,52 @@ test_that("column functions", {
       df <- read.json(mapTypeJsonPath)
       j <- collect(select(df, alias(to_json(df$info), "json")))
       expect_equal(j[order(j$json), ][1], "{\"age\":16,\"height\":176.5}")
    -  df <- as.DataFrame(j)
    -  schema <- structType(structField("age", "integer"),
    -                       structField("height", "double"))
    -  s <- collect(select(df, alias(from_json(df$json, schema), "structcol")))
    -  expect_equal(ncol(s), 1)
    -  expect_equal(nrow(s), 3)
    -  expect_is(s[[1]][[1]], "struct")
    -  expect_true(any(apply(s, 1, function(x) { x[[1]]$age == 16 } )))
    -
    -  # passing option
    -  df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
    -  schema2 <- structType(structField("date", "date"))
    -  expect_error(tryCatch(collect(select(df, from_json(df$col, schema2))),
    -                        error = function(e) { stop(e) }),
    -               paste0(".*(java.lang.NumberFormatException: For input string:).*"))
    -  s <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/yyyy")))
    -  expect_is(s[[1]][[1]]$date, "Date")
    -  expect_equal(as.character(s[[1]][[1]]$date), "2014-10-21")
    -
    -  # check for unparseable
    -  df <- as.DataFrame(list(list("a" = "")))
    -  expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA)
    +
    +  schemas <- list(structType(structField("age", "integer"), structField("height", "double")),
    +                  "struct<age:integer,height:double>")
    --- End diff --
    
    Hm, @felixcheung, I think this resembles a catalog string. Maybe we could 
reuse `CatalystSqlParser.parseDataType` to make this more formal and to avoid 
duplicating the effort of defining and documenting a format. This is a big 
change, but if this is what we want in the future, I would argue that we should 
keep it this way.
    
    For the JSON string schema, there is an overloaded version of `from_json` 
that takes the schema as a JSON string. If we are going to expose it, that can 
be done easily.
    
    However, I think you meant it is a bigger change because we need to provide 
a way to produce this JSON string from types. To my knowledge, we can only 
specify the schema manually via this catalog string. Is this true? If so, I 
don't have a good idea for now on how to support this, and I would rather close 
this if you think so as well.
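
    To make the two formats discussed above concrete, here is a minimal Python 
sketch (not Spark code) that turns a flat catalog-style string such as 
`struct<age:integer,height:double>` into the JSON layout that a Spark struct 
schema serializes to. The helper name `catalog_struct_to_json` is hypothetical, 
every field is assumed to be nullable with empty metadata, and nested or 
parameterized types are rejected rather than parsed. On the JVM side, 
`CatalystSqlParser.parseDataType` would be the proper way to do the parsing.

```python
import json
import re


def catalog_struct_to_json(catalog: str) -> str:
    """Hypothetical helper, not part of Spark.

    Converts a flat catalog-style struct string into the JSON schema
    layout used for struct types (type/fields/name/nullable/metadata).
    """
    match = re.fullmatch(r"\s*struct<(.*)>\s*", catalog)
    if match is None:
        raise ValueError(f"not a struct catalog string: {catalog!r}")
    body = match.group(1)
    if "<" in body or "(" in body:
        # array<...>, map<...>, nested struct<...>, and decimal(p,s)
        # need a real parser; this sketch only handles flat fields.
        raise ValueError("nested or parameterized types are not handled")

    fields = []
    for part in body.split(","):
        name, sep, type_name = part.partition(":")
        if not sep or not name.strip() or not type_name.strip():
            raise ValueError(f"malformed field: {part!r}")
        fields.append({
            "name": name.strip(),
            "type": type_name.strip(),
            # Assumption: fields are nullable and carry no metadata.
            "nullable": True,
            "metadata": {},
        })
    return json.dumps({"type": "struct", "fields": fields})


if __name__ == "__main__":
    print(catalog_struct_to_json("struct<age:integer,height:double>"))
```

    The point of the sketch is only that going from the catalog string to the 
JSON form is mechanical for flat structs; the open question in this thread is 
which of the two forms should be exposed in R.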




