[ https://issues.apache.org/jira/browse/SPARK-23448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ahmed ZAROUI updated SPARK-23448:
---------------------------------
    Description:
I have the following JSON file that contains some noisy data (a String instead of an Array):
{code:java}
{"attr1":"val1","attr2":"[\"val2\"]"}
{"attr1":"val1","attr2":["val2"]}
{code}
And I need to specify the schema programmatically, like this:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

implicit val spark = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.ui.enabled", false)
  .config("spark.sql.caseSensitive", "True")
  .getOrCreate()
import spark.implicits._

// attr2 is declared as an array of strings, so the string value in the first record does not match
val schema = StructType(Seq(
  StructField("attr1", StringType, true),
  StructField("attr2", ArrayType(StringType, true), true)))
// input: path to the JSON file shown above
spark.read.schema(schema).json(input).collect().foreach(println)
{code}
The result given by this code is:
{code:java}
[null,null]
[val1,WrappedArray(val2)]
{code}
Instead of setting only the corrupted column to null, Spark sets every column of the first record to null.

  was:
I have the following JSON file that contains some noisy data (a String instead of an Array):
{code:java}
{"attr1":"val1","attr2":["val2"]}
{"attr1":"val1","attr2":"[\"val2\"]"}
{code}
And I need to specify the schema programmatically, like this:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

implicit val spark = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.ui.enabled", false)
  .config("spark.sql.caseSensitive", "True")
  .getOrCreate()
import spark.implicits._

val schema = StructType(Seq(
  StructField("attr1", StringType, true),
  StructField("attr2", ArrayType(StringType, true), true)))
spark.read.schema(schema).json(input).collect().foreach(println)
{code}
The result given by this code is:
{code:java}
[null,null]
[val1,WrappedArray(val2)]
{code}
Instead of setting only the corrupted column to null, Spark sets every column of the first record to null.

> Data encoding problem when not finding the right type
> -----------------------------------------------------
>
>                 Key: SPARK-23448
>                 URL: https://issues.apache.org/jira/browse/SPARK-23448
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.2
>         Environment: Tested locally on a Linux machine
>            Reporter: Ahmed ZAROUI
>            Priority: Major