[ 
https://issues.apache.org/jira/browse/SPARK-37981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-37981.
----------------------------------------
    Resolution: Duplicate

> Deletes columns with all Null as default.
> -----------------------------------------
>
>                 Key: SPARK-37981
>                 URL: https://issues.apache.org/jira/browse/SPARK-37981
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Bjørn Jørgensen
>            Priority: Major
>         Attachments: json_null.json
>
>
> Spark 3.2.1-RC2
> During write.json, Spark deletes columns that contain only nulls by default.
>
> Spark's dropFieldIfAllNull option defaults to false, according to
> https://spark.apache.org/docs/latest/sql-data-sources-json.html
> {code:python}
> from pyspark import pandas as ps
> import re
> import numpy as np
> import os
> import pandas as pd
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
> from pyspark.sql.types import StructType, StructField, StringType,IntegerType
> os.environ["PYARROW_IGNORE_TIMEZONE"]="1"
> def get_spark_session(app_name: str, conf: SparkConf):
>     conf.setMaster('local[*]')
>     conf \
>       .set('spark.driver.memory', '64g')\
>       .set("fs.s3a.access.key", "minio") \
>       .set("fs.s3a.secret.key", "") \
>       .set("fs.s3a.endpoint", "http://192.168.1.127:9000";) \
>       .set("spark.hadoop.fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>       .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>       .set("spark.sql.repl.eagerEval.enabled", "True") \
>       .set("spark.sql.adaptive.enabled", "True") \
>       .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>       .set("spark.sql.repl.eagerEval.maxNumRows", "10000") \
>       .set("sc.setLogLevel", "error")
>
>     return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
> spark = get_spark_session("Falk", SparkConf())
> d3 = spark.read.option("multiline", "true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
> import pyspark
> def sparkShape(dataFrame):
>     return (dataFrame.count(), len(dataFrame.columns))
> pyspark.sql.dataframe.DataFrame.shape = sparkShape
> print(d3.shape())
> (653610, 267)
> d3.write.json("d3.json")
> d3 = spark.read.json("d3.json/*.json")
> import pyspark
> def sparkShape(dataFrame):
>     return (dataFrame.count(), len(dataFrame.columns))
> pyspark.sql.dataframe.DataFrame.shape = sparkShape
> print(d3.shape())
> (653610, 186)
> {code}
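The round trip above can be reduced to a minimal, self-contained sketch (local SparkSession; the path /tmp/all_null_repro is illustrative). The likely mechanism is the write-side JSON option ignoreNullFields, which defaults to true: null fields are omitted from each written record, so a column that is null in every row leaves no trace in the output files and is absent from the schema inferred on re-read.

{code:python}
# Sketch: a column that is null in every row vanishes after a
# write.json + read.json round trip, because the writer omits null
# fields from each record by default.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.master("local[1]").appName("null-col-repro").getOrCreate()

schema = StructType([
    StructField("a", StringType()),
    StructField("b", StringType()),  # null in every row
])
df = spark.createDataFrame([("x", None), ("y", None)], schema)
print(len(df.columns))  # 2

df.write.mode("overwrite").json("/tmp/all_null_repro")
reread = spark.read.json("/tmp/all_null_repro/*.json")
print(len(reread.columns))  # 1 -- column "b" no longer exists
{code}

Reading the round-tripped files with dropFieldIfAllNull set to false cannot restore the column, because the information was already dropped at write time.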



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
