[jira] [Updated] (SPARK-30149) Schema Definition Spark Read

Joby Joje (Jira) Fri, 06 Dec 2019 03:13:23 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-30149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Joby Joje updated SPARK-30149:
------------------------------
    Attachment:     (was: Schema.py)

> Schema Definition Spark Read
> ----------------------------
>
>                 Key: SPARK-30149
>                 URL: https://issues.apache.org/jira/browse/SPARK-30149
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Joby Joje
>            Priority: Blocker
>
> Reading a CSV file with defined schema I am able to load the files and do the 
> processing, which works fine using the below code. The schema is defined as 
> to strictly follow the datatype to record precision's with accuracy.
> {code:java}
> source_schema = StructType([source_schema = StructType([ StructField("NAME", 
> StringType(), True), StructField("AGE", StringType(), True), 
> StructField("GENDER", StringType(), True), StructField("PROFESSION", 
> StringType(), True), StructField("SALARY", DecimalType(38, 14), True), 
> StructField("BAD_RECORD", StringType(), True)]){code}
> {code:java}
> df_raw_file = sparksession.read \ 
> .format("csv") \ 
> .option("delimiter", '\t') \ 
> .option("header", "false") \ 
> .option("inferSchema", "true") \ 
> .option("columnNameOfCorruptRecord", "BAD_RECORD") \ 
> .schema(source_schema) \ .load(in_file_list) \ 
> .withColumn("LINE_NUMBER", monotonically_increasing_id()) \ 
> .withColumn("SOURCE_FILE_NAME", input_file_name()){code}
>  As per the Spark Documentation the mode is {{PERMISSIVE}} by default is its 
> not set and A record with less/more tokens than schema is not a corrupted 
> record to CSV. When it meets a record having fewer tokens than the length of 
> the schema, sets {{null}} to extra fields. When the record has more tokens 
> than the length of the schema, it drops extra tokens.
> {code:java}
> FILE SCHEMA TEST.CSVFILE SCHEMA TEST.CSV
> root 
> |-- NAME: string (nullable = true) 
> |-- AGE: string (nullable = true) 
> |-- GENDER: string (nullable = true) 
> |-- PROFESSION: string (nullable = true) 
> |-- SALARY: decimal(38,14) (nullable = true) 
> |-- BAD_RECORD: string (nullable = true) 
> |-- LINE_NUMBER: long (nullable = false) 
> |-- SOURCE_FILE_NAME: string (nullable = false)
> OUTPUT THE FILE TEST.CSV
> +------+----+------+----------+------------------+---------------------------------+-----------+-----------------+
> |NAME  |AGE |GENDER|PROFESSION|SALARY            |BAD_RECORD                  
>      |LINE_NUMBER|SOURCE_FILE_NAME |
> +------+----+------+----------+------------------+---------------------------------+-----------+-----------------+
> |null  |null|null  |null      |null              |NAME        AGE     GENDER  
> PROFESSION      SALARY|0          |Test.CSV|
> |JOHN  |27  |MALE  |CEO       |300.12314234500000|null                        
>      |1          |Test.CSV|
> |JUSTIN|67  |MALE  |CTO       |123.23453543450000|null                        
>      |2          |Test.CSV|
> |SARAH |45  |FEMALE|CS        |null              |null                        
>      |3          |Test.CSV|
> |SEAN  |66  |MALE  |CA        |null              |SEAN        66      MALE    
> CA                  |4          |Test.CSV|
> |PHIL  |34  |MALE  |null      |234.98698600000000|null                        
>      |5          |Test.CSV|
> |null  |null|null  |null      |null              |JILL        25              
>         BOARD                    |6          |Test.CSV|
> |JACK  |30  |MALE  |BOARD     |null              |JACK        30      MALE    
> BOARD               |7          |Test.CSV|
> +------+----+------+----------+------------------+---------------------------------+-----------+-----------------+{code}
>  
> The TEST1.CSV doesnt have the SALARY column so it should have NULLED the 
> column and the BAD_RECORD column should be NULL for the rows, that doesnt 
> seem to happen and the values are considered as CORRUPT. 
> Also when it meets a corrupted record, puts the malformed string into a field 
> configured by {{columnNameOfCorruptRecord}}, and sets other fields to 
> {{null}} this is also not happening and I see this happening only for the 
> JILL row.
> {code:java}
> FILE SCHEMA TEST1.CSV 
> root 
> |-- NAME: string (nullable = true) 
> |-- AGE: string (nullable = true) 
> |-- GENDER: string (nullable = true) 
> |-- PROFESSION: string (nullable = true) 
> |-- SALARY: decimal(38,14) (nullable = true) 
> |-- BAD_RECORD: string (nullable = true) 
> |-- LINE_NUMBER: long (nullable = false) 
> |-- SOURCE_FILE_NAME: string (nullable = false)
> OUTPUT THE FILE TEST1.CSV
> +------+----+------+----------+------+--------------------------+-----------+-----------------+
>  |NAME |AGE |GENDER|PROFESSION|SALARY|BAD_RECORD 
> |LINE_NUMBER|SOURCE_FILE_NAME | 
> +------+----+------+----------+------+--------------------------+-----------+-----------------+
>  
> |NAME |AGE |GENDER|PROFESSION|null |NAME AGE GENDER PROFESSION|0 |Test1.CSV| 
> |JOHN |27 |MALE |CEO |null |JOHN 27 MALE CEO |1 |Test1.CSV| 
> |JUSTIN|67 |MALE |CTO |null |JUSTIN 67 MALE CTO |2 |Test1.CSV| 
> |SARAH |45 |FEMALE|CS |null |SARAH 45 FEMALE CS |3 |Test1.CSV| 
> |SEAN |66 |MALE |CA |null |SEAN 66 MALE CA |4 |Test1.CSV| 
> |PHIL |34 |MALE |null |null |PHIL 34 MALE |5 |Test1.CSV| 
> |null |null|null |null |null |JILL 25 BOARD |6 |Test1.CSV| 
> |JACK |30 |MALE |BOARD |null |JACK 30 MALE BOARD |7 |Test1.CSV| 
> +------+----+------+----------+------+--------------------------+-----------+-----------------+{code}
>  Attached the code, output and source files used for test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-30149) Schema Definition Spark Read

Reply via email to