[ 
https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean Georges Perrin updated SPARK-26972:
----------------------------------------
    Description: 
 

 

Issue with CSV import and inferSchema set to true.

I found a few discrepancies while working with inferSchema set to true in CSV 
ingestion.

Given the following CSV in the attached books.csv:
{noformat}
id;authorId;title;releaseDate;link
1;1;Fantastic Beasts and Where to Find Them: The Original Screenplay;11/18/16;http://amzn.to/2kup94P
2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP
3;1;*The Tales of Beedle the Bard, Standard Edition (Harry Potter)*;12/4/08;http://amzn.to/2kYezqr
4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n
5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT
6;2;*Development Tools in 2006: any Room for a 4GL-style Language?
An independent study by Jean Georges Perrin, IIUG Board Member*;12/28/16;http://amzn.to/2vBxOe1
7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav
8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD
10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA
11;4;Diderot Encyclopedia: The Complete Illustrations 1762-1777;;http://amzn.to/2i2zo3I
12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ
13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW
14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk
15;7;Soft Skills: The software developer's life manual;12/29/14;http://amzn.to/2zNnSyn
16;8;Of Mice and Men;;http://amzn.to/2zJjXoc
17;9;*Java 8 in Action: Lambdas; Streams; and functional-style programming*;8/28/14;http://amzn.to/2isdqoL
18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY
19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG
20;14;*Fables choisies; mises en vers par M. de La Fontaine*;9/1/1999;http://amzn.to/2yRH10W
21;15;Discourse on Method and Meditations on First Philosophy;6/15/1999;http://amzn.to/2hwB8zc
22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo
23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo
{noformat}
 


And this code:

{noformat}
Dataset<Row> df = spark.read().format("csv")
    .option("header", "true")
    .option("multiline", true)
    .option("sep", ";")
    .option("quote", "*")
    .option("dateFormat", "M/d/y")
    .option("inferSchema", true)
    .load("data/books.csv");
df.show(7);
df.printSchema();
{noformat}
h1. In Spark v2.0.1

Excerpt of the dataframe content:
{noformat}
+---+--------+--------------------+-----------+--------------------+
| id|authorId|               title|releaseDate|                link|
+---+--------+--------------------+-----------+--------------------+
|  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
|  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
|  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
|  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
|  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
|  6|       2|Development Tools...|   12/28/16|http://amzn.to/2v...|
|  7|       3|Adventures of Huc...|    5/26/94|http://amzn.to/2w...|
+---+--------+--------------------+-----------+--------------------+
only showing top 7 rows
{noformat}
Dataframe's schema:
{noformat}
root
 |-- id: integer (nullable = true)
 |-- authorId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- releaseDate: string (nullable = true)
 |-- link: string (nullable = true)
{noformat}

*This is fine and the expected output*.
h1. Using Apache Spark v2.1.3

Excerpt of the dataframe content:

{noformat}
+--------------------+--------+--------------------+-----------+--------------------+
|                  id|authorId|               title|releaseDate|                link|
+--------------------+--------+--------------------+-----------+--------------------+
|                   1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
|                   2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
|                   3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
|                   4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
|                   5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
|                   6|       2|Development Tools...|       null|                null|
|An independent st...|12/28/16|http://amzn.to/2v...|       null|                null|
+--------------------+--------+--------------------+-----------+--------------------+
only showing top 7 rows
{noformat}
Dataframe's schema:
{noformat}
root
 |-- id: string (nullable = true)
 |-- authorId: string (nullable = true)
 |-- title: string (nullable = true)
 |-- releaseDate: string (nullable = true)
 |-- link: string (nullable = true)
{noformat}

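The *multiline* option is *not recognized*: record 6 is split into two rows and, as a result, the schema is wrong ({{id}} and {{authorId}} are inferred as string instead of integer).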
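
To make the symptom easier to reason about, here is a minimal plain-Java sketch (no Spark involved) of the difference between splitting records on every line feed and splitting only outside quoted fields. It is not Spark's parser: the class and method names are hypothetical, the sample is records 5 to 7 of books.csv, and the "inference" only checks whether the first field of each record parses as an int.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not how Spark implements CSV parsing.
class MultilineCsvSketch {

    // Records 5 to 7 of books.csv; record 6 contains a line feed inside the quoted title.
    static final String SAMPLE =
        "5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; "
            + "the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT\n"
        + "6;2;*Development Tools in 2006: any Room for a 4GL-style Language?\n"
        + "An independent study by Jean Georges Perrin, IIUG Board Member*;12/28/16;"
            + "http://amzn.to/2vBxOe1\n"
        + "7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav\n";

    /** Naive split: every line feed ends a record, quotes are ignored. */
    static List<String> splitNaive(String text) {
        List<String> records = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (!line.isEmpty()) {
                records.add(line);
            }
        }
        return records;
    }

    /** Quote-aware split: a line feed inside a quoted field stays in the record. */
    static List<String> splitQuoteAware(String text, char quote) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : text.toCharArray()) {
            if (c == quote) {
                inQuotes = !inQuotes;
            }
            if (c == '\n' && !inQuotes) {
                if (current.length() > 0) {
                    records.add(current.toString());
                }
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    /** Returns "integer" if the first field of every record parses as an int, else "string". */
    static String inferFirstColumn(List<String> records, String sep) {
        for (String record : records) {
            String first = record.split(sep, -1)[0];
            try {
                Integer.parseInt(first.trim());
            } catch (NumberFormatException e) {
                return "string";
            }
        }
        return "integer";
    }

    public static void main(String[] args) {
        List<String> naive = splitNaive(SAMPLE);
        List<String> aware = splitQuoteAware(SAMPLE, '*');
        System.out.println(naive.size() + " records, id inferred as " + inferFirstColumn(naive, ";"));
        System.out.println(aware.size() + " records, id inferred as " + inferFirstColumn(aware, ";"));
    }
}
```

With the naive split, the second half of record 6 becomes a record of its own whose first field is text, which is consistent with the extra row and the string-typed {{id}} column shown above.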
h1. Using Apache Spark v2.2.3

Excerpt of the dataframe content:

{noformat}
+---+--------+--------------------+-----------+--------------------+
| id|authorId|               title|releaseDate|               link
|
+---+--------+--------------------+-----------+--------------------+
|  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
|  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
|  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
|  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
|  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
|  6|       2|Development Tools...|   12/28/16|http://amzn.to/2v...|
|  7|       3|Adventures of Huc...|    5/26/94|http://amzn.to/2w...|
+---+--------+--------------------+-----------+--------------------+
only showing top 7 rows
{noformat}
Dataframe's schema:
{noformat}
root
 |-- id: integer (nullable = true)
 |-- authorId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- releaseDate: string (nullable = true)
 |-- link
: string (nullable = true)
{noformat}

The *link* column *has a carriage return* at the end of its name. If I then run:

{{df.show(7, 90);}}

I get:

{noformat}
+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+
| id|authorId|                                                                                     title|releaseDate|                  link
|
+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+
|  1|       1|                          Fantastic Beasts and Where to Find Them: The Original Screenplay|   11/18/16|http://amzn.to/2kup94P
|
|  2|       1|     Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)|    10/6/15|http://amzn.to/2l2lSwP
|
|  3|       1|                             The Tales of Beedle the Bard, Standard Edition (Harry Potter)|    12/4/08|http://amzn.to/2kYezqr
|
|  4|       1|   Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)|    10/4/16|http://amzn.to/2kYhL5n
|
|  5|       2|Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; a...|    4/23/17|http://amzn.to/2i3mthT
|
|  6|       2|Development Tools in 2006: any Room for a 4GL-style Language?
An independent study by...|   12/28/16|http://amzn.to/2vBxOe1
|
|  7|       3|                                                            Adventures of Huckleberry Finn|    5/26/94|http://amzn.to/2wOeOav
|
+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+
{noformat}

The *carriage return is also added to the last cell* of each row.

Same behavior in v2.3.3 and v2.4.0.
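
The report does not identify the cause. One possible explanation, which is purely an assumption here, is that books.csv uses CRLF line endings and that only the line feed is consumed as the record delimiter. The plain-Java sketch below (hypothetical class and method names, no Spark involved) shows how that would leave a {{\r}} attached to the last field of each line, and how stripping it restores the expected name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only; it assumes CRLF line endings in the input file.
class TrailingCarriageReturnSketch {

    /** Splits one line into fields on the given separator, keeping empty fields. */
    static List<String> fields(String line, String sep) {
        return Arrays.asList(line.split(sep, -1));
    }

    /** Removes a single trailing carriage return, if present. */
    static String stripTrailingCr(String value) {
        return value.endsWith("\r") ? value.substring(0, value.length() - 1) : value;
    }

    /** Returns the header fields obtained when records are split on "\n" only. */
    static List<String> headerSplitOnLineFeed(String fileContent, String sep) {
        String headerLine = fileContent.split("\n")[0];
        return fields(headerLine, sep);
    }

    public static void main(String[] args) {
        // Assumption: Windows-style CRLF line endings.
        String file = "id;authorId;title;releaseDate;link\r\n"
            + "7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav\r\n";

        // Splitting on "\n" only leaves the '\r' attached to the last column name.
        List<String> raw = headerSplitOnLineFeed(file, ";");
        System.out.println(raw.get(4).length()); // 5: "link" plus '\r'

        // Stripping the trailing '\r' gives back the plain column name.
        List<String> cleaned = raw.stream()
            .map(TrailingCarriageReturnSketch::stripTrailingCr)
            .collect(Collectors.toList());
        System.out.println(cleaned.get(4).length()); // 4
    }
}
```

If the assumption holds, the same stray character would explain both the broken column name in the schema and the line break before the closing {{|}} of every row.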

If I specify the schema explicitly, as in:

{noformat}
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField(
        "id",
        DataTypes.IntegerType,
        false),
    DataTypes.createStructField(
        "authordId",
        DataTypes.IntegerType,
        true),
    DataTypes.createStructField(
        "bookTitle",
        DataTypes.StringType,
        false),
    DataTypes.createStructField(
        "releaseDate",
        DataTypes.DateType,
        true), // nullable, but this will be ignored
    DataTypes.createStructField(
        "url",
        DataTypes.StringType,
        false) });

// Reads a CSV file with header, called books.csv, stores it in a dataframe
Dataset<Row> df = spark.read().format("csv")
    .option("header", "true")
    .option("multiline", true)
    .option("sep", ";")
    .option("dateFormat", "M/d/y")
    .option("quote", "*")
    .schema(schema)
    .load("data/books.csv");
{noformat}

The output matches what is expected in every version *except 2.1.3, 
where Spark simply crashes*.

All the code can be downloaded from GitHub at: 
[https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch07].

 

 


> Issue with CSV import and inferSchema set to true
> -------------------------------------------------
>
>                 Key: SPARK-26972
>                 URL: https://issues.apache.org/jira/browse/SPARK-26972
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.1.3, 2.3.3, 2.4.0
>         Environment: Java 8/Scala 2.11/MacOs
>            Reporter: Jean Georges Perrin
>            Priority: Major
>         Attachments: ComplexCsvToDataframeApp.java, 
> ComplexCsvToDataframeWithSchemaApp.java, books.csv, issue.txt, pom.xml
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
