[
https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jean Georges Perrin updated SPARK-26972:
----------------------------------------
Description:
Issue with CSV import and inferSchema set to true.
I found a few discrepancies while working with inferSchema set to true in CSV
ingestion.
Given the following CSV in the attached books.csv:
{noformat}
id;authorId;title;releaseDate;link
1;1;Fantastic Beasts and Where to Find Them: The Original Screenplay;11/18/16;http://amzn.to/2kup94P
2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP
3;1;*The Tales of Beedle the Bard, Standard Edition (Harry Potter)*;12/4/08;http://amzn.to/2kYezqr
4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n
5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT
6;2;*Development Tools in 2006: any Room for a 4GL-style Language?
An independent study by Jean Georges Perrin, IIUG Board Member*;12/28/16;http://amzn.to/2vBxOe1
7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav
8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD
10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA
11;4;Diderot Encyclopedia: The Complete Illustrations 1762-1777;;http://amzn.to/2i2zo3I
12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ
13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW
14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk
15;7;Soft Skills: The software developer's life manual;12/29/14;http://amzn.to/2zNnSyn
16;8;Of Mice and Men;;http://amzn.to/2zJjXoc
17;9;*Java 8 in Action: Lambdas; Streams; and functional-style programming*;8/28/14;http://amzn.to/2isdqoL
18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY
19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG
20;14;*Fables choisies; mises en vers par M. de La Fontaine*;9/1/1999;http://amzn.to/2yRH10W
21;15;Discourse on Method and Meditations on First Philosophy;6/15/1999;http://amzn.to/2hwB8zc
22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo
23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo
{noformat}
And this Java code:
{code:java}
Dataset<Row> df = spark.read().format("csv")
.option("header", "true")
.option("multiline", true)
.option("sep", ";")
.option("quote", "*")
.option("dateFormat", "M/d/y")
.option("inferSchema", true)
.load("data/books.csv");
df.show(7);
df.printSchema();
{code}
h1. In Spark v2.0.1
Output:
{noformat}
+---+--------+--------------------+-----------+--------------------+
| id|authorId|               title|releaseDate|                link|
+---+--------+--------------------+-----------+--------------------+
|  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
|  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
|  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
|  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
|  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
|  6|       2|Development Tools...|   12/28/16|http://amzn.to/2v...|
|  7|       3|Adventures of Huc...|    5/26/94|http://amzn.to/2w...|
+---+--------+--------------------+-----------+--------------------+
only showing top 7 rows
Dataframe's schema:
root
 |-- id: integer (nullable = true)
 |-- authorId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- releaseDate: string (nullable = true)
 |-- link: string (nullable = true)
{noformat}
*This is fine and the expected output*.
h1. Using Apache Spark v2.1.3
Excerpt of the dataframe content:
{noformat}
+--------------------+--------+--------------------+-----------+--------------------+
|                  id|authorId|               title|releaseDate|                link|
+--------------------+--------+--------------------+-----------+--------------------+
|                   1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
|                   2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
|                   3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
|                   4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
|                   5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
|                   6|       2|Development Tools...|       null|                null|
|An independent st...|12/28/16|http://amzn.to/2v...|       null|                null|
+--------------------+--------+--------------------+-----------+--------------------+
only showing top 7 rows
Dataframe's schema:
root
 |-- id: string (nullable = true)
 |-- authorId: string (nullable = true)
 |-- title: string (nullable = true)
 |-- releaseDate: string (nullable = true)
 |-- link: string (nullable = true)
{noformat}
The *multiline* option is *not recognized*: the record for book 6 is split
across two rows and, as a result, every column is inferred as a string, so the
schema is wrong.
h1. Using Apache Spark v2.2.3
Excerpt of the dataframe content:
{noformat}
+---+--------+--------------------+-----------+--------------------+
| id|authorId|               title|releaseDate|               link
|
+---+--------+--------------------+-----------+--------------------+
|  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
|  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
|  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
|  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
|  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
|  6|       2|Development Tools...|   12/28/16|http://amzn.to/2v...|
|  7|       3|Adventures of Huc...|    5/26/94|http://amzn.to/2w...|
+---+--------+--------------------+-----------+--------------------+
only showing top 7 rows
Dataframe's schema:
root
 |-- id: integer (nullable = true)
 |-- authorId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- releaseDate: string (nullable = true)
 |-- link
: string (nullable = true)
{noformat}
The *link* column *has a carriage return* at the end of its name. If I widen
the displayed columns with:
{code:java}
df.show(7, 90);
{code}
I get:
{noformat}
+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+
| id|authorId|                                                                                     title|releaseDate|                  link
|
+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+
|  1|       1|                          Fantastic Beasts and Where to Find Them: The Original Screenplay|   11/18/16|http://amzn.to/2kup94P
|
|  2|       1|     Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)|    10/6/15|http://amzn.to/2l2lSwP
|
|  3|       1|                             The Tales of Beedle the Bard, Standard Edition (Harry Potter)|    12/4/08|http://amzn.to/2kYezqr
|
|  4|       1|   Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)|    10/4/16|http://amzn.to/2kYhL5n
|
|  5|       2|Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; a...|    4/23/17|http://amzn.to/2i3mthT
|
|  6|       2|Development Tools in 2006: any Room for a 4GL-style Language?
An independent study by...|   12/28/16|http://amzn.to/2vBxOe1
|
|  7|       3|                                                            Adventures of Huckleberry Finn|    5/26/94|http://amzn.to/2wOeOav
|
+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+
{noformat}
The *carriage return is added to the last cell* of each row.
Same behavior in v2.3.3 and v2.4.0.
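A minimal sketch of how such a trailing carriage return can arise, assuming (this is not confirmed here) that books.csv uses CRLF line endings and that records end up being split on \n only. It uses plain Java rather than Spark's CSV reader, so it illustrates the suspected mechanism, not Spark's actual code path. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class TrailingCarriageReturnDemo {

    // Splits one record on the separator, with no line-ending cleanup.
    static List<String> splitRaw(String record, String sep) {
        // Limit -1 keeps trailing empty fields such as a missing last column.
        return Arrays.asList(record.split(Pattern.quote(sep), -1));
    }

    // Same split, after dropping a single trailing '\r' left over from CRLF.
    static List<String> splitStripped(String record, String sep) {
        String cleaned = record.endsWith("\r")
                ? record.substring(0, record.length() - 1)
                : record;
        return splitRaw(cleaned, sep);
    }

    public static void main(String[] args) {
        // Assumption: two lines of books.csv as they would look with CRLF endings.
        String content = "id;authorId;title;releaseDate;link\r\n"
                + "7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav\r\n";

        for (String record : content.split("\n")) {
            List<String> raw = splitRaw(record, ";");
            List<String> stripped = splitStripped(record, ";");
            String rawLast = raw.get(raw.size() - 1);
            String strippedLast = stripped.get(stripped.size() - 1);
            System.out.println("raw ends with CR: " + rawLast.endsWith("\r")
                    + ", stripped ends with CR: " + strippedLast.endsWith("\r"));
        }
    }
}
```

If the assumption holds, the last header field would be the five-character name "link" plus a carriage return, which would be consistent with the broken column name and cell borders shown above.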
If I add the schema, like in:
{code:java}
// Creates the schema
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField(
"id",
DataTypes.IntegerType,
false),
DataTypes.createStructField(
"authordId",
DataTypes.IntegerType,
true),
DataTypes.createStructField(
"bookTitle",
DataTypes.StringType,
false),
DataTypes.createStructField(
"releaseDate",
DataTypes.DateType,
true), // nullable, but this will be ignored
DataTypes.createStructField(
"url",
DataTypes.StringType,
false) });
// GitHub version only: dumps the schema
SchemaInspector.print(schema);
// Reads a CSV file with header, called books.csv, stores it in a dataframe
Dataset<Row> df = spark.read().format("csv")
.option("header", "true")
.option("multiline", true)
.option("sep", ";")
.option("dateFormat", "M/d/y")
.option("quote", "*")
.schema(schema)
.load("data/books.csv");
{code}
The output matches what is expected in every version *except version 2.1.3,
where Spark simply crashes*.
All the code can be downloaded from GitHub at:
[https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch07.]
> Issue with CSV import and inferSchema set to true
> -------------------------------------------------
>
> Key: SPARK-26972
> URL: https://issues.apache.org/jira/browse/SPARK-26972
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 2.1.3, 2.3.3, 2.4.0
> Environment: Java 8/Scala 2.11/MacOs
> Reporter: Jean Georges Perrin
> Priority: Major
> Attachments: ComplexCsvToDataframeApp.java,
> ComplexCsvToDataframeWithSchemaApp.java, books.csv, issue.txt, pom.xml