[ https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jean Georges Perrin updated SPARK-26972:
----------------------------------------
Description:

Issue with CSV import and inferSchema set to true.

I found a few discrepancies while working with inferSchema set to true in CSV ingestion.

Given the following CSV:

{{id;authorId;title;releaseDate;link}}
{{1;1;Fantastic Beasts and Where to Find Them: The Original Screenplay;11/18/16;[http://amzn.to/2kup94P]}}
{{2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)*;10/6/15;[http://amzn.to/2l2lSwP]}}
{{3;1;The Tales of Beedle the Bard, Standard Edition (Harry Potter);12/4/08;[http://amzn.to/2kYezqr]}}
{{4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)*;10/4/16;[http://amzn.to/2kYhL5n]}}
{{5;2;Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; and a Great Database;4/23/17;[http://amzn.to/2i3mthT]}}
{{6;2;*Development Tools in 2006: any Room for a 4GL-style Language?}}
{{An independent study by Jean Georges Perrin, IIUG Board Member*;12/28/16;[http://amzn.to/2vBxOe1]}}
{{7;3;Adventures of Huckleberry Finn;5/26/94;[http://amzn.to/2wOeOav]}}
{{8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;[http://amzn.to/2x1NuoD]}}
{{10;4;Jacques le Fataliste;3/1/00;[http://amzn.to/2uZj2KA]}}
{{11;4;Diderot Encyclopedia: The Complete Illustrations 1762-1777;;[http://amzn.to/2i2zo3I]}}
{{12;;A Woman in Berlin;7/11/06;[http://amzn.to/2i472WZ]}}
{{13;6;Spring Boot in Action;1/3/16;[http://amzn.to/2hCPktW]}}
{{14;6;Spring in Action: Covers Spring 4;11/28/14;[http://amzn.to/2yJLyCk]}}
{{15;7;Soft Skills: The software developer's life manual;12/29/14;[http://amzn.to/2zNnSyn]}}
{{16;8;Of Mice and Men;;[http://amzn.to/2zJjXoc]}}
{{17;9;*Java 8 in Action: Lambdas; Streams; and functional-style programming*;8/28/14;[http://amzn.to/2isdqoL]}}
{{18;12;Hamlet;6/8/12;[http://amzn.to/2yRbewY]}}
{{19;13;Pensées;12/31/1670;[http://amzn.to/2jweHOG]}}
{{20;14;*Fables choisies; mises en vers par M. de La Fontaine*;9/1/1999;[http://amzn.to/2yRH10W]}}
{{21;15;Discourse on Method and Meditations on First Philosophy;6/15/1999;[http://amzn.to/2hwB8zc]}}
{{22;12;Twelfth Night;7/1/4;[http://amzn.to/2zPYnwo]}}
{{23;12;Macbeth;7/1/3;[http://amzn.to/2zPYnwo]}}

And this code:

{{Dataset<Row> df = spark.read().format("csv")}}
{{    .option("header", "true")}}
{{    .option("multiline", true)}}
{{    .option("sep", ";")}}
{{    .option("quote", "*")}}
{{    .option("dateFormat", "M/d/y")}}
{{    .option("inferSchema", true)}}
{{    .load("data/books.csv");}}
{{df.show(7);}}
{{df.printSchema();}}

h1. In Spark v2.0.1

{{Excerpt of the dataframe content:}}
{{+---+--------+--------------------+-----------+--------------------+}}
{{| id|authorId|               title|releaseDate|                link|}}
{{+---+--------+--------------------+-----------+--------------------+}}
{{|  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|}}
{{|  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|}}
{{|  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|}}
{{|  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|}}
{{|  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|}}
{{|  6|       2|Development Tools...|   12/28/16|http://amzn.to/2v...|}}
{{|  7|       3|Adventures of Huc...|    5/26/94|http://amzn.to/2w...|}}
{{+---+--------+--------------------+-----------+--------------------+}}
{{only showing top 7 rows}}
{{Dataframe's schema:}}
{{root}}
{{ |-- id: integer (nullable = true)}}
{{ |-- authorId: integer (nullable = true)}}
{{ |-- title: string (nullable = true)}}
{{ |-- releaseDate: string (nullable = true)}}
{{ |-- link: string (nullable = true)}}

*This is fine and the expected output*.

h1. Using Apache Spark v2.1.3

Excerpt of the dataframe content:
{{+--------------------+--------+--------------------+-----------+--------------------+}}
{{|                  id|authorId|               title|releaseDate|                link|}}
{{+--------------------+--------+--------------------+-----------+--------------------+}}
{{|                   1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|}}
{{|                   2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|}}
{{|                   3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|}}
{{|                   4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|}}
{{|                   5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|}}
{{|                   6|       2|Development Tools...|       null|                null|}}
{{|An independent st...|12/28/16|http://amzn.to/2v...|       null|                null|}}
{{+--------------------+--------+--------------------+-----------+--------------------+}}
{{only showing top 7 rows}}
{{Dataframe's schema:}}
{{root}}
{{ |-- id: string (nullable = true)}}
{{ |-- authorId: string (nullable = true)}}
{{ |-- title: string (nullable = true)}}
{{ |-- releaseDate: string (nullable = true)}}
{{ |-- link: string (nullable = true)}}

The *multiline* option is *not recognized*. As a result, the inferred schema is wrong: id and authorId come out as strings.

h1. Using Apache Spark v2.2.3

Excerpt of the dataframe content:
{{+---+--------+--------------------+-----------+--------------------+}}
{{| id|authorId|               title|releaseDate|               link}}
{{|}}
{{+---+--------+--------------------+-----------+--------------------+}}
{{|  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|}}
{{|  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|}}
{{|  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|}}
{{|  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|}}
{{|  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|}}
{{|  6|       2|Development Tools...|   12/28/16|http://amzn.to/2v...|}}
{{|  7|       3|Adventures of Huc...|    5/26/94|http://amzn.to/2w...|}}
{{+---+--------+--------------------+-----------+--------------------+}}
{{only showing top 7 rows}}
{{Dataframe's schema:}}
{{root}}
{{ |-- id: integer (nullable = true)}}
{{ |-- authorId: integer (nullable = true)}}
{{ |-- title: string (nullable = true)}}
{{ |-- releaseDate: string (nullable = true)}}
{{ |-- link}}
{{: string (nullable = true)}}

The *link* column *has a carriage return* at the end of its name.

If I use:

{{df.show(7, 90);}}

I get:

{{+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+}}
{{| id|authorId|                                                                                     title|releaseDate|                  link}}
{{|}}
{{+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+}}
{{|  1|       1|                          Fantastic Beasts and Where to Find Them: The Original Screenplay|   11/18/16|http://amzn.to/2kup94P}}
{{|}}
{{|  2|       1|     Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)|    10/6/15|http://amzn.to/2l2lSwP}}
{{|}}
{{|  3|       1|                             The Tales of Beedle the Bard, Standard Edition (Harry Potter)|    12/4/08|http://amzn.to/2kYezqr}}
{{|}}
{{|  4|       1|   Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)|    10/4/16|http://amzn.to/2kYhL5n}}
{{|}}
{{|  5|       2|Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; a...|    4/23/17|http://amzn.to/2i3mthT}}
{{|}}
{{|  6|       2|Development Tools in 2006: any Room for a 4GL-style Language? }}
{{An independent study by...|   12/28/16|http://amzn.to/2vBxOe1}}
{{|}}
{{|  7|       3|                                                            Adventures of Huckleberry Finn|    5/26/94|http://amzn.to/2wOeOav}}
{{|}}
{{+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+}}

The *carriage return is added to the last cell* of each row.

Same behavior in v2.3.3 and v2.4.0.

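A trailing carriage return in both the header and the last cell would be consistent with a file saved with CRLF ({{\r\n}}) line endings whose records end up split on {{\n}} only. The report does not state which line endings books.csv uses, so this is an assumption and not a diagnosis. The plain-Java sketch below involves no Spark at all; the class name, method names, and sample data are purely illustrative. It shows how the stray character ends up attached to the last field, and one way to strip it.

```java
/**
 * Illustrative sketch only (not Spark code): shows how CRLF line endings
 * leave a trailing carriage return on the last field when records are
 * split on line feed alone. All names here are hypothetical.
 */
public class CarriageReturnDemo {

    /** Splits raw content on line feed only, as a CRLF-unaware reader would. */
    static String[] splitOnLineFeed(String content) {
        return content.split("\n");
    }

    /** Returns the text after the last separator (the whole line if there is none). */
    static String lastField(String line, char sep) {
        return line.substring(line.lastIndexOf(sep) + 1);
    }

    /** Removes a single trailing carriage return, if present. */
    static String stripTrailingCr(String value) {
        return value.endsWith("\r") ? value.substring(0, value.length() - 1) : value;
    }

    /** Makes carriage returns visible when printing. */
    static String visible(String value) {
        return value.replace("\r", "\\r");
    }

    public static void main(String[] args) {
        // Two records with CRLF endings; the URL is a placeholder.
        String content = "id;link\r\n1;http://example.invalid/a\r\n";
        for (String line : splitOnLineFeed(content)) {
            String raw = lastField(line, ';');
            System.out.println("raw=[" + visible(raw) + "] cleaned=[" + stripTrailingCr(raw) + "]");
        }
    }
}
```

If the assumption holds, converting the file to LF line endings before loading it would be a quick way to confirm it.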
If I add the schema, like in:

{{StructType schema = DataTypes.createStructType(new StructField[] {}}
{{    DataTypes.createStructField(}}
{{        "id",}}
{{        DataTypes.IntegerType,}}
{{        false),}}
{{    DataTypes.createStructField(}}
{{        "authordId",}}
{{        DataTypes.IntegerType,}}
{{        true),}}
{{    DataTypes.createStructField(}}
{{        "bookTitle",}}
{{        DataTypes.StringType,}}
{{        false),}}
{{    DataTypes.createStructField(}}
{{        "releaseDate",}}
{{        DataTypes.DateType,}}
{{        true), // nullable, but this will be ignored}}
{{    DataTypes.createStructField(}}
{{        "url",}}
{{        DataTypes.StringType,}}
{{        false) });}}
{{// Reads a CSV file with header, called books.csv, stores it in a dataframe}}
{{Dataset<Row> df = spark.read().format("csv")}}
{{    .option("header", "true")}}
{{    .option("multiline", true)}}
{{    .option("sep", ";")}}
{{    .option("dateFormat", "M/d/y")}}
{{    .option("quote", "*")}}
{{    .schema(schema)}}
{{    .load("data/books.csv");}}

The output matches what is expected in every version *except version 2.1.3, where Spark simply crashes*.

All the code can be downloaded from GitHub at: [https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch07.]

was:

Issue with CSV import and inferSchema set to true.

I found a few discrepancies while working with inferSchema set to true in CSV ingestion.

Given the following CSV:

{{id;authorId;title;releaseDate;link}}
{{1;1;Fantastic Beasts and Where to Find Them: The Original Screenplay;11/18/16;http://amzn.to/2kup94P}}
{{2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP}}
{{3;1;*The Tales of Beedle the Bard, Standard Edition (Harry Potter)*;12/4/08;http://amzn.to/2kYezqr}}
{{4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n}}
{{5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT}}
{{6;2;*Development Tools in 2006: any Room for a 4GL-style Language? }}
{{An independent study by Jean Georges Perrin, IIUG Board Member*;12/28/16;http://amzn.to/2vBxOe1}}
{{7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav}}
{{8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD}}
{{10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA}}
{{11;4;Diderot Encyclopedia: The Complete Illustrations 1762-1777;;http://amzn.to/2i2zo3I}}
{{12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ}}
{{13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW}}
{{14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk}}
{{15;7;Soft Skills: The software developer's life manual;12/29/14;http://amzn.to/2zNnSyn}}
{{16;8;Of Mice and Men;;http://amzn.to/2zJjXoc}}
{{17;9;*Java 8 in Action: Lambdas; Streams; and functional-style programming*;8/28/14;http://amzn.to/2isdqoL}}
{{18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY}}
{{19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG}}
{{20;14;*Fables choisies; mises en vers par M. de La Fontaine*;9/1/1999;http://amzn.to/2yRH10W}}
{{21;15;Discourse on Method and Meditations on First Philosophy;6/15/1999;http://amzn.to/2hwB8zc}}
{{22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo}}
{{23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo}}

And this code:

{{Dataset<Row> df = spark.read().format("csv")}}
{{    .option("header", "true")}}
{{    .option("multiline", true)}}
{{    .option("sep", ";")}}
{{    .option("quote", "*")}}
{{    .option("dateFormat", "M/d/y")}}
{{    .option("inferSchema", true)}}
{{    .load("data/books.csv");}}
{{df.show(7);}}
{{df.printSchema();}}

h1. In Spark v2.0.1

{{Excerpt of the dataframe content:}}
{{+---+--------+--------------------+-----------+--------------------+}}
{{| id|authorId|               title|releaseDate|                link|}}
{{+---+--------+--------------------+-----------+--------------------+}}
{{|  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|}}
{{|  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|}}
{{|  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|}}
{{|  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|}}
{{|  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|}}
{{|  6|       2|Development Tools...|   12/28/16|http://amzn.to/2v...|}}
{{|  7|       3|Adventures of Huc...|    5/26/94|http://amzn.to/2w...|}}
{{+---+--------+--------------------+-----------+--------------------+}}
{{only showing top 7 rows}}
{{Dataframe's schema:}}
{{root}}
{{ |-- id: integer (nullable = true)}}
{{ |-- authorId: integer (nullable = true)}}
{{ |-- title: string (nullable = true)}}
{{ |-- releaseDate: string (nullable = true)}}
{{ |-- link: string (nullable = true)}}

*This is fine and the expected output*.

h1. Using Apache Spark v2.1.3

Excerpt of the dataframe content:
{{+--------------------+--------+--------------------+-----------+--------------------+}}
{{|                  id|authorId|               title|releaseDate|                link|}}
{{+--------------------+--------+--------------------+-----------+--------------------+}}
{{|                   1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|}}
{{|                   2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|}}
{{|                   3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|}}
{{|                   4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|}}
{{|                   5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|}}
{{|                   6|       2|Development Tools...|       null|                null|}}
{{|An independent st...|12/28/16|http://amzn.to/2v...|       null|                null|}}
{{+--------------------+--------+--------------------+-----------+--------------------+}}
{{only showing top 7 rows}}
{{Dataframe's schema:}}
{{root}}
{{ |-- id: string (nullable = true)}}
{{ |-- authorId: string (nullable = true)}}
{{ |-- title: string (nullable = true)}}
{{ |-- releaseDate: string (nullable = true)}}
{{ |-- link: string (nullable = true)}}

The *multiline* option is *not recognized*. As a result, the inferred schema is wrong: id and authorId come out as strings.

h1. Using Apache Spark v2.2.3

Excerpt of the dataframe content:
{{+---+--------+--------------------+-----------+--------------------+}}
{{| id|authorId|               title|releaseDate|               link}}
{{|}}
{{+---+--------+--------------------+-----------+--------------------+}}
{{|  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|}}
{{|  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|}}
{{|  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|}}
{{|  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|}}
{{|  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|}}
{{|  6|       2|Development Tools...|   12/28/16|http://amzn.to/2v...|}}
{{|  7|       3|Adventures of Huc...|    5/26/94|http://amzn.to/2w...|}}
{{+---+--------+--------------------+-----------+--------------------+}}
{{only showing top 7 rows}}
{{Dataframe's schema:}}
{{root}}
{{ |-- id: integer (nullable = true)}}
{{ |-- authorId: integer (nullable = true)}}
{{ |-- title: string (nullable = true)}}
{{ |-- releaseDate: string (nullable = true)}}
{{ |-- link}}
{{: string (nullable = true)}}

The *link* column *has a carriage return* at the end of its name.

If I use:

{{df.show(7, 90);}}

I get:

{{+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+}}
{{| id|authorId|                                                                                     title|releaseDate|                  link}}
{{|}}
{{+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+}}
{{|  1|       1|                          Fantastic Beasts and Where to Find Them: The Original Screenplay|   11/18/16|http://amzn.to/2kup94P}}
{{|}}
{{|  2|       1|     Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)|    10/6/15|http://amzn.to/2l2lSwP}}
{{|}}
{{|  3|       1|                             The Tales of Beedle the Bard, Standard Edition (Harry Potter)|    12/4/08|http://amzn.to/2kYezqr}}
{{|}}
{{|  4|       1|   Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)|    10/4/16|http://amzn.to/2kYhL5n}}
{{|}}
{{|  5|       2|Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; a...|    4/23/17|http://amzn.to/2i3mthT}}
{{|}}
{{|  6|       2|Development Tools in 2006: any Room for a 4GL-style Language? }}
{{An independent study by...|   12/28/16|http://amzn.to/2vBxOe1}}
{{|}}
{{|  7|       3|                                                            Adventures of Huckleberry Finn|    5/26/94|http://amzn.to/2wOeOav}}
{{|}}
{{+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+}}

The *carriage return is added to the last cell* of each row.

Same behavior in v2.3.3 and v2.4.0.

If I add the schema, like in:

{{StructType schema = DataTypes.createStructType(new StructField[] {}}
{{    DataTypes.createStructField(}}
{{        "id",}}
{{        DataTypes.IntegerType,}}
{{        false),}}
{{    DataTypes.createStructField(}}
{{        "authordId",}}
{{        DataTypes.IntegerType,}}
{{        true),}}
{{    DataTypes.createStructField(}}
{{        "bookTitle",}}
{{        DataTypes.StringType,}}
{{        false),}}
{{    DataTypes.createStructField(}}
{{        "releaseDate",}}
{{        DataTypes.DateType,}}
{{        true), // nullable, but this will be ignored}}
{{    DataTypes.createStructField(}}
{{        "url",}}
{{        DataTypes.StringType,}}
{{        false) });}}
{{// Reads a CSV file with header, called books.csv, stores it in a dataframe}}
{{Dataset<Row> df = spark.read().format("csv")}}
{{    .option("header", "true")}}
{{    .option("multiline", true)}}
{{    .option("sep", ";")}}
{{    .option("dateFormat", "M/d/y")}}
{{    .option("quote", "*")}}
{{    .schema(schema)}}
{{    .load("data/books.csv");}}

The output matches what is expected in every version *except version 2.1.3, where Spark simply crashes*.

All the code can be downloaded from GitHub at: [https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch07.]

> Issue with CSV import and inferSchema set to true
> -------------------------------------------------
>
>                 Key: SPARK-26972
>                 URL: https://issues.apache.org/jira/browse/SPARK-26972
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.1.3, 2.3.3, 2.4.0
>         Environment: Java 8/Scala 2.11/MacOs
>            Reporter: Jean Georges Perrin
>            Priority: Major
>         Attachments: ComplexCsvToDataframeApp.java, ComplexCsvToDataframeWithSchemaApp.java, books.csv, issue.txt, pom.xml
>
> Issue with CSV import and inferSchema set to true.
> I found a few discrepancies while working with inferSchema set to true in CSV ingestion.