[ https://issues.apache.org/jira/browse/SPARK-33488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
GREGORY WERNER updated SPARK-33488:
-----------------------------------
    Description:
In SPARK-21820 I see what seems to be the same issue reported, but it is marked as resolved there. Over the past few days I have battled a dataset that occasionally has \r\n at the end of lines, and I do see this errant behavior: the \r is not removed.

In my code, I do
{code:python}
# CSV options
infer_schema = "false"
first_row_is_header = "true"
multi_line = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df_train = spark.read.format(train_file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .option("multiLine", multi_line) \
    .option("escape", '"') \
    .load(train_file_location)
{code}
So I am reading in a CSV file and setting multiLine to true. However, in all cases where there is a \r\n in the training file, the \r is left behind. This includes the header, which has a column name ending in \r. The only way I have been able to work around this is to manually edit the data file to remove the \r, but I do not want to do this on a case-by-case basis.

Therefore, I am claiming this behavior is still present in 2.4.5 and is a bug.

> Re SPARK-21820. Creating Spark dataframe with carriage return/line feed
> leaves cr in multiline
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-33488
>                 URL: https://issues.apache.org/jira/browse/SPARK-33488
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.5
>         Environment: Apache 2.4.5
>                      Databricks 6.6
>            Reporter: GREGORY WERNER
>            Priority: Major
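
The report says the only workaround found so far is hand-editing each data file. That step could be automated by normalizing the line endings in a copy of the file before it is passed to spark.read. What follows is a minimal sketch under stated assumptions, not a fix for the reader behavior itself. The helper name `normalize_line_endings` and its parameters are hypothetical and use only the Python standard library. It assumes the file is reachable through an ordinary filesystem path and uses an ASCII-compatible encoding such as UTF-8. It also rewrites any \r\n that sits inside a quoted multi-line field, which may or may not be acceptable for a given dataset.

```python
def normalize_line_endings(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path, replacing every CRLF pair with LF.

    Hypothetical helper: returns the number of CRLF pairs replaced.
    The file is processed in binary chunks so it need not fit in memory.
    """
    replaced = 0
    carry = b""  # a trailing "\r" held back until the next chunk is seen
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            data = carry + chunk
            # A "\r" at the very end may be the first half of a CRLF pair
            # that is split across two chunks, so defer it.
            if data.endswith(b"\r"):
                data, carry = data[:-1], b"\r"
            else:
                carry = b""
            replaced += data.count(b"\r\n")
            dst.write(data.replace(b"\r\n", b"\n"))
        # A lone "\r" at end of file is not part of a CRLF pair; keep it.
        dst.write(carry)
    return replaced
```

The normalized copy would then be loaded with the same reader options as in the report, with train_file_location pointing at the new file. Writing to a separate destination leaves the original data untouched, so the two files can be compared if the reader's behavior needs to be demonstrated later.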