[ https://issues.apache.org/jira/browse/SPARK-33488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-33488.
----------------------------------
    Resolution: Cannot Reproduce

> Re SPARK-21820. Creating Spark dataframe with carriage return/line feed leaves cr in multiline
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-33488
>                 URL: https://issues.apache.org/jira/browse/SPARK-33488
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.5
>         Environment: Apache 2.4.5
>                      Databricks 6.6
>                      Spark-NLP 2.6.3
>            Reporter: Greg Werner
>            Priority: Major
>
> In SPARK-21820 I see what seems to be the same issue reported, but it is
> marked as resolved there. Over the past few days I have battled a dataset
> that occasionally has \r\n at the end of lines, and I still see the errant
> behavior: the \r\n is not removed cleanly.
> In my code, I do
> {code:python}
> # CSV options
> infer_schema = "false"
> first_row_is_header = "true"
> multi_line = "true"
> delimiter = ","
> # The applied options are for CSV files. For other file types, these will be ignored.
> df_train = spark.read.format(train_file_type) \
>     .option("inferSchema", infer_schema) \
>     .option("header", first_row_is_header) \
>     .option("sep", delimiter) \
>     .option("multiLine", multi_line) \
>     .option("escape", '"') \
>     .load(train_file_location)
> {code}
> So I am reading in a CSV file and setting multiLine to true. However,
> wherever \r\n occurs in the training file, the \r is left behind. This
> includes the header, which has a column name ending in \r. The only way I
> have been able to work around this is to manually edit the data file to
> remove the \r, but I do not want to do this on a case-by-case basis.
> Therefore, I am claiming this behavior is still present in 2.4.5 and is a bug.
> I am using version 2.4.5 because I am using Spark-NLP, which to my knowledge
> has not been built for Spark 3 yet, so the version is key for me.
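For readers who hit the same symptom on 2.4.x, one possible way to avoid hand-editing the data file is to strip the stray \r after loading. The following is only a minimal sketch and is not taken from the issue: the helper names `strip_trailing_cr` and `drop_carriage_returns` are hypothetical, and it assumes column names contain no dots or backticks.

```python
def strip_trailing_cr(value):
    """Remove trailing carriage returns from a string; pass other values through."""
    return value.rstrip("\r") if isinstance(value, str) else value


def drop_carriage_returns(df):
    """Return a new Spark DataFrame with trailing \\r removed from column
    names and from the values of string columns (hypothetical helper)."""
    # Imported lazily so strip_trailing_cr stays usable without PySpark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Rename first, so later column references never contain a carriage return.
    renamed = df.toDF(*[strip_trailing_cr(name) for name in df.columns])

    string_cols = {
        field.name
        for field in renamed.schema.fields
        if isinstance(field.dataType, StringType)
    }

    # Strip one or more carriage returns at the end of each string value.
    return renamed.select([
        F.regexp_replace(F.col(name), r"\r+$", "").alias(name)
        if name in string_cols
        else F.col(name)
        for name in renamed.columns
    ])
```

Under those assumptions, calling `drop_carriage_returns(df_train)` right after the `load(...)` call in the report would yield a DataFrame whose header and string cells no longer end in \r. It works around the symptom only and says nothing about the underlying parser behavior.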