[ 
https://issues.apache.org/jira/browse/SPARK-33488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33488.
----------------------------------
    Resolution: Cannot Reproduce

> Re SPARK-21820.  Creating Spark dataframe with carriage return/line feed 
> leaves cr in multiline
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-33488
>                 URL: https://issues.apache.org/jira/browse/SPARK-33488
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.5
>         Environment: Apache Spark 2.4.5
> Databricks 6.6
> Spark-NLP 2.6.3
>            Reporter: Greg Werner
>            Priority: Major
>
> In SPARK-21820 I see what seems to be the same issue reported, but it is 
> marked as resolved there.  Over the past few days I have battled a dataset 
> that occasionally has \r\n at the end of lines, and I still see the errant 
> behavior of \r\n not being fully removed.
> In my code, I do 
> {code:python}
> # CSV options
> infer_schema = "false"
> first_row_is_header = "true"
> multi_line = "true"
> delimiter = ","
> # The applied options are for CSV files. For other file types, these will be
> # ignored.
> df_train = spark.read.format(train_file_type) \
>   .option("inferSchema", infer_schema) \
>   .option("header", first_row_is_header) \
>   .option("sep", delimiter) \
>   .option("multiLine", multi_line) \
>   .option("escape", '"') \
>   .load(train_file_location)
> {code}
> So I am reading in a CSV file and setting multiLine to true.  However, in all 
> cases where there are \r\n in the training file, \r is left behind.  This 
> includes the header, which has a column name ending in \r.  The only way I 
> have been able to work around this is to manually edit the data file to 
> remove the \r, but I do not want to do this on a case-by-case basis.
> Therefore, I am claiming this behavior is still present in 2.4.5 and is a bug.
> I am using version 2.4.5 because I am using Spark-NLP, which to my knowledge 
> has not been built to use Spark 3 yet, so the version is key for me.
>  
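
The manual edit described above could be scripted as a pre-processing step. What follows is only a minimal sketch, not a fix for the reported behavior. It assumes the stray \r comes from CRLF line terminators and that no quoted field is meant to keep an embedded \r\n. The function name normalize_line_endings and its arguments are hypothetical and do not come from the report or from Spark.

```python
def normalize_line_endings(src_path, dst_path):
    """Copy src_path to dst_path, rewriting CRLF line endings as LF.

    Returns the number of lines whose ending was rewritten.
    Assumption: every \r\n in the file is a line terminator that is safe
    to rewrite, including any that sit inside quoted multi-line fields.
    """
    replaced = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Iterating a binary file splits on b"\n" only, so a trailing
        # b"\r" stays attached to the line and can be detected here.
        for line in src:
            if line.endswith(b"\r\n"):
                line = line[:-2] + b"\n"
                replaced += 1
            dst.write(line)
    return replaced
```

The rewritten file would then be passed to the reader in place of the original path. The copy is streamed line by line, so the whole file is never held in memory. A lone \r that is not followed by \n is left untouched.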



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
