[jira] [Updated] (SPARK-33488) Re SPARK-21820. Creating Spark dataframe with carriage return/line feed leaves cr in multiline

GREGORY WERNER (Jira) Thu, 19 Nov 2020 04:09:08 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-33488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


GREGORY WERNER updated SPARK-33488:
-----------------------------------
    Description: 
In SPARK-21820 I see what seems to be the same issue reported, but it is 
marked as resolved there.  Over the past few days I have battled a dataset 
that occasionally has \r\n at the end of lines, and I still see the errant 
behavior: the \r of the \r\n is not removed.

In my code, I do 
{code:python}
# CSV options
infer_schema = "false"
first_row_is_header = "true"
multi_line = "true"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be
# ignored.
df_train = spark.read.format(train_file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .option("multiLine", multi_line) \
  .option("escape", '"') \
  .load(train_file_location)
{code}
So I am reading in a CSV file and setting multiLine to true.  However, in all 
cases where there is a \r\n in the training file, the \r is left behind.  This 
includes the header, which has a column name ending in \r.  The only way I have 
been able to work around this is to manually edit the data file to remove the 
\r, but I do not want to do this on a case-by-case basis.
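
A possible programmatic alternative to editing the file by hand is to clean up the DataFrame after it has been read. The sketch below is only an illustration under stated assumptions, not a fix for the reader itself: the helper names strip_trailing_cr and drop_stray_cr are hypothetical, column names are assumed to contain no backticks and to stay unique after cleaning, and only string columns are touched (with inferSchema set to "false" that is every column).

```python
def strip_trailing_cr(name):
    """Return a column name without any trailing carriage returns."""
    return name.rstrip("\r")


def drop_stray_cr(df):
    """Hypothetical helper: drop a trailing CR from column names and string cells.

    Assumes df is a PySpark DataFrame whose column names contain no backticks
    and remain unique once the trailing CR is removed.
    """
    # Imported lazily so strip_trailing_cr stays usable without PySpark.
    from pyspark.sql import functions as F

    # Rename first, so the header column that ends in CR gets a clean name.
    renamed = df.toDF(*[strip_trailing_cr(c) for c in df.columns])

    cleaned = []
    for name, dtype in renamed.dtypes:
        quoted = "`" + name + "`"  # backticks protect names containing dots
        if dtype == "string":
            # Remove one CR only when it is the last character of the value;
            # CRs embedded earlier in a multi-line cell are left alone.
            cleaned.append(F.regexp_replace(F.col(quoted), "\r$", "").alias(name))
        else:
            cleaned.append(F.col(quoted))
    return renamed.select(*cleaned)
```

With the read shown above, this would be applied as df_train = drop_stray_cr(df_train). It hides the symptom after the fact; whether the CSV reader should strip the \r itself is the question this issue raises.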

Therefore, I am claiming this behavior is still present in 2.4.5 and is a bug.

 

  was:
In SPARK-21820 I see what seems to be the same issue reported, but marked as 
resolved there.  Over the past few days I have battled a dataset that 
occasionally has \r\n at the end of lines and I claim I do see this errant 
behavior of not removing \r\n.

In my code, I do 
{code:python}
# CSV options
infer_schema = "false"
first_row_is_header = "true"
multi_line = "true"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be
# ignored.
df_train = spark.read.format(train_file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .option("multiLine", multi_line) \
  .option("escape", '"') \
  .load(train_file_location)
{code}
So I am reading in a csv file and setting multi_line to true.  However, all 
cases where there are \r\n in the training_file, \r is left behind.  This 
includes the header which has column ending in \r.  The only way I have been 
able to workaround this is to manually edit the data file to remove the \r, but 
I do not want to do this on a case to case basis.

Therefore, I am claiming this behavior is still present in 2.4.5 and is a bug.

 


> Re SPARK-21820.  Creating Spark dataframe with carriage return/line feed 
> leaves cr in multiline
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-33488
>                 URL: https://issues.apache.org/jira/browse/SPARK-33488
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.5
>         Environment: Apache 2.4.5
> Databricks 6.6
>            Reporter: GREGORY WERNER
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)