[ https://issues.apache.org/jira/browse/SPARK-42359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684683#comment-17684683 ]
Willi Raschkowski commented on SPARK-42359:
-------------------------------------------
In our experience, such CSV files tend to be Excel exports in which users
populate the rows above the header with descriptions of the data.
To give a real-world example: [here's a dataset made available by the UK
government
(data.gov.uk)|https://www.data.gov.uk/dataset/9003012e-4564-4a6b-b5f0-8765ccb23a03/average-road-fuel-sales-deliveries-and-stock-levels].
The dataset is only available via Excel files that look like this:
!Screenshot 2023-02-06 at 13.23.34.png!
Exporting from Excel for consumption in Spark results in a CSV that looks like
this:
{code}
cat ~/Downloads/20230202_Average_road_fuel_sales_deliveries_and_stock_levels.csv | head -n 15 | cut -c1-150
"Average road fuel deliveries at sampled filling stations: United Kingdom, from
27 January 2020 [note 1][note 2][note 3]",,,,,,,,,,,,,,,,,,,,,,,,,,,,
This worksheet contains one table. Some cells refer to notes which can be found
in the notes worksheet.,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"Freeze panes are turned on. To turn off freeze panes select the 'View' ribbon
then 'Freeze Panes' then 'Unfreeze Panes' or use [Alt,W,F]",,,,,,,,,,,,
Source:
BEIS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Released: 02 February
2023,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Return to
contents,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Units: Volume in
litres,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Date,Weekday,Fuel Type,North East,North West,Yorkshire and The Humber,"East
Midlands","West
Midlands",East,London,South East,South West,Northern
Ireland,Wales,Scotland,"England
[note 3]",United
Kingdom,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
27/01/2020,Monday,Diesel," 10,583 "," 9,422 "," 11,687 "," 11,205 "," 11,353 "," 10,284 "," 7,501 "," 10,023 "," 9,535 "," 8,511 "," 9,961 "," 9,600 "
28/01/2020,Tuesday,Diesel," 11,643 "," 10,440 "," 13,172 "," 11,885 "," 12,943 "," 12,255 "," 7,310 "," 10,106 "," 11,144 "," 7,740 "," 10,306 "," 10,
29/01/2020,Wednesday,Diesel," 10,839 "," 10,021 "," 11,417 "," 12,195 "," 11,370 "," 12,542 "," 8,102 "," 11,235 "," 10,840 "," 6,943 "," 11,532 "," 9
30/01/2020,Thursday,Diesel," 8,808 "," 10,673 "," 11,871 "," 13,469 "," 12,727 "," 12,445 "," 7,708 "," 11,044 "," 9,741 "," 7,456 "," 10,647 "," 10,2
{code}
> Support row skipping when reading CSV files
> -------------------------------------------
>
> Key: SPARK-42359
> URL: https://issues.apache.org/jira/browse/SPARK-42359
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.3.1
> Reporter: Willi Raschkowski
> Priority: Major
> Attachments: Screenshot 2023-02-06 at 13.23.34.png
>
>
> Spark currently can't read CSV files that contain lines with comments or
> annotations above the header and data. Workarounds include pre-processing
> CSVs, or using RDDs and something like {{zipWithIndex}}, but all of these
> increase friction for less technical users.
> This issue proposes a {{skipLines}} option for Spark's CSV parser to drop a
> number of unwanted lines at the top of a CSV file.
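
For illustration only, here is a minimal sketch of the pre-processing
workaround outside Spark, using the Python standard library. It is not Spark's
API and not an implementation of the proposed {{skipLines}} option: the
function name read_csv_skipping and its skip_lines parameter are hypothetical,
and the sample text is a heavily abbreviated stand-in for the export shown in
the comment. It assumes the lines being skipped contain no quoted line breaks,
so that dropping physical lines is safe.

```python
import csv
import io
from itertools import islice


def read_csv_skipping(text_stream, skip_lines):
    """Drop the first `skip_lines` physical lines, then parse the rest as CSV.

    Hypothetical helper for illustration; assumes the skipped lines do not
    contain quoted fields that span several lines.
    """
    # islice discards the preamble lazily and yields the remaining lines.
    remaining = islice(text_stream, skip_lines, None)
    # csv.reader accepts any iterable of lines and still handles quoted
    # fields with embedded newlines, such as the multi-line header cells.
    reader = csv.reader(remaining)
    header = next(reader)
    rows = list(reader)
    return header, rows


# Abbreviated stand-in for the exported file: two description rows,
# a header with a line break inside a quoted cell, and one data row.
sample = (
    'Source: BEIS,,\n'
    'Units: Volume in litres,,\n'
    'Date,Weekday,"East\nMidlands"\n'
    '27/01/2020,Monday," 11,205 "\n'
)

header, rows = read_csv_skipping(io.StringIO(sample), skip_lines=2)
print(header)
print(rows)
```

The caller has to know the number of lines to drop up front, which is the
same information a {{skipLines}} option would take.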