[
https://issues.apache.org/jira/browse/ARROW-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Earle Lyons updated ARROW-16810:
--------------------------------
Summary: [Python] PyArrow: write_dataset - Could not open CSV input source
(was: PyArrow: write_dataset - Could not open CSV input source)
> [Python] PyArrow: write_dataset - Could not open CSV input source
> -----------------------------------------------------------------
>
> Key: ARROW-16810
> URL: https://issues.apache.org/jira/browse/ARROW-16810
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 8.0.0
> Environment: Pop!_OS 20.04 LTS & Conda 4.11.0 / Mamba 0.23.0
> Reporter: Earle Lyons
> Priority: Minor
>
> Hi Arrow Community!
> Happy Friday! I am a new user to Arrow, specifically using pyarrow. However,
> I am very excited about the project.
> I am experiencing an issue with the {*}write_dataset{*} function from the
> {*}dataset{*} module. Please forgive me if this is a known issue; however, I
> have searched the GitHub issues as well as Stack Overflow and have not found
> a similar report.
> I have a directory that contains 90 CSV files (essentially one CSV for each
> day between 2021-01-01 and 2021-03-31). My objective was to read all the CSV
> files into a dataset and write the dataset to a single Parquet file format.
> Unfortunately, some of the CSV files contained nulls in some columns, which
> caused type-inference problems; these were resolved by specifying DataTypes
> per the following Stack Overflow solution:
> [How do I specify a dtype for all columns when reading a CSV file with
> pyarrow?|https://stackoverflow.com/questions/71533197/how-do-i-specify-a-dtype-for-all-columns-when-reading-a-csv-file-with-pyarrow]
> The following code works on the first pass.
> {code:python}
> import pyarrow as pa
> import pyarrow.csv as csv
> import pyarrow.dataset as ds
> import re
> {code}
> {code:python}
> pa.__version__
> '8.0.0'
> {code}
> {code:python}
> column_types = {}
> csv_path = '/home/user/csv_files'
> field_re_pattern = "value_*"
> # Open a dataset with the 'csv_path' path and 'csv' file format
> # and assign to 'dataset1'
> dataset1 = ds.dataset(csv_path, format='csv')
> # Loop through each field in the 'dataset1' schema,
> # match the 'field_re_pattern' regex pattern in the field name,
> # and assign 'int64' DataType to the field.name in the 'column_types'
> # dictionary
> for field in (field for field in dataset1.schema
>               if re.match(field_re_pattern, field.name)):
>     column_types[field.name] = pa.int64()
> # Creates options for CSV data using the 'column_types' dictionary
> # This returns a <class 'pyarrow._csv.ConvertOptions'>
> convert_options = csv.ConvertOptions(column_types=column_types)
> # Creates FileFormat for CSV using the 'convert_options'
> # This returns a <class 'pyarrow._dataset.CsvFileFormat'>
> custom_csv_format = ds.CsvFileFormat(convert_options=convert_options)
> # Open a dataset with the 'csv_path' path; instead of the plain
> # 'csv' file format, use the 'custom_csv_format', and assign to
> # 'dataset2'
> dataset2 = ds.dataset(csv_path, format=custom_csv_format)
> # Write the 'dataset2' to the 'csv_path' base directory in the
> # 'parquet' format, and overwrite/ignore if the file exists
> ds.write_dataset(dataset2, base_dir=csv_path, format='parquet',
> existing_data_behavior='overwrite_or_ignore')
> {code}
> As previously stated, on the first pass the code works and creates a single
> Parquet file (part-0.parquet) with the correct data, row count, and schema.
> However, if the code is run again, the following error is encountered:
> {code:python}
> ArrowInvalid: Could not open CSV input source
> '/home/user/csv_files/part-0.parquet': Invalid: CSV parse error: Row #2:
> Expected 4 columns, got 1: 6NQJRJV02XW$0Y8V p A$A18CEBS
> 305DEM030TTW �5HZ50GCVJV1CSV
> {code}
> My interpretation of the error is that on the second pass the 'dataset2'
> dataset now includes the 'part-0.parquet' file (which can be confirmed by
> the 'dataset2.files' output showing the file) and the CSV reader is
> attempting to parse the Parquet file.
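> One workaround that occurred to me is to pass an explicit list of CSV paths
> to 'dataset' instead of the directory, so a leftover 'part-0.parquet' is
> never scanned. A minimal sketch (using a temporary directory as a stand-in
> for my real '/home/user/csv_files'):
> {code:python}
> import glob
> import os
> import tempfile
>
> import pyarrow.dataset as ds
>
> # Stand-in for the CSV directory: two CSV files plus a leftover
> # 'part-0.parquet' from an earlier run
> csv_path = tempfile.mkdtemp()
> for name in ('day1.csv', 'day2.csv'):
>     with open(os.path.join(csv_path, name), 'w') as f:
>         f.write('value_a,value_b\n1,2\n')
> with open(os.path.join(csv_path, 'part-0.parquet'), 'wb') as f:
>     f.write(b'\x00not-a-csv')
>
> # Select only the CSV files, so the CSV reader never sees the Parquet file
> csv_files = sorted(glob.glob(os.path.join(csv_path, '*.csv')))
> dataset2 = ds.dataset(csv_files, format='csv')
> print(dataset2.to_table().num_rows)  # 2
> {code}
> (There also appears to be an 'exclude_invalid_files' argument on
> 'ds.dataset', though I have not tested whether it skips the Parquet file in
> this situation.)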
> If this is the case, is there an argument to ignore the Parquet file and
> only evaluate the CSV files? Also, if a dataset object has a 'csv' or
> 'pyarrow._dataset.CsvFileFormat' format associated with it, it would be nice
> to evaluate only CSV files rather than all file types in the path, if that
> is not already the current behavior.
> If this is not the case, any ideas on the cause or solution?
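> A second workaround I can sketch (using a separate output directory is my
> own assumption, not something from the docs) is to write the Parquet output
> outside the CSV source directory, so a re-run never re-scans its own output:
> {code:python}
> import os
> import tempfile
>
> import pyarrow.dataset as ds
>
> # Stand-ins for the CSV source and Parquet output directories
> csv_path = tempfile.mkdtemp()
> parquet_path = tempfile.mkdtemp()
>
> with open(os.path.join(csv_path, 'day1.csv'), 'w') as f:
>     f.write('value_a,value_b\n1,2\n')
>
> dataset2 = ds.dataset(csv_path, format='csv')
> # Writing to 'parquet_path' leaves 'csv_path' containing only CSV files,
> # so running this script a second time parses only CSVs
> ds.write_dataset(dataset2, base_dir=parquet_path, format='parquet',
>                  existing_data_behavior='overwrite_or_ignore')
> print(sorted(os.listdir(parquet_path)))  # ['part-0.parquet']
> {code}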
> Any assistance would be greatly appreciated.
> Thank you and have a great day!
--
This message was sent by Atlassian Jira
(v8.20.7#820007)