Earle Lyons created ARROW-16810:
-----------------------------------
Summary: PyArrow: write_dataset - Could not open CSV input source
Key: ARROW-16810
URL: https://issues.apache.org/jira/browse/ARROW-16810
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 8.0.0
Environment: Pop!_OS 20.04 LTS OS & Conda 4.11.0 / Mamba 0.23.0 Environment
Reporter: Earle Lyons
Hi Arrow Community!
Happy Friday! I am new to Arrow, specifically using pyarrow, and I am very
excited about the project.
I am experiencing an issue with the {*}write_dataset{*} function from the
{*}dataset{*} module. Please forgive me if this is a known issue; I have
searched the GitHub 'Issues', as well as Stack Overflow, and have not found a
similar report.
I have a directory that contains 90 CSV files (essentially one CSV for each day
between 2021-01-01 and 2021-03-31). My objective was to read all the CSV files
into a dataset and write the dataset to a single Parquet file format.
Unfortunately, some of the CSV files contained nulls in some columns, which
caused type-inference issues. These were resolved by specifying DataTypes,
following this Stack Overflow solution:
[How do I specify a dtype for all columns when reading a CSV file with
pyarrow?|https://stackoverflow.com/questions/71533197/how-do-i-specify-a-dtype-for-all-columns-when-reading-a-csv-file-with-pyarrow]
The following code works on the first pass.
{code:python}
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds
import re
{code}
{code:python}
pa.__version__
'8.0.0'
{code}
{code:python}
column_types = {}
csv_path = '/home/user/csv_files'
field_re_pattern = "value_*"
# Open a dataset with the 'csv_path' path and 'csv' file format
# and assign to 'dataset1'
dataset1 = ds.dataset(csv_path, format='csv')
# Loop through each field in the 'dataset1' schema,
# match the 'field_re_pattern' regex pattern in the field name,
# and assign 'int64' DataType to the field.name in the 'column_types'
# dictionary
for field in (field for field in dataset1.schema
              if re.match(field_re_pattern, field.name)):
column_types[field.name] = pa.int64()
# Creates options for CSV data using the 'column_types' dictionary
# This returns a <class 'pyarrow._csv.ConvertOptions'>
convert_options = csv.ConvertOptions(column_types=column_types)
# Creates FileFormat for CSV using the 'convert_options'
# This returns a <class 'pyarrow._dataset.CsvFileFormat'>
custom_csv_format = ds.CsvFileFormat(convert_options=convert_options)
# Open a dataset with the 'csv_path' path; instead of using the
# 'csv' file format, use the 'custom_csv_format' and assign to
# 'dataset2'
dataset2 = ds.dataset(csv_path, format=custom_csv_format)
# Write the 'dataset2' to the 'csv_path' base directory in the
# 'parquet' format, and overwrite/ignore if the file exists
ds.write_dataset(dataset2, base_dir=csv_path, format='parquet',
existing_data_behavior='overwrite_or_ignore')
{code}
As previously stated, on first pass, the code works and creates a single
parquet file (part-0.parquet) with the correct data, row count, and schema.
However, if the code is run again, the following error is encountered:
{code:python}
ArrowInvalid: Could not open CSV input source
'/home/user/csv_files/part-0.parquet': Invalid: CSV parse error: Row #2:
Expected 4 columns, got 1: 6NQJRJV02XW$0Y8V p A$A18CEBS
305DEM030TTW �5HZ50GCVJV1CSV
{code}
My interpretation of the error is that, on the second pass, 'dataset2' now
includes the 'part-0.parquet' file (which can be confirmed by the
'dataset2.files' output showing the file), and the CSV reader is attempting to
parse the Parquet file.
If this is the case, is there an argument to ignore the Parquet file and
evaluate only the CSV files? Also, if a dataset object has a format of 'csv'
or a 'pyarrow._dataset.CsvFileFormat' associated with it, it would be nice for
it to evaluate only CSV files rather than all file types in the path, if that
is not already the current behavior.
If this is not the case, any ideas on the cause or solution?
Any assistance would be greatly appreciated.
Thank you and have a great day!
--
This message was sent by Atlassian Jira
(v8.20.7#820007)