No, there is no way to load CSV files with irregular dimensions, and we don't have any plans currently to support them. Sorry :-(
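
One possible workaround outside the CSV reader itself, offered as a rough sketch rather than a supported feature: rewrite the file so that every row has the same number of fields, then load the padded file as a regular CSV. The helper below is hypothetical (the name `pad_rows` and its parameters are not part of pyarrow); it uses only Python's standard `csv` module and streams the input twice instead of holding all lines in memory.

```python
import csv
import io


def pad_rows(src, dst, delimiter=","):
    """Copy CSV rows from src to dst, padding short rows with empty fields.

    Hypothetical helper, not part of pyarrow. It makes two passes over
    src (which must be seekable): one to find the widest row, one to
    write the padded rows. Returns the resulting column count.
    """
    # First pass: find the largest number of fields in any row.
    width = max((len(row) for row in csv.reader(src, delimiter=delimiter)),
                default=0)

    # Second pass: write each row, appending empty fields up to `width`.
    src.seek(0)
    writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
    for row in csv.reader(src, delimiter=delimiter):
        writer.writerow(row + [""] * (width - len(row)))
    return width


# Tiny in-memory demonstration with rows of 3, 2 and 4 fields.
ragged = io.StringIO("a,b,c\nd,e\nf,g,h,i\n")
padded = io.StringIO()
n_columns = pad_rows(ragged, padded)
```

With real files, `src` and `dst` would be file objects opened with `newline=""`. The padded file should then be readable by `csv.read_csv` from pyarrow with `ReadOptions(column_names=...)` listing `n_columns` names, and `ConvertOptions(strings_can_be_null=True)` if the empty cells are meant to become nulls. Whether the extra pass over 14 million rows is acceptable is something only a measurement on the actual data can tell.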

Regards

Antoine.


On 19/11/2019 at 05:54, Micah Kornfield wrote:
> +dev@arrow to see if there is a more definitive answer, but I don't believe
> this type of functionality is supported currently.
>
> On Fri, Nov 15, 2019 at 1:42 AM Elisa Scandellari <
> elisa.scandell...@gmail.com> wrote:
>
>> Hi,
>> I'm trying to improve the performance of my program that loads csv data
>> and manipulates it.
>> My CSV file contains 14 million rows and has a variable amount of columns.
>> The first 27 columns will always be available, and a row can have up to 16
>> more columns for a total of 43.
>>
>> Using vanilla pandas I've found this workaround:
>> ```
>> largest_column_count = 0
>> with open(data_file, 'r') as temp_f:
>>     lines = temp_f.readlines()
>>     for l in lines:
>>         column_count = len(l.split(',')) + 1
>>         largest_column_count = column_count if largest_column_count < column_count else largest_column_count
>> temp_f.close()
>> column_names = [i for i in range(0, largest_column_count)]
>> all_columns_df = pd.read_csv(file, header=None, delimiter=',',
>>                              names=column_names,
>>                              dtype='category').replace(pd.np.nan, '', regex=True)
>> ```
>> This will create the table with all my data plus empty cells where the
>> data is not available.
>> With a smaller file, this works perfectly well. With the complete file, my
>> memory usage goes over the roof.
>>
>> I've been reading about Apache Arrow and, after a few attempts to load a
>> structured csv file (same amount of columns for every row), I'm extremely
>> impressed.
>> I've tried to load my data file, using the same concept as above:
>> ```
>> fixed_column_names = [str(i) for i in range(0, 27)]
>> extra_column_names = [str(i) for i in range(len(fixed_column_names), largest_column_count)]
>> total_columns = fixed_column_names
>> total_columns.extend(extra_column_names)
>> read_options = csv.ReadOptions(column_names=total_columns)
>> convert_options = csv.ConvertOptions(include_columns=total_columns,
>>                                      include_missing_columns=True,
>>                                      strings_can_be_null=True)
>> table = csv.read_csv(edr_filename, read_options=read_options,
>>                      convert_options=convert_options)
>> ```
>> but I get the following error
>> Exception: CSV parse error: Expected 43 columns, got 32
>>
>> I need to use the csv provided by pyarrow, if not I wouldn't be able to
>> create the pyarrow table to then convert to pandas
>> ```from pyarrow import csv```
>>
>> I guess that the csv library provided by pyarrow is more streamlined than
>> the complete one.
>>
>> Is there any way I can load this file? Maybe using some ReadOptions and/or
>> ConvertOptions?
>> I'd be using pandas to manipulate the data after it's been loaded.
>>
>> Thank you in advance