ASF GitHub Bot updated ARROW-6481:
----------------------------------
    Labels: pull-request-available  (was: )

> [Python][C++] Bad performance of read_csv() with column_types
> -------------------------------------------------------------
>
>                 Key: ARROW-6481
>                 URL: https://issues.apache.org/jira/browse/ARROW-6481
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.14.1
>        Environment: ubuntu xenial, python2.7
>            Reporter: Bogdan Klichuk
>            Assignee: Antoine Pitrou
>            Priority: Major
>             Labels: pull-request-available
>        Attachments: 20k_cols.csv
>
> Case: a dataset with 20k columns; the number of rows can be 0.
> {{pyarrow.csv.read_csv('20k_cols.csv')}} works fine if no convert_options are provided: it takes about 150 ms.
> Now I call {{read_csv()}} with a column-type mapping that marks 2000 of these columns as string:
> {{pyarrow.csv.read_csv('20k_cols.csv', convert_options=pyarrow.csv.ConvertOptions(column_types=\{'K%d' % i: pyarrow.string() for i in range(2000)}))}}
> (K1..K19999 are the column names in the attached dataset.)
> My overall goal is to read everything as string and avoid any type inference.
> This call takes several minutes and consumes around 4 GB of memory, which does not look sane at all.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)