[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney updated ARROW-1974:
--------------------------------
    Summary: [Python] Segfault when writing Arrow table with duplicate columns  (was: [Python] Segfault when working with Arrow tables with duplicate columns)

> [Python] Segfault when writing Arrow table with duplicate columns
> -----------------------------------------------------------------
>
>                 Key: ARROW-1974
>                 URL: https://issues.apache.org/jira/browse/ARROW-1974
>             Project: Apache Arrow
>          Issue Type: Bug
>         Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>            Reporter: Alexey Strokach
>            Assignee: Antoine Pitrou
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: cpp-1.5.0
>
>
> I accidentally created a large number of Parquet files with two
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting
> the resulting tables to Pandas DataFrames or when saving the tables to
> Parquet files.
> {code:none}
> import pyarrow.parquet as pq
>
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet')  # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> # remove_column returns a new table, so the result must be reassigned
> table = table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py`
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
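
The workaround quoted above removes the duplicate column by a hard-coded index (34). As a rough sketch only, and not part of the issue or of any fix in Arrow itself, the indices of repeated column names could be located programmatically instead. The helper name `duplicate_column_indices` and the sample column names are hypothetical; the trailing comments assume the list of names would come from `table.schema.names`.

```python
def duplicate_column_indices(names):
    """Return the indices of columns whose name already appeared earlier."""
    seen = set()
    duplicates = []
    for index, name in enumerate(names):
        if name in seen:
            duplicates.append(index)
        else:
            seen.add(name)
    return duplicates


# Hypothetical column names standing in for table.schema.names
names = ["a", "__index_level_0__", "b", "__index_level_0__"]
print(duplicate_column_indices(names))

# Assumed usage with a pyarrow table (not executed here): remove from the
# highest index down so the remaining indices stay valid, and reassign
# because remove_column returns a new table.
#
#   for i in reversed(duplicate_column_indices(table.schema.names)):
#       table = table.remove_column(i)
```

This keeps the first occurrence of each name and drops the later ones. Whether that is the right column to keep depends on how the files were produced.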