[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391418#comment-16391418 ]

ASF GitHub Bot commented on ARROW-1974:
---------------------------------------

pitrou commented on issue #447: ARROW-1974: Fix creating Arrow table with 
duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#issuecomment-371527191
 
 
   Unfortunately this doesn't seem sufficient. If I add the following test, I 
get an error and a crash:
   ```diff
   diff --git a/src/parquet/arrow/arrow-reader-writer-test.cc 
b/src/parquet/arrow/arrow-reader-writer-test.cc
   index 72e65d4..eb5a8ec 100644
   --- a/src/parquet/arrow/arrow-reader-writer-test.cc
   +++ b/src/parquet/arrow/arrow-reader-writer-test.cc
   @@ -1669,6 +1669,27 @@ TEST(TestArrowReadWrite, TableWithChunkedColumns) {
      }
    }
    
   +TEST(TestArrowReadWrite, TableWithDuplicateColumns) {
   +  using ::arrow::ArrayFromVector;
   +
   +  auto f0 = field("duplicate", ::arrow::int8());
   +  auto f1 = field("duplicate", ::arrow::int16());
   +  auto schema = ::arrow::schema({f0, f1});
   +
   +  std::vector<int8_t> a0_values = {1, 2, 3};
   +  std::vector<int16_t> a1_values = {14, 15, 16};
   +
   +  std::shared_ptr<Array> a0, a1;
   +
   +  ArrayFromVector<::arrow::Int8Type, int8_t>(a0_values, &a0);
   +  ArrayFromVector<::arrow::Int16Type, int16_t>(a1_values, &a1);
   +
   +  auto table = Table::Make(schema,
   +                           {std::make_shared<Column>(f0->name(), a0),
   +                            std::make_shared<Column>(f1->name(), a1)});
   +  CheckSimpleRoundtrip(table, table->num_rows());
   +}
   +
    TEST(TestArrowWrite, CheckChunkSize) {
      const int num_columns = 2;
      const int num_rows = 128;
   ```
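The hazard this test exercises can be illustrated apart from parquet-cpp: once two columns share a name, any lookup by name can only ever resolve to the first match, so round-trip code has to carry columns by position. A minimal stdlib-only Python sketch of that failure mode (`first_index_of` is an illustrative helper, not an Arrow API):

```python
def first_index_of(names, target):
    """Name-based lookup: returns the position of the FIRST match only."""
    return names.index(target)

# Two columns with the same name, as in the C++ test above.
names = ["duplicate", "duplicate"]

# Both name-based lookups resolve to column 0; column 1 is
# unreachable by name alone.
print(first_index_of(names, "duplicate"))  # 0

# Position-based access distinguishes the two columns correctly.
print([i for i, n in enumerate(names) if n == "duplicate"])  # [0, 1]
```

This is why a fix that maps Arrow columns to Parquet columns by name alone cannot be sufficient for duplicate names.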

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Segfault when working with Arrow tables with duplicate columns
> -----------------------------------------------------------------------
>
>                 Key: ARROW-1974
>                 URL: https://issues.apache.org/jira/browse/ARROW-1974
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.8.0
>         Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>            Reporter: Alexey Strokach
>            Assignee: Phillip Cloud
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet')  # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...), everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.
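The `remove_column(34)` workaround above generalizes: the positions of the extra occurrences of any duplicated name can be found with the standard library alone, then removed one by one. A hedged sketch (`duplicate_column_indices` is a hypothetical helper, not a pyarrow API):

```python
from collections import Counter

def duplicate_column_indices(names):
    """Return indices of all but the first occurrence of each duplicated name."""
    counts = Counter(names)
    seen = set()
    dup_idx = []
    for i, name in enumerate(names):
        if counts[name] > 1:
            if name in seen:
                dup_idx.append(i)  # a repeat: mark for removal
            seen.add(name)
    return dup_idx

# Column names like the reporter's Spark-generated files.
names = ["a", "__index_level_0__", "b", "__index_level_0__"]
print(duplicate_column_indices(names))  # [3]
```

Removing the returned indices (highest first, so earlier indices stay valid) leaves a table with unique column names that converts to pandas without crashing.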



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
