[jira] [Commented] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

ASF GitHub Bot (JIRA) Tue, 13 Mar 2018 08:50:21 -0700

    [ https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397136#comment-16397136 ]


ASF GitHub Bot commented on PARQUET-1245:
-----------------------------------------

pitrou commented on a change in pull request #447: PARQUET-1245: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r174183059
 
 

 ##########
 File path: src/parquet/schema.cc
 ##########
 @@ -720,17 +718,15 @@ int SchemaDescriptor::ColumnIndex(const std::string& node_path) const {
   return search->second;
 }
 
-int SchemaDescriptor::ColumnIndex(const Node& node) const {
-  int result = ColumnIndex(node.path()->ToDotString());
-  if (result < 0) {
-    return -1;
-  }
-  DCHECK(result < num_columns());
-  if (!node.Equals(Column(result)->schema_node().get())) {
-    // Same path but not the same node
-    return -1;
+int SchemaDescriptor::ColumnIndex(const Node* node) const {
 
 Review comment:
   By the way, other methods such as `Node::Equals` take a node pointer, even 
though passing a null pointer isn't supported. Should I still convert back to a 
reference?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> [C++] Segfault when writing Arrow table with duplicate columns
> --------------------------------------------------------------
>
>                 Key: PARQUET-1245
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1245
>             Project: Parquet
>          Issue Type: Bug
>         Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>            Reporter: Alexey Strokach
>            Assignee: Antoine Pitrou
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: cpp-1.5.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table = table.remove_column(34)  # remove_column returns a new table
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.
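
A minimal sketch of how such duplicates could be located before calling `remove_column`, assuming the column names are available as a plain list (for example from `table.schema.names`). The helper name `duplicate_column_indices` and the sample names are hypothetical and not part of PyArrow:

```python
from collections import Counter


def duplicate_column_indices(names):
    """Map each repeated column name to the list of positions where it occurs."""
    counts = Counter(names)
    positions = {}
    for index, name in enumerate(names):
        # Only record names that appear more than once.
        if counts[name] > 1:
            positions.setdefault(name, []).append(index)
    return positions


# Hypothetical column list standing in for `table.schema.names`.
names = ["a", "__index_level_0__", "b", "__index_level_0__"]
print(duplicate_column_indices(names))
```

If the extra occurrences are then dropped one at a time, removing the highest index first keeps the remaining positions valid, and each `remove_column` result has to be assigned back because the call returns a new table.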



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
