[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391917#comment-16391917 ]

ASF GitHub Bot commented on ARROW-1974:
---------------------------------------

cpcloud commented on a change in pull request #447: ARROW-1974: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173292620
 
 

 ##########
 File path: src/parquet/schema.h
 ##########
 @@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node {
   bool Equals(const Node* other) const override;
 
   NodePtr field(int i) const { return fields_[i]; }
+  // Get the index of a field by its name, or negative value if not found
+  // If several fields share the same name, the smallest index is returned
 
 Review comment:
   Couple of questions:
   
   * I see [this language regarding the iteration 
order](http://en.cppreference.com/w/cpp/container/unordered_multimap) of the 
values for a particular key in the multimap:
   
   > every group of elements whose keys compare equivalent (compare equal with 
key_eq() as the comparator) forms a contiguous subrange in the iteration order
   
   Does the `iteration order` here mean that the values are iterated over in 
the order in which they were inserted?
   
   * Why did you choose to return the first one instead of returning `-1` (or 
maybe `-2`) for the `std::string` overload? Do we not want to provide a way to 
indicate that column indexes and column names are not 1:1 in the C++ API? Maybe 
that already exists.
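
   For reference, here is a minimal sketch of the lookup behavior under discussion: indexing fields by name in an `std::unordered_multimap` and returning the smallest index when several fields share a name, or `-1` when the name is absent. The class and method names (`FieldTable`, `FieldIndex`) are illustrative assumptions, not parquet-cpp's actual API. Note that because the standard leaves the relative order of equivalent keys unspecified, the sketch takes the minimum explicitly rather than relying on insertion order:

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: map field names to positional indices.
// When names collide, FieldIndex returns the smallest index;
// when a name is absent, it returns -1.
class FieldTable {
 public:
  explicit FieldTable(const std::vector<std::string>& names) {
    for (int i = 0; i < static_cast<int>(names.size()); ++i) {
      index_.emplace(names[i], i);
    }
  }

  int FieldIndex(const std::string& name) const {
    auto range = index_.equal_range(name);
    if (range.first == range.second) return -1;  // name not found
    // The iteration order of equal keys is unspecified by the standard,
    // so scan the whole equal range and take the minimum explicitly.
    int smallest = range.first->second;
    for (auto it = range.first; it != range.second; ++it) {
      smallest = std::min(smallest, it->second);
    }
    return smallest;
  }

 private:
  std::unordered_multimap<std::string, int> index_;
};
```

   With fields `{"a", "b", "a"}`, `FieldIndex("a")` yields `0` and `FieldIndex("c")` yields `-1`, which matches the "smallest index" behavior described in the proposed doc comment without depending on multimap iteration order.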

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Segfault when working with Arrow tables with duplicate columns
> -----------------------------------------------------------------------
>
>                 Key: ARROW-1974
>                 URL: https://issues.apache.org/jira/browse/ARROW-1974
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.8.0
>         Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>            Reporter: Alexey Strokach
>            Assignee: Antoine Pitrou
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet')  # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
