[jira] [Work logged] (BEAM-11532) df.merge with identically-named `on` columns produces duplicate output columns

ASF GitHub Bot (Jira) Wed, 30 Dec 2020 10:20:05 -0800


     [ 
https://issues.apache.org/jira/browse/BEAM-11532?focusedWorklogId=529588&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-529588
 ]


ASF GitHub Bot logged work on BEAM-11532:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 30/Dec/20 18:19
            Start Date: 30/Dec/20 18:19
    Worklog Time Spent: 10m 
      Work Description: TheNeuralBit commented on a change in pull request #13634:
URL: https://github.com/apache/beam/pull/13634#discussion_r550284105



##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -1218,15 +1219,32 @@ def merge(
     merged = frame_base.DeferredFrame.wrap(
         expressions.ComputedExpression(
             'merge',
-            lambda left, right: left.merge(
-                right, left_index=True, right_index=True, **kwargs),
+            lambda left, right: left.merge(right,
+                                           left_index=True,
+                                           right_index=True,
+                                           suffixes=suffixes,
+                                           **kwargs),
             [indexed_left._expr, indexed_right._expr],
             preserves_partition_by=partitionings.Singleton(),
             requires_partition_by=partitionings.Index()))
 
     if left_index or right_index:
       return merged
     else:
+      common_cols = set(left_on).intersection(right_on)
+      if len(common_cols):
+        # When merging on the same column name from both dfs, merged will have
+        # two duplicate columns, one with lsuffix and one with rsuffix.
+        # Normally pandas de-dupes these into a single column with no suffix.
+        # This replicates that logic by dropping the _right_ dupe, and removing
+        # the suffix from the _left_ dupe.
+        lsuffix, rsuffix = suffixes
+        merged = merged.drop(

Review comment:
       In that case, I think it would have been renamed to
{col}{rsuffix}{rsuffix}. This is a good edge case to think about, though;
I'll look at adding some more test cases.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 529588)
    Time Spent: 40m  (was: 0.5h)

> df.merge with identically-named `on` columns produces duplicate output columns
> ------------------------------------------------------------------------------
>
>                 Key: BEAM-11532
>                 URL: https://issues.apache.org/jira/browse/BEAM-11532
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>    Affects Versions: 2.26.0, 2.27.0
>            Reporter: Brian Hulette
>            Assignee: Brian Hulette
>            Priority: P1
>             Fix For: 2.28.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> For example, when joining on 'a' in both df1 and df2:
> {code}
> Failed example:
>     df1.merge(df2, how='inner', on='a')
> Expected:
>     0   foo  1  3
>           a  b  c
> Got:
>     0  foo  1  foo  3
>        a_x  b  a_y  c
> {code}
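
The failing example above can be reproduced in plain pandas, which also makes the de-duplication step described in the diff comment easier to follow. This is a minimal sketch, not the code from the pull request. The frame contents are made up to match the shape of the example, and `set_index(..., drop=False)` is an assumed stand-in for however the indexed frames are built, since that part is not shown in the hunk.

```python
import pandas as pd

# Toy frames shaped like the failing example (values are illustrative).
df1 = pd.DataFrame({'a': ['foo'], 'b': [1]})
df2 = pd.DataFrame({'a': ['foo'], 'c': [3]})

# Plain pandas: joining on a column present in both frames keeps a single 'a'.
expected = df1.merge(df2, how='inner', on='a')

# Index-based merge: 'a' is still a regular column on both sides, so pandas
# treats it as an overlapping column and applies the suffixes.
# Assumption: set_index(drop=False) approximates how the indexed frames are made.
suffixes = ('_x', '_y')
lsuffix, rsuffix = suffixes
indexed_left = df1.set_index('a', drop=False)
indexed_right = df2.set_index('a', drop=False)
merged = indexed_left.merge(
    indexed_right, left_index=True, right_index=True, suffixes=suffixes)
duplicated_columns = list(merged.columns)

# De-duplicate as the diff comment describes: drop the right-hand copy and
# strip the suffix from the left-hand copy.
common_cols = ['a']
deduped = (
    merged
    .drop(columns=[col + rsuffix for col in common_cols])
    .rename(columns={col + lsuffix: col for col in common_cols})
    .reset_index(drop=True))
```

Here `duplicated_columns` holds the suffixed pair seen in the "Got" output, while `deduped` has the same columns as `expected`. The sketch does not handle the edge case raised in the review comment, where a suffixed name may clash with a column that already exists.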



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
