[ 
https://issues.apache.org/jira/browse/BEAM-11393?focusedWorklogId=612228&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-612228
 ]

ASF GitHub Bot logged work on BEAM-11393:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Jun/21 00:13
            Start Date: 19/Jun/21 00:13
    Worklog Time Spent: 10m 
      Work Description: TheNeuralBit commented on pull request #13476:
URL: https://github.com/apache/beam/pull/13476#issuecomment-864327186


   > Is this still relevant?
   
   Yeah I already added back support for grouping by a Series in 
https://github.com/apache/beam/pull/14929
   
   I think this test still won't work actually, but I realized the issue 
exposed by this bug is in aligning a non-unique index. It's not anything 
inherently wrong with grouping by a series. I commented on BEAM-11393 with an 
explanation:
   
   > Note pandas also cannot handle this case when the grouping series has a 
different index. Pandas only works for this example because it can detect they 
have the same index and it recognizes it doesn't need to do anything to align 
them.
   > 
   > With better tracking of partitioning we may be able to do the same sort of 
thing.
   
   Let's close this for now, but I'll keep BEAM-11393 open to track the align 
issue.
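
For readers following along, here is a minimal pandas-only sketch of the behavior described in the quoted note. The frame and series contents are made up for illustration and are not taken from the failing Beam test. It assumes that pandas skips alignment when the grouping Series has the same index as the frame, and otherwise tries to reindex the Series, which raises a ValueError when the Series index contains duplicates. The exact error text varies between pandas versions, so the sketch only records it.

```python
import pandas as pd

# Illustrative data: a frame whose index contains a duplicate label (0).
df = pd.DataFrame(
    {"key": ["a", "b", "a"], "value": [1, 2, 3]},
    index=[0, 0, 1],
)

# Case 1: the grouping Series shares the frame's (non-unique) index.
# pandas sees the indexes are equal and uses the values positionally,
# so no alignment is needed and the groupby succeeds.
same_index_sums = df.groupby(df["key"])["value"].sum()

# Case 2: the grouping Series has a different index that also contains
# duplicates. pandas has to reindex it to the frame's index, which is
# ambiguous for duplicate labels, so it raises a ValueError.
other = pd.Series(["a", "b", "a"], index=[0, 1, 1])
try:
    df.groupby(other)["value"].sum()
    error = None
except ValueError as exc:
    # The message wording differs between pandas versions.
    error = str(exc)

print(same_index_sums.to_dict())
print(error)
```

The second case is the situation the deferred Beam frames are in: without knowing that both inputs are partitioned identically, they cannot take the same-index shortcut that pandas takes in the first case.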


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 612228)
    Time Spent: 0.5h  (was: 20m)

> Support grouping by a Series
> ----------------------------
>
>                 Key: BEAM-11393
>                 URL: https://issues.apache.org/jira/browse/BEAM-11393
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>            Reporter: Brian Hulette
>            Priority: P3
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Grouping by a Series (e.g. \{{df.groupby(df.column)}}, 
> \{{series.groupby(other_series)}}) does not work. The previous implementation 
> relied on aligning the index between the two deferred frames, but it's 
> possible that one or both frames will have duplicate values in their index, 
> leading to the following error at execution time:
> {code}
>     Traceback (most recent call last):
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/doctests.py", line 237, in fix
>         computed = self.compute(to_compute)
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/doctests.py", line 195, in compute_using_session
>         return {
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/doctests.py", line 196, in <dictcomp>
>         name: frame._expr.evaluate_at(session)
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 329, in evaluate_at
>         return self._func(*(session.evaluate(arg) for arg in self._args))
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 329, in <genexpr>
>         return self._func(*(session.evaluate(arg) for arg in self._args))
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 144, in evaluate
>         result = evaluate_with(input_partitioning)
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 114, in evaluate_with
>         results.append(session.evaluate(expr))
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 42, in evaluate
>         self._bindings[expr] = expr.evaluate_at(self)
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 329, in evaluate_at
>         return self._func(*(session.evaluate(arg) for arg in self._args))
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/frames.py", line 149, in set_index
>         df, by = df.align(by, axis=0, join='inner')
>       File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/frame.py", line 3962, in align
>         return super().align(
>       File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/generic.py", line 8559, in align
>         return self._align_series(
>       File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/generic.py", line 8681, in _align_series
>         fdata = fdata.reindex_indexer(join_index, lidx, axis=1)
>       File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1276, in reindex_indexer
>         self.axes[axis]._can_reindex(indexer)
>       File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3289, in _can_reindex
>         raise ValueError("cannot reindex from a duplicate axis")
>     ValueError: cannot reindex from a duplicate axis
> {code}
> Discovered in https://github.com/apache/beam/pull/13401, GHA run: 
> https://github.com/apache/beam/runs/1445605501



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
