[
https://issues.apache.org/jira/browse/BEAM-11393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kenneth Knowles updated BEAM-11393:
-----------------------------------
This Jira ticket has a pull request attached to it, but is still open. Did the
pull request resolve the issue? If so, could you please mark it resolved? This
will help the project have a clear view of its open issues.
> Support grouping by a Series
> ----------------------------
>
> Key: BEAM-11393
> URL: https://issues.apache.org/jira/browse/BEAM-11393
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-core
> Reporter: Brian Hulette
> Priority: P3
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Grouping by a Series (e.g. {{df.groupby(df.column)}},
> {{series.groupby(other_series)}}) does not work. The previous implementation
> relied on aligning the index between the two deferred frames, but one or
> both frames may have duplicate values in their index, which leads to the
> following error at execution time:
> {code}
> Traceback (most recent call last):
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/doctests.py", line 237, in fix
>     computed = self.compute(to_compute)
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/doctests.py", line 195, in compute_using_session
>     return {
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/doctests.py", line 196, in <dictcomp>
>     name: frame._expr.evaluate_at(session)
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 329, in evaluate_at
>     return self._func(*(session.evaluate(arg) for arg in self._args))
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 329, in <genexpr>
>     return self._func(*(session.evaluate(arg) for arg in self._args))
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 144, in evaluate
>     result = evaluate_with(input_partitioning)
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 114, in evaluate_with
>     results.append(session.evaluate(expr))
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 42, in evaluate
>     self._bindings[expr] = expr.evaluate_at(self)
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 329, in evaluate_at
>     return self._func(*(session.evaluate(arg) for arg in self._args))
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/frames.py", line 149, in set_index
>     df, by = df.align(by, axis=0, join='inner')
>   File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/frame.py", line 3962, in align
>     return super().align(
>   File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/generic.py", line 8559, in align
>     return self._align_series(
>   File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/generic.py", line 8681, in _align_series
>     fdata = fdata.reindex_indexer(join_index, lidx, axis=1)
>   File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1276, in reindex_indexer
>     self.axes[axis]._can_reindex(indexer)
>   File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3289, in _can_reindex
>     raise ValueError("cannot reindex from a duplicate axis")
> ValueError: cannot reindex from a duplicate axis
> {code}
> Discovered in https://github.com/apache/beam/pull/13401, GHA run:
> https://github.com/apache/beam/runs/1445605501
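
To illustrate why aligning on the index cannot work once labels repeat, here is a minimal plain-Python sketch. It is not Beam's or pandas' code: the helper `inner_join_on_label` and the sample rows are hypothetical, and the sketch only models an inner join on index labels. With a duplicated label, the join pairs every matching row on one side with every matching row on the other, so there is no one-to-one mapping from a row to its grouping key.

```python
from collections import defaultdict


def inner_join_on_label(left, right):
    """Pair every left row with every right row that shares its index label.

    Hypothetical helper for illustration only; rows are (label, value) tuples.
    """
    right_by_label = defaultdict(list)
    for label, value in right:
        right_by_label[label].append(value)
    return [
        (label, left_value, right_value)
        for label, left_value in left
        for right_value in right_by_label.get(label, [])
    ]


# Hypothetical data: index label 0 appears twice on both sides.
frame_rows = [(0, "a"), (0, "b"), (1, "c")]
group_keys = [(0, "x"), (0, "y"), (1, "x")]

joined = inner_join_on_label(frame_rows, group_keys)

# Three input rows become five joined rows, so a row's grouping key
# cannot be recovered from its index label alone.
print(len(frame_rows), len(joined))
for row in joined:
    print(row)
```

Under these assumptions, rows "a" and "b" each match both "x" and "y", which is the ambiguity that label-based alignment runs into when the index has duplicates.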