[
https://issues.apache.org/jira/browse/BEAM-11393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243445#comment-17243445
]
Brian Hulette edited comment on BEAM-11393 at 12/3/20, 6:51 PM:
----------------------------------------------------------------
I think it's impossible to handle this case, as it's order-sensitive when there
are duplicate values, and we don't know ahead of time if there will be. The
only case we might add logic to handle (without adding support for preserving
order) is when grouping a dataframe by one of its series. Or, more generally,
we might detect that two frames have the same index.
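To illustrate the ambiguity, here is a minimal pure-Python sketch. It is not
Beam or pandas code: the row representation and the helper names
(group_positionally, candidates_by_label, group_by_own_column) are made up for
illustration. It assumes rows are (index label, value) pairs and that a
distributed runner cannot rely on row order.

```python
from collections import defaultdict

# Two "frames" as lists of (index_label, value) rows.
# The label "a" is duplicated in both.
values = [("a", 1), ("a", 2), ("b", 3)]
keys = [("a", "x"), ("a", "y"), ("b", "z")]


def group_positionally(values, keys):
    """Pair rows by position. This only works if row order is preserved."""
    groups = defaultdict(list)
    for (_, value), (_, key) in zip(values, keys):
        groups[key].append(value)
    return dict(groups)


def candidates_by_label(values, keys):
    """Pair rows by index label only, as an order-free join has to.

    For each value, return every key that shares its index label.
    """
    keys_for_label = defaultdict(list)
    for label, key in keys:
        keys_for_label[label].append(key)
    return [(value, sorted(keys_for_label[label])) for label, value in values]


def group_by_own_column(rows, key_col, value_col):
    """Group a frame by one of its own columns: no alignment is needed,
    because the key travels with each row."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_col]].append(row[value_col])
    return {key: sorted(vals) for key, vals in groups.items()}


# With order preserved, the pairing is unambiguous.
print(group_positionally(values, keys))

# Without order, values 1 and 2 could each belong to "x" or "y".
print(candidates_by_label(values, keys))

# Grouping by a column of the same frame gives the same result in any order.
rows = [
    {"index": "a", "v": 1, "k": "x"},
    {"index": "a", "v": 2, "k": "y"},
    {"index": "b", "v": 3, "k": "z"},
]
print(group_by_own_column(rows, "k", "v"))
print(group_by_own_column(list(reversed(rows)), "k", "v"))
```

The last helper corresponds to the one case I think we could support: when the
grouping key comes from the same frame, each row already carries its key, so
neither ordering nor index alignment matters.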
was (Author: bhulette):
I think it's impossible to handle this case, as it's order-sensitive when there
are duplicate values, and we don't know ahead of time if there will be. The
only case we might add logic to handle (without adding support for preserving
order) is when grouping a dataframe by one of its series.
> Support grouping by a Series
> ----------------------------
>
> Key: BEAM-11393
> URL: https://issues.apache.org/jira/browse/BEAM-11393
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-core
> Reporter: Brian Hulette
> Priority: P2
>
> Grouping by a Series (e.g. {{df.groupby(df.column)}},
> {{series.groupby(other_series)}}) does not work. The previous implementation
> relied on aligning the index between the two deferred frames, but it's
> possible that one or both frames will have duplicate values in their index,
> leading to the following error at execution time:
> {code}
> Traceback (most recent call last):
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/doctests.py", line 237, in fix
>     computed = self.compute(to_compute)
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/doctests.py", line 195, in compute_using_session
>     return {
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/doctests.py", line 196, in <dictcomp>
>     name: frame._expr.evaluate_at(session)
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 329, in evaluate_at
>     return self._func(*(session.evaluate(arg) for arg in self._args))
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 329, in <genexpr>
>     return self._func(*(session.evaluate(arg) for arg in self._args))
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 144, in evaluate
>     result = evaluate_with(input_partitioning)
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 114, in evaluate_with
>     results.append(session.evaluate(expr))
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 42, in evaluate
>     self._bindings[expr] = expr.evaluate_at(self)
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 329, in evaluate_at
>     return self._func(*(session.evaluate(arg) for arg in self._args))
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/frames.py", line 149, in set_index
>     df, by = df.align(by, axis=0, join='inner')
>   File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/frame.py", line 3962, in align
>     return super().align(
>   File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/generic.py", line 8559, in align
>     return self._align_series(
>   File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/generic.py", line 8681, in _align_series
>     fdata = fdata.reindex_indexer(join_index, lidx, axis=1)
>   File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1276, in reindex_indexer
>     self.axes[axis]._can_reindex(indexer)
>   File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3289, in _can_reindex
>     raise ValueError("cannot reindex from a duplicate axis")
> ValueError: cannot reindex from a duplicate axis
> {code}
> Discovered in https://github.com/apache/beam/pull/13401, GHA run:
> https://github.com/apache/beam/runs/1445605501