[jira] [Work logged] (BEAM-9547) Implement all pandas operations (or raise WontImplementError)

ASF GitHub Bot (Jira) Wed, 21 Oct 2020 17:23:49 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-9547?focusedWorklogId=503454&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-503454
 ]


ASF GitHub Bot logged work on BEAM-9547:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 22/Oct/20 00:22
            Start Date: 22/Oct/20 00:22
    Worklog Time Spent: 10m 
      Work Description: robertwb commented on a change in pull request #13122:
URL: https://github.com/apache/beam/pull/13122#discussion_r507933364



##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -32,9 +32,74 @@ def __array__(self, dtype=None):
     raise frame_base.WontImplementError(
         'Conversion to a non-deferred a numpy array.')
 
+  get = frame_base.not_implemented_method('get')
+
 
 @frame_base.DeferredFrame._register_for(pd.Series)
 class DeferredSeries(DeferredDataFrameOrSeries):
+  def __getitem__(self, key):
+    if _is_null_slice(key) or key is Ellipsis:
+      return self
+
+    elif (isinstance(key, int) or _is_integer_slice(key)
+          ) and self._expr.proxy().index._should_fallback_to_positional():
+      raise frame_base.WontImplementError('order sensitive')
+
+    elif isinstance(key, slice) or callable(key):
+      return frame_base.DeferredFrame.wrap(
+          expressions.ComputedExpression(
+              # yapf: disable
+              'getitem',
+              lambda df: df[key],
+              [self._expr],
+              requires_partition_by=partitionings.Nothing(),
+              preserves_partition_by=partitionings.Singleton()))
+
+    elif isinstance(key, DeferredSeries):
+      return frame_base.DeferredFrame.wrap(
+          expressions.ComputedExpression(
+              # yapf: disable
+              'getitem',
+              lambda df, indexer: df[indexer],
+              [self._expr, key._expr],
+              requires_partition_by=partitionings.Index(),
+              preserves_partition_by=partitionings.Singleton()))
+
+    elif pd.core.series.is_iterator(key) or pd.core.common.is_bool_indexer(key):
+      raise frame_base.WontImplementError('order sensitive')
+
+    else:
+      # We could consider returning a deferred scalar, but that might
+      # be more surprising than a clear error.
+      raise frame_base.WontImplementError('non-deferred')
+
+    if isinstance(key, frame_base.DeferredBase):
+      # Fail early if key is a DeferredBase as it interacts surprisingly with
+      # key in self._expr.proxy().columns
+      raise NotImplementedError(
+          "Indexing with a deferred frame is not yet supported. Consider "
+          "using df.loc[...]")
+
+    if isinstance(key, slice):
+      types = set([type(key.start), type(key.stop), type(key.step)])
+      if types == {type(None)}:
+        # Empty slice is just a copy.
+        return frame_base.DeferredFrame.wrap(self._expr)
+      elif types in [{int}, {type(None), int}]:

Review comment:
       Ah, yes, I meant to go back and change this. Thanks. 

##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -32,9 +32,74 @@ def __array__(self, dtype=None):
     raise frame_base.WontImplementError(
         'Conversion to a non-deferred a numpy array.')
 
+  get = frame_base.not_implemented_method('get')
+
 
 @frame_base.DeferredFrame._register_for(pd.Series)
 class DeferredSeries(DeferredDataFrameOrSeries):
+  def __getitem__(self, key):
+    if _is_null_slice(key) or key is Ellipsis:
+      return self
+
+    elif (isinstance(key, int) or _is_integer_slice(key)
+          ) and self._expr.proxy().index._should_fallback_to_positional():
+      raise frame_base.WontImplementError('order sensitive')
+
+    elif isinstance(key, slice) or callable(key):
+      return frame_base.DeferredFrame.wrap(
+          expressions.ComputedExpression(
+              # yapf: disable
+              'getitem',
+              lambda df: df[key],
+              [self._expr],
+              requires_partition_by=partitionings.Nothing(),
+              preserves_partition_by=partitionings.Singleton()))
+
+    elif isinstance(key, DeferredSeries):
+      return frame_base.DeferredFrame.wrap(
+          expressions.ComputedExpression(
+              # yapf: disable
+              'getitem',
+              lambda df, indexer: df[indexer],
+              [self._expr, key._expr],
+              requires_partition_by=partitionings.Index(),
+              preserves_partition_by=partitionings.Singleton()))
+
+    elif pd.core.series.is_iterator(key) or pd.core.common.is_bool_indexer(key):
+      raise frame_base.WontImplementError('order sensitive')
+
+    else:
+      # We could consider returning a deferred scalar, but that might
+      # be more surprising than a clear error.
+      raise frame_base.WontImplementError('non-deferred')
+
+    if isinstance(key, frame_base.DeferredBase):
+      # Fail early if key is a DeferredBase as it interacts surprisingly with
+      # key in self._expr.proxy().columns
+      raise NotImplementedError(
+          "Indexing with a deferred frame is not yet supported. Consider "
+          "using df.loc[...]")
+
+    if isinstance(key, slice):
+      types = set([type(key.start), type(key.stop), type(key.step)])
+      if types == {type(None)}:
+        # Empty slice is just a copy.
+        return frame_base.DeferredFrame.wrap(self._expr)
+      elif types in [{int}, {type(None), int}]:
+        # This depends on the contents of the index.
+        raise frame_base.WontImplementError(
+            'Use iloc or loc with integer slices.')

Review comment:
       Eventually we may make iloc work for integer indices; if not, users will 
at least get a better error there. The problem with directing users straight to 
loc is that `df.loc[ix]` is not a drop-in replacement for `df[ix]` here. In 
fact, it can behave quite differently, so we need to force people to think 
about what they're trying to do. 
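
To illustrate the point about `df.loc[ix]` versus `df[ix]`, here is a minimal sketch in plain (non-deferred) pandas. The frame and its values are made up for illustration, and the behavior shown assumes a recent pandas release with an integer index: an integer slice passed to `[]` is positional, while the same slice passed to `loc` is label-based and end-inclusive.

```python
import pandas as pd

# Hypothetical frame with an integer index that does not start at 0.
df = pd.DataFrame({'x': [10, 20, 30, 40]}, index=[1, 2, 3, 4])

# Plain [] with an integer slice selects rows by position (end exclusive).
by_position = df[1:3]

# .loc with the same slice selects rows by label (end inclusive).
by_label = df.loc[1:3]

# .iloc states the positional intent explicitly.
by_iloc = df.iloc[1:3]

print(list(by_position['x']))
print(list(by_label['x']))
```

The two slices return different rows from the same key, which is why pointing users at loc alone could silently change results.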

##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -453,17 +518,31 @@ def __getattr__(self, name):
 
   def __getitem__(self, key):
     # TODO: Replicate pd.DataFrame.__getitem__ logic
-    if isinstance(key, frame_base.DeferredBase):
+    if isinstance(key, DeferredSeries) and key._expr.proxy().dtype == bool:

Review comment:
       Ah, yes, done. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 503454)
    Time Spent: 27h 50m  (was: 27h 40m)

> Implement all pandas operations (or raise WontImplementError)
> -------------------------------------------------------------
>
>                 Key: BEAM-9547
>                 URL: https://issues.apache.org/jira/browse/BEAM-9547
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>            Reporter: Brian Hulette
>            Assignee: Robert Bradshaw
>            Priority: P2
>          Time Spent: 27h 50m
>  Remaining Estimate: 0h
>
> We should have an implementation for every DataFrame, Series, and GroupBy 
> method. Everything that's not actually implemented should get a default 
> implementation that raises WontImplementError.
> See https://github.com/apache/beam/pull/10757#discussion_r389132292
> Progress at the individual operation level is tracked in a 
> [spreadsheet|https://docs.google.com/spreadsheets/d/1hHAaJ0n0k2tw465ORs5tfdy4Lg0DnGWIQ53cLjAhel0/edit].
>  Consider requesting edit access if you'd like to help out.
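
The "default implementation that raises" idea in the description above can be sketched roughly as follows. This is only an illustration: `wont_implement_method`, `DeferredThing`, and the local `WontImplementError` class are hypothetical stand-ins, not Beam's actual `frame_base` helpers.

```python
class WontImplementError(NotImplementedError):
    """Stand-in error for operations that will not be supported."""


def wont_implement_method(name, reason):
    """Build a method stub that always raises WontImplementError."""
    def method(self, *args, **kwargs):
        raise WontImplementError(f"'{name}' is not supported: {reason}")
    method.__name__ = name
    return method


class DeferredThing:
    # Each unsupported operation gets an explicit stub instead of a
    # missing attribute, so callers see a clear, specific error.
    head = wont_implement_method('head', 'order sensitive')
    to_numpy = wont_implement_method('to_numpy', 'non-deferred')


try:
    DeferredThing().head(5)
except WontImplementError as e:
    message = str(e)
```

A stub like this fails at call time with the reason attached, rather than with an AttributeError that gives the user no hint about why the operation is unavailable.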



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
