villebro commented on a change in pull request #15279:
URL: https://github.com/apache/superset/pull/15279#discussion_r656877011



##########
File path: superset/utils/pandas_postprocessing.py
##########
@@ -425,9 +429,11 @@ def select(
 
 
 @validate_column_args("columns")
-def diff(df: DataFrame, columns: Dict[str, str], periods: int = 1,) -> DataFrame:
+def diff(
+    df: DataFrame, columns: Dict[str, str], periods: int = 1, axis: int = 0,

Review comment:
       Let's be more explicit about the axis type (I personally never remember for certain what 0 and 1 mean for axis). If Pandas doesn't yet expose a type for this, maybe something like
   
    ```python
    from enum import Enum


    class PandasPostprocessingAxis(int, Enum):
        ROW = 0
        COLUMN = 1
    ```
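    For illustration only, a minimal sketch of how `diff` could accept such an enum; it repeats the enum definition to stay self-contained and simplifies the column-append step, so it is not the PR's actual implementation:

    ```python
    from enum import Enum
    from typing import Dict

    from pandas import DataFrame


    class PandasPostprocessingAxis(int, Enum):
        ROW = 0
        COLUMN = 1


    def diff(
        df: DataFrame,
        columns: Dict[str, str],
        periods: int = 1,
        axis: PandasPostprocessingAxis = PandasPostprocessingAxis.ROW,
    ) -> DataFrame:
        # int-backed Enum members are ints, so pandas accepts them directly as axis
        df_diff = df[list(columns.keys())].diff(periods=periods, axis=axis)
        # simplified stand-in for _append_columns: attach the diffed values
        # under the target column names from the `columns` mapping
        return df.assign(**{tgt: df_diff[src] for src, tgt in columns.items()})
    ```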

##########
File path: superset/common/query_context.py
##########
@@ -97,6 +101,62 @@ def __init__(  # pylint: disable=too-many-arguments
             "result_format": self.result_format,
         }
 
+    def processing_time_offset(
+        self, df: pd.DataFrame, query_object: QueryObject,
+    ) -> Tuple[pd.DataFrame, List[str]]:
+        # ensure query_object is immutable
+        query_object_clone = copy.copy(query_object)
+        rv_sql = []
+
+        time_offset = query_object.time_offset
+        outer_from_dttm = query_object.from_dttm
+        outer_to_dttm = query_object.to_dttm
+        for offset in time_offset:
+            try:
+                query_object_clone.from_dttm = get_past_or_future(
+                    offset, outer_from_dttm,
+                )
+                query_object_clone.to_dttm = get_past_or_future(offset, outer_to_dttm,)
+            except ValueError as ex:
+                raise QueryObjectValidationError(str(ex))
+            # make sure subquery use main query where clause
+            query_object_clone.inner_from_dttm = outer_from_dttm
+            query_object_clone.inner_to_dttm = outer_to_dttm
+            query_object_clone.time_offset = []

Review comment:
       I wonder if we should add `time_offset` to the `QueryObject` schema and 
rename the current one to `time_offsets`, adding both to the cache key. Example:
   
   We want to make a query with two offsets: one year ago and two years ago.
   
   The "actual" main query that gets executed and cached (no additional columns 
added yet):
   ```python
   time_offsets: None
   time_offset: None
   ```
   
   First extra query (gets concatenated to the previous dataframe):
   ```python
   time_offsets: None
   time_offset: 1
   ```
   
   Second extra query (also concatenated to the main dataframe):
   ```python
   time_offsets: None
   time_offset: 2
   ```
   
   
   Main `QueryObject`:
   ```python
   time_offsets: [1, 2]
   time_offset: None
   ```
   
    Finally, when the full query object is constructed, each of the results above would be cached under its own key.
    
    This way the main query result would be persisted alongside the results of the extra queries, without needing to rebuild the full dataframe on each request, and each extra query could also be cached individually.
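    To make the caching idea concrete, here is a rough, hypothetical sketch of building one cache key per query. The helper name and key fields are illustrative, not the actual `QueryContext` API:

    ```python
    from typing import Any, Dict, List, Optional


    def offset_cache_keys(
        base_key: Dict[str, Any], time_offsets: Optional[List[Any]]
    ) -> List[Dict[str, Any]]:
        """Hypothetical helper: one key for the main query plus one per offset."""
        # the main query result is cached with both fields unset
        keys = [{**base_key, "time_offsets": None, "time_offset": None}]
        # each extra query result is cached under its own `time_offset`
        for offset in time_offsets or []:
            keys.append({**base_key, "time_offsets": None, "time_offset": offset})
        return keys


    # e.g. offsets of one and two years ago -> three cache keys in total
    print(offset_cache_keys({"datasource": "table:1"}, [1, 2]))
    ```
    Each offset result could then be looked up and reused independently of the main query.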

##########
File path: superset/utils/pandas_postprocessing.py
##########
@@ -436,14 +442,68 @@ def diff(df: DataFrame, columns: Dict[str, str], periods: int = 1,) -> DataFrame
            on diff values calculated from `y`, leaving the original column `y`
            unchanged.
     :param periods: periods to shift for calculating difference.
+    :param axis: 0 to diff along rows, 1 to diff along columns. Defaults to 0.
     :return: DataFrame with diffed columns
     :raises QueryObjectValidationError: If the request is incorrect
     """
     df_diff = df[columns.keys()]
-    df_diff = df_diff.diff(periods=periods)
+    df_diff = df_diff.diff(periods=periods, axis=axis)
     return _append_columns(df, df_diff, columns)
 
 
+@validate_column_args("source_columns", "compare_columns")
+def compare(
+    df: DataFrame,
+    source_columns: List[str],
+    compare_columns: List[str],
+    compare_type: Optional[str],

Review comment:
       Same here: `class PandasPostprocessingCompare(str, Enum)`
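    A hedged sketch of that suggestion; the member names and string values below are placeholders rather than the ones the PR should necessarily use:

    ```python
    from enum import Enum


    class PandasPostprocessingCompare(str, Enum):
        # str-backed members compare equal to their plain-string values,
        # so existing string-based payloads keep working
        DIFFERENCE = "difference"
        PERCENTAGE = "percentage"
        RATIO = "ratio"
    ```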

##########
File path: superset/common/query_context.py
##########
@@ -130,11 +191,16 @@ def get_query_result(self, query_object: QueryObject) -> Dict[str, Any]:
             if self.enforce_numerical_metrics:
                 self.df_metrics_to_num(df, query_object)
 
+            if query_object.time_offset:
+                df, offset_sql = self.processing_time_offset(df, query_object)
+                query += ";\n\n".join(offset_sql)
+                query += ";\n\n"

Review comment:
       We probably don't need this line?
   ```suggestion
   ```

##########
File path: superset/common/query_object.py
##########
@@ -94,6 +96,7 @@ class QueryObject:
     datasource: Optional[BaseDatasource]
     result_type: Optional[ChartDataResultType]
     is_rowcount: bool
+    time_offset: List[str]

Review comment:
       I think it would be great to deprecate the current string-based offset and use a periods-based approach, much like in rolling windows, using the time grain as the offset unit.
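    As a purely hypothetical illustration (these payloads are not the PR's current schema), a string-based offset next to a periods-based one that reuses the query's time grain as the unit:

    ```python
    # current string-based offset: the unit is encoded in the string itself
    query_object = {
        "time_grain": "P1Y",
        "time_offset": ["1 year ago"],
    }

    # hypothetical periods-based offset: shift by two time-grain units
    # (two years when the grain is P1Y), analogous to periods in rolling windows
    query_object = {
        "time_grain": "P1Y",
        "time_offset": [2],
    }
    ```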



