reductionista commented on a change in pull request #533:
URL: https://github.com/apache/madlib/pull/533#discussion_r562996777



##########
File path: src/ports/postgres/modules/utilities/utilities.py_in
##########
@@ -75,6 +75,35 @@ def get_segments_per_host():
         return max(1, count)
 # 
------------------------------------------------------------------------------
 
+def get_data_distribution_per_segment(table_name):
+    """
+    Returns a list with count of segments on each host that the input
+    table's data is distributed on.
+    :param table_name: input table name
+    :return: list with count of segments on each host that the input
+    table's data is distributed on.
+    len(return list) = total num of segments in cluster
+    """
+    if is_platform_pg():
+        return [1]
+    else:
+        res = plpy.execute("""
+                    WITH CTE AS (SELECT DISTINCT(gp_segment_id)
+                                 FROM {table_name})
+                    SELECT content, count as cnt
+                        FROM gp_segment_configuration
+                        JOIN (SELECT hostname, count(*)
+                              FROM gp_segment_configuration
+                              WHERE content in (SELECT * FROM cte)
+                              GROUP BY hostname) a
+                        USING (hostname)
+                        WHERE content in (SELECT * FROM cte)
+                    ORDER BY 1""".format(table_name=table_name))
+        data_distribution_per_segment = [0] * get_seg_number()
+        for r in res:
+            data_distribution_per_segment[r['content']] = int(r['cnt'])
+        return data_distribution_per_segment

Review comment:
       Something bothers me about this function, but I'm not sure exactly how 
to improve it.  Here are a few intersecting thoughts I'm having on it that make 
me think it ought to be refactored a bit:
   
   -  Placing this in the utilities module only makes sense if we think it 
might be useful at some point outside of the DL module.  If it's specific to 
DL, then it should be a helper function in that module somewhere.
   
   -  If it's going to be generally useful, then it should have a simple 
purpose that's easily understandable and reflected in the name of the function. 
 I would not have guessed from the name that this returns the number of 
non-empty segments per host for each host... but in a somewhat unusual format 
where the segments per host is duplicated for each non-empty segment, but set 
to zero for each empty segment.
   
   -  It seems like we're trying to do several things at once in this query, 
which makes it somewhat difficult to read.  That may be a sign that there is a 
more straightforward approach.
   
   I'm glad it is documented, which makes the name not matter as much... but 
even the documentation I find a little bit confusing.
   
   I'm not sure what would be best here, but I think I'd rename it and try to 
simply, and possibly split it into two functions.
   
   There are two main pieces of information returned here:  One is the segment 
count on each host.  And the other is a list of which segments have data on 
them.  I could imagine either of them possibly being useful outside of DL, but 
much less likely both at once.  I'm not sure we even need the 0 entries, since 
the segments without data will never get called anyway.
   
   Maybe like `get_data_segments_per_host(table)` and it returns a map from 
hosts to segments with data on them?  eg.  `{ 'host1' : [0, 1],  'host2' : [5, 
6], 'host3': [9, 10] }` ?  (Then the caller could be responsible for 
transforming it with python into whatever format is needed for that case?)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to