reductionista commented on a change in pull request #533: URL: https://github.com/apache/madlib/pull/533#discussion_r562996777
########## File path: src/ports/postgres/modules/utilities/utilities.py_in ########## @@ -75,6 +75,35 @@ def get_segments_per_host(): return max(1, count) # ------------------------------------------------------------------------------ +def get_data_distribution_per_segment(table_name): + """ + Returns a list with count of segments on each host that the input + table's data is distributed on. + :param table_name: input table name + :return: list with count of segments on each host that the input + table's data is distributed on. + len(return list) = total num of segments in cluster + """ + if is_platform_pg(): + return [1] + else: + res = plpy.execute(""" + WITH CTE AS (SELECT DISTINCT(gp_segment_id) + FROM {table_name}) + SELECT content, count as cnt + FROM gp_segment_configuration + JOIN (SELECT hostname, count(*) + FROM gp_segment_configuration + WHERE content in (SELECT * FROM cte) + GROUP BY hostname) a + USING (hostname) + WHERE content in (SELECT * FROM cte) + ORDER BY 1""".format(table_name=table_name)) + data_distribution_per_segment = [0] * get_seg_number() + for r in res: + data_distribution_per_segment[r['content']] = int(r['cnt']) + return data_distribution_per_segment Review comment: Something bothers me about this function, but I'm not sure exactly how to improve it. Here are a few intersecting thoughts I'm having on it that make me think it ought to be refactored a bit: - Placing this in the utilities module only makes sense if we think it might be useful at some point outside of the DL module. If it's specific to DL, then it should be a helper function in that module somewhere. - If it's going to be generally useful, then it should have a simple purpose that's easily understandable and reflected in the name of the function. I would not have guessed from the name that this returns the number of non-empty segments per host for each host... but in a somewhat unusual format where the segments per host is duplicated for each non-empty segment, but set to zero for each empty segment. - It seems like we're trying to do several things at once in this query, which makes it somewhat difficult to read. That may be a sign that there is a more straightforward approach. I'm glad it is documented, which makes the name not matter as much... but even the documentation I find a little bit confusing. I'm not sure what would be best here, but I think I'd rename it and try to simply, and possibly split it into two functions. There are two main pieces of information returned here: One is the segment count on each host. And the other is a list of which segments have data on them. I could imagine either of them possibly being useful outside of DL, but much less likely both at once. I'm not sure we even need the 0 entries, since the segments without data will never get called anyway. Maybe like `get_data_segments_per_host(table)` and it returns a map from hosts to segments with data on them? eg. `{ 'host1' : [0, 1], 'host2' : [5, 6], 'host3': [9, 10] }` ? (Then the caller could be responsible for transforming it with python into whatever format is needed for that case?) ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org