[ https://issues.apache.org/jira/browse/MADLIB-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nandish Jayaram updated MADLIB-1368:
------------------------------------
Description:
Based on our findings in this JIRA, other modules may also suffer performance
hits because of the way we currently use the {{distributed by}} clause. After
going through the code, we noticed the following issues that are worth
exploring:
*Graph modules:*
1. {{apsp.py_in}} Does not misuse {{distributed by}}, but we noticed that it
creates an index for Postgres.
2. {{sssp.py_in}} Does not misuse {{distributed by}}, but we noticed that it
creates an index for Postgres. The JIRA tracking this and the previous issue is
https://issues.apache.org/jira/browse/MADLIB-1369
3. {{hits.py_in}} Uses {{distributed by}} with grouping; this must be changed.
4. {{pagerank.py_in}} Uses {{distributed by}} with grouping; this must be
changed.
5. {{wcc.py_in}} Uses {{distributed by}} with grouping; this must be changed.
The JIRA tracking this is https://issues.apache.org/jira/browse/MADLIB-1367
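To illustrate why distributing by grouping columns can hurt, here is a small
Python sketch. It assumes that the concern is data skew: when there are only a
few distinct groups, hashing rows on the grouping column places all of them on
a few segments. The hash function, the segment count, and the sample rows are
all hypothetical stand-ins, not what the database or MADlib actually uses.

```python
import hashlib
from collections import Counter


def segment_for(key, num_segments):
    """Map a distribution-key value to a segment id.

    Assumption: md5 is only a stable stand-in for the database's own hash.
    """
    digest = hashlib.md5(repr(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_segments


def rows_per_segment(rows, key_fn, num_segments):
    """Count how many rows land on each segment for a given distribution key."""
    counts = Counter(segment_for(key_fn(row), num_segments) for row in rows)
    return [counts.get(seg, 0) for seg in range(num_segments)]


# Hypothetical data: 10,000 vertices that belong to only 2 groups.
NUM_SEGMENTS = 8
rows = [(vertex_id, "g%d" % (vertex_id % 2)) for vertex_id in range(10000)]

# Distributing by the grouping column: at most 2 segments hold any rows.
by_group = rows_per_segment(rows, lambda r: r[1], NUM_SEGMENTS)

# Distributing by the vertex id: rows spread over all segments.
by_vertex = rows_per_segment(rows, lambda r: r[0], NUM_SEGMENTS)

print("by group :", by_group)
print("by vertex:", by_vertex)
```

With two groups, at most two of the eight simulated segments receive rows when
the grouping column is the key, while the vertex id spreads the rows across
all of them.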
*Non-Graph modules that use distributed by:*
1. {{logistic.py_in}} This is the only module that uses the group iteration
controller from {{group_control.py_in}}, which distributes the rel_state table
by the grouping columns. One possible fix is to remove the {{distributed by}}
clause in {{group_control.py_in}}.
2. {{path.py_in}} A temporary table created in path is distributed on multiple
columns; we must check whether that was intentional.
3. {{encode_categorical.py_in}} The output table creation query has a
{{distributed by}} clause that uses the distribution key supplied by the user
as an input param. What was the intention behind that optional param, or
rather, what is the expected behavior for a given param value?
4. {{bayes.py_in}} There are a couple of {{distributed by}} clauses. Check
whether they are intentional.
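As a starting point for the {{encode_categorical.py_in}} question, here is one
possible interpretation of an optional distribution key, sketched in Python.
The function name, parameter name, and the three behaviors are assumptions for
discussion only; this is not what the module currently does.

```python
def distribution_clause(distributed_by=None):
    """Build the distribution clause for a CREATE TABLE statement.

    Hypothetical semantics (assumed, not taken from MADlib):
      - None, empty, or only commas -> no clause, use the database default
      - 'randomly'                  -> DISTRIBUTED RANDOMLY
      - 'col1, col2'                -> DISTRIBUTED BY (col1, col2)
    """
    if distributed_by is None:
        return ""
    value = distributed_by.strip()
    if value.lower() == "randomly":
        return "DISTRIBUTED RANDOMLY"
    # Split a comma-separated column list and drop empty entries.
    cols = [c.strip() for c in value.split(",") if c.strip()]
    if not cols:
        return ""
    return "DISTRIBUTED BY ({0})".format(", ".join(cols))


print(distribution_clause())
print(distribution_clause("randomly"))
print(distribution_clause("id, grp"))
```

Writing the expected behavior down per param value in this form would make it
easier to decide which cases the module should actually support.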
was:
Based on our findings in this JIRA, there may be some performance hits in other
modules due to the way we use {{distributed by}} clause at the moment. After
going through the code, we noticed the following issues that we may want to
explore a bit:
*Graph modules:*
1. {{apsp.py_in}} This does not use distributed by the wrong way, but we
noticed it creates an index for Postgres.
2. {{sssp.py_in}} This does not use distributed by the wrong way, but we
noticed it creates an index for Postgres.
3. {{hits.py_in}} Uses distributed by with grouping, must be changed.
4. {{pagerank.py_in}} Uses distributed by with grouping, must be changed.
5. {{wcc.py_in}} Uses distributed by with grouping, must be changed. Jira to
track this is https://issues.apache.org/jira/browse/MADLIB-1367
*Non-Graph modules that use distributed by:*
1. {{logistic.py_in}} This is the only module that uses group iteration
controller from {{group_control.py_in}} which distributes rel_state table based
on grouping columns. The fix here could be to remove the distributed by clause
present in {{group_control.py_in}}.
2. {{path.py_in}} A temporary table created in path distributes it using
multiple columns, we must check if that was intentional.
3. {{encode_categorical.py_in}} The output table creation query has a
distributed by clause which uses the distribution key provided by the user as
an input param. What was the intention behind that optional param, or rather
what is the expected behavior for a given param value?
4. {{bayes.py_in}} There are a couple of distributed by clauses. Check if that
was intentional.
> Identify potential performance issues in modules using distributed by clause
> ----------------------------------------------------------------------------
>
> Key: MADLIB-1368
> URL: https://issues.apache.org/jira/browse/MADLIB-1368
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Module: Graph
> Reporter: Nandish Jayaram
> Priority: Major
> Fix For: v1.17
>