reductionista commented on a change in pull request #571:
URL: https://github.com/apache/madlib/pull/571#discussion_r709652611



##########
File path: src/ports/postgres/modules/dbscan/dbscan.py_in
##########
@@ -164,76 +274,1271 @@ def dbscan(schema_madlib, source_table, output_table, id_column, expr_point, eps
         """.format(**locals())
         plpy.execute(sql)
 
+        # -- Generate output summary table ---
+
         output_summary_table = add_postfix(output_table, '_summary')
-        plpy.execute("DROP TABLE IF EXISTS {0}".format(output_summary_table))
 
+        plpy.execute("DROP TABLE IF EXISTS {0}".format(output_summary_table))
         sql = """
             CREATE TABLE {output_summary_table} AS
             SELECT  '{id_column}'::VARCHAR AS id_column,
+                    '{expr_point}'::VARCHAR AS expr_point,
                     {eps}::DOUBLE PRECISION AS eps,
+                    {min_samples}::BIGINT AS min_points,

Review comment:
       Hmm... I guess we could do that.
   
That reminds me, I was planning to ask @orhankislal about this. I'm guessing the main reason the summary table originally included only `eps` and `metric` is that both are required inputs for predict to work, while the others are not. But I'm not sure whether there were other motivations for keeping the summary table minimal (consistency with other modules?).
   
I added `min_samples` to this branch when I noticed how difficult its absence made some tasks. Suppose you have several DBSCAN output tables lying around from a while back, and you want to run DBSCAN on a new table that's similar to the others (e.g. scaled up or down with random sampling, noise added, a test/validation split, etc.), but you never wrote down what inputs were used. You'd have to regenerate all of the old outputs, since there is no way of knowing what value of `min_samples` was used. While different values of `min_samples` generally give different outputs, you can't infer the value from the output alone: it could be that there just weren't any clusters smaller than a certain size. That could be an important feature of your input data, or it could just be an artifact of the inputs used.
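
To make that concrete, here is the sort of lookup this change enables, as a rough sketch (the table name `old_dbscan_out` is hypothetical; the columns are the ones added to the summary table in this diff):

```sql
-- With min_points recorded, the inputs for an old run are a lookup away
-- instead of requiring the run to be repeated.
SELECT id_column, expr_point, eps, min_points
FROM old_dbscan_out_summary;
```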
   
I didn't add `algorithm` or `max_depth` because they are not inputs to DBSCAN in the computational sense of mapping a set of inputs to a set of outputs. If you know the input table, `eps`, and `min_samples`, those three together determine what the output looks like (aside from some slight randomness in which cluster a point near multiple clusters gets assigned to, due to the order in which points are considered). Neither `max_depth` nor `algorithm` affects the output; they only affect the runtime. So my thinking was... there's probably no reason a data scientist would care later by what means the output was generated, as long as they have all the inputs.
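
Concretely, what I mean is something like this (a sketch only: the argument order, the metric `dist_norm2`, and the table names are assumptions, not the exact interface):

```sql
-- Same input table and same (eps, min_samples); only the algorithm differs.
SELECT madlib.dbscan('points_in', 'out_brute', 'id', 'point', 1.5, 5, 'dist_norm2', 'brute_force');
SELECT madlib.dbscan('points_in', 'out_opt',   'id', 'point', 1.5, 5, 'dist_norm2', 'optimized');

-- Both runs should describe the same clustering, e.g. the same number of clusters
-- (up to relabeling of cluster ids and the tie-breaking mentioned above).
SELECT (SELECT count(DISTINCT cluster_id) FROM out_brute) AS n_clusters_brute,
       (SELECT count(DISTINCT cluster_id) FROM out_opt)   AS n_clusters_opt;
```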
   
I don't have any strong objection to including `algorithm` in the summary, but if we do, then I guess we should also include `max_depth`, since it also affects the runtime and is sort of like choosing a sub-algorithm among different variants of `optimized`... all of which should give the same result. There is one use case I can think of for it: performance comparisons of DBSCAN runs on similar datasets, such as scale testing. It would at least be useful for developers, if someone wants to extend or improve DBSCAN. And possibly for some end users, if they wanted to do some runtime calibration on smaller datasets before attempting something very large.
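
For that kind of calibration, I'd picture something like the following in psql, with the specific values, and the assumption that `max_depth` is passed as the last argument, being purely illustrative:

```sql
\timing on
-- Same inputs, different max_depth; only the elapsed time should change.
SELECT madlib.dbscan('points_sample', 'out_depth2', 'id', 'point', 1.5, 5, 'dist_norm2', 'optimized', 2);
SELECT madlib.dbscan('points_sample', 'out_depth4', 'id', 'point', 1.5, 5, 'dist_norm2', 'optimized', 4);
```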



