[I] The wcc algorithm output contains a large amount of duplicate data [geaflow]

via GitHub Tue, 03 Mar 2026 01:02:41 -0800


jyswpp opened a new issue, #761:
URL: https://github.com/apache/geaflow/issues/761


   I found that the output file from the WCC algorithm contains duplicate data, 
and I suspect that intermediate results of the algorithm were also exported.
   
   The WCC SQL is:
   
   CREATE GRAPH cc_graph_test (  
     Vertex nodes (  
       id bigint ID
     ),  
     Edge edges (  
       srcId bigint SOURCE ID,  
       targetId bigint DESTINATION ID
     )  
   ) WITH (  
     storeType='memory',  
     shardCount = 1  
   );  
     
   INSERT INTO cc_graph_test.nodes(id) VALUES  
   (1),  
   (2),  
   (3),  
   (4),  
   (5),  
   (6);  
     
   INSERT INTO cc_graph_test.edges VALUES  
   (1, 2),   
   (2, 3),  
   (4, 5),  
   (5, 6);  
   
   CREATE TABLE IF NOT EXISTS cc_geaflow_test (
     v_id int,
     k_value VARCHAR
   ) WITH (
       type='file',
       `geaflow.dsl.table.parallelism`= 64,
       `geaflow.dsl.source.parallelism` = 64,
       `geaflow.file.persistent.config.json` = '{\'*******'}',
       `geaflow.dsl.file.path` = '*******',
       `geaflow.dsl.column.separator`='\s'
   );
   
   USE GRAPH cc_graph_test;
   INSERT INTO cc_geaflow_test(v_id, k_value)
   CALL wcc() YIELD (vid, component)
   RETURN vid, component;
   
   The output is:
   1s1
   1s1
   2s1
   1s1
   2s1
   3s1
   1s1
   2s1
   3s1
   4s4
   1s1
   2s1
   3s1
   4s4
   5s4
   1s1
   2s1
   3s1
   4s4
   5s4
   6s4
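
The output above appears to follow a regular pattern: the first row, then the first two rows, then the first three, and so on up to all six. A minimal Python sketch to check this is shown below. It is not part of GeaFlow; the variable names are illustrative, and the rows are copied verbatim from the output above, with the literal `s` treated as the column separator as it appears in the file. If the check passes, the result is consistent with rows being re-emitted cumulatively, but it does not by itself show where in the pipeline the duplication happens.

```python
from collections import Counter

# Output lines copied verbatim from the report above.
raw_output = """
1s1
1s1
2s1
1s1
2s1
3s1
1s1
2s1
3s1
4s4
1s1
2s1
3s1
4s4
5s4
1s1
2s1
3s1
4s4
5s4
6s4
"""

# Split each line on the literal "s" that appears between the two columns.
rows = [tuple(line.split("s")) for line in raw_output.split()]

# How many times each (vid, component) pair was written.
counts = Counter(rows)

# Distinct rows in first-seen order: this is the result one would expect
# from a single run of WCC over six vertices.
distinct = list(dict.fromkeys(rows))

# Build the sequence "prefix of length 1, prefix of length 2, ..." and
# compare it with the actual output.
expected = [row for k in range(1, len(distinct) + 1) for row in distinct[:k]]
is_prefix_pattern = rows == expected

print("total rows:", len(rows))
print("distinct rows:", distinct)
print("cumulative-prefix pattern:", is_prefix_pattern)
```

The deduplicated `distinct` list gives one row per vertex, with vertices 1 to 3 in component 1 and vertices 4 to 6 in component 4, which matches the two connected components in the test graph.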

