esracosgun opened a new pull request, #2293:
URL: https://github.com/apache/systemds/pull/2293

   This pull request implements a deduplication pipeline for string tuples 
using distributed representations and locality-sensitive hashing (LSH), as 
required for the assignment. It covers the following steps:
   - Distributed Representation: Each tuple is mapped to a dense vector using 
average GloVe embeddings.
   - LSH-based Blocking: Similar tuples are grouped into candidate buckets 
by hashing their vectors against random hyperplanes.
   - Similarity Computation: Candidate pairs are compared using a similarity 
measure (cosine or Euclidean).
   - Duplicate Filtering: Tuples whose similarity to an earlier tuple exceeds 
a threshold are marked as duplicates; only the first occurrence is retained 
as unique.
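
The first two steps can be illustrated with a minimal Python sketch. It is not the code from this pull request: the toy embedding table stands in for pretrained GloVe vectors, and the names `TOY_EMBEDDINGS`, `embed`, `make_hyperplanes`, `lsh_signature`, and `bucketize` are hypothetical. Whitespace tokenization and the zero vector for unknown tokens are also assumptions.

```python
import random

# Toy stand-in for a pretrained GloVe table (assumption: real vectors would be loaded from a file).
TOY_EMBEDDINGS = {
    "john": [0.9, 0.1, 0.0],
    "jon": [0.85, 0.15, 0.05],
    "smith": [0.1, 0.8, 0.1],
    "berlin": [0.0, 0.2, 0.9],
}
DIM = 3


def embed(text):
    """Average the vectors of all known tokens; return a zero vector if none is known."""
    vecs = [TOY_EMBEDDINGS[t] for t in text.lower().split() if t in TOY_EMBEDDINGS]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplane normals from a standard Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def lsh_signature(vec, planes):
    """One bit per hyperplane: 1 if vec lies on the non-negative side, else 0."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def bucketize(vectors, planes):
    """Group row indices by signature; rows sharing a bucket become candidate pairs."""
    buckets = {}
    for idx, vec in enumerate(vectors):
        buckets.setdefault(lsh_signature(vec, planes), []).append(idx)
    return buckets
```

Calling `bucketize([embed(t) for t in tuples], make_hyperplanes(8, DIM))` would return a mapping from bit signatures to row indices. More hyperplanes give smaller buckets and fewer candidate pairs, at the cost of possibly separating true duplicates.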
   
   Outputs:
   - Y_unique: Deduplicated tuples (first occurrence only).
   - Y_duplicates: All detected duplicates that were removed from the input.
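
How the two outputs could be derived from the candidate buckets is sketched below in Python. This is an illustration under stated assumptions, not the implementation in this pull request: the function names, the default threshold of 0.95, the choice of cosine similarity (the Euclidean option is omitted), and the rule that a tuple is only compared against earlier tuples that were themselves retained are all hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(tuples, vectors, buckets, threshold=0.95):
    """Split tuples into (y_unique, y_duplicates), both in input order.

    buckets maps an LSH signature to the list of row indices that share it.
    """
    duplicate_idx = set()
    for members in buckets.values():
        # Pairs (i, j) with i < j, so the earlier row is always the one kept.
        for i, j in combinations(sorted(members), 2):
            if i in duplicate_idx or j in duplicate_idx:
                continue  # assumption: compare only against retained rows
            if cosine_similarity(vectors[i], vectors[j]) > threshold:
                duplicate_idx.add(j)
    y_unique = [t for k, t in enumerate(tuples) if k not in duplicate_idx]
    y_duplicates = [t for k, t in enumerate(tuples) if k in duplicate_idx]
    return y_unique, y_duplicates
```

Because only pairs inside the same bucket are compared, the number of similarity computations depends on bucket sizes rather than on the square of the input size.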


