esracosgun opened a new pull request, #2293: URL: https://github.com/apache/systemds/pull/2293
This pull request implements a deduplication pipeline for string tuples using distributed representations and locality-sensitive hashing (LSH), as required for the assignment. The following steps are covered:

- Distributed Representation: Each tuple is mapped to a dense vector by averaging its GloVe embeddings.
- LSH-based Blocking: Similar tuples are grouped into candidate buckets using random hyperplanes (LSH).
- Similarity Computation: Candidate pairs are compared using a similarity measure (cosine or Euclidean).
- Duplicate Filtering: Tuples whose similarity to an earlier tuple exceeds a threshold are marked as duplicates; only the first occurrence is retained as unique.

Outputs:

- Y_unique: Deduplicated tuples (first occurrence only).
- Y_duplicates: All detected duplicates removed from the input.
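The four steps above can be illustrated with a minimal, standard-library-only Python sketch. It is not the code from this pull request: the tiny `TOY_EMBEDDINGS` table is a hypothetical stand-in for pretrained GloVe vectors, and the function names, `num_planes`, `threshold`, and `seed` parameters are assumptions chosen for illustration. Only cosine similarity is shown.

```python
import math
import random
from collections import defaultdict

# Hypothetical toy embeddings standing in for pretrained GloVe vectors.
TOY_EMBEDDINGS = {
    "apple":  [1.0, 0.0, 0.0, 0.5],
    "inc":    [0.0, 1.0, 0.0, 0.5],
    "banana": [0.0, 0.0, 1.0, -0.5],
    "corp":   [0.5, -1.0, 0.0, 0.0],
}
DIM = 4


def embed(text):
    """Step 1: average the vectors of known tokens (unknown tokens are skipped)."""
    vectors = [TOY_EMBEDDINGS[t] for t in text.lower().split() if t in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def lsh_signature(vec, hyperplanes):
    """Step 2: one bit per hyperplane, set by the side the vector falls on."""
    return tuple(dot(vec, h) >= 0 for h in hyperplanes)


def deduplicate(tuples, num_planes=4, threshold=0.9, seed=42):
    rng = random.Random(seed)
    hyperplanes = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(num_planes)]
    vectors = [embed(t) for t in tuples]

    # Blocking: tuples with equal signatures share a candidate bucket.
    buckets = defaultdict(list)
    for i, vec in enumerate(vectors):
        buckets[lsh_signature(vec, hyperplanes)].append(i)

    # Steps 3 and 4: compare pairs within each bucket only; the later
    # index of a similar pair is marked as the duplicate.
    duplicate_idx = set()
    for members in buckets.values():
        for pos, i in enumerate(members):
            if i in duplicate_idx:
                continue
            for j in members[pos + 1:]:
                if j not in duplicate_idx and cosine(vectors[i], vectors[j]) >= threshold:
                    duplicate_idx.add(j)

    y_unique = [t for i, t in enumerate(tuples) if i not in duplicate_idx]
    y_duplicates = [t for i, t in enumerate(tuples) if i in duplicate_idx]
    return y_unique, y_duplicates


y_unique, y_duplicates = deduplicate(["apple inc", "Apple Inc", "banana corp", "apple inc"])
```

Because pairs are compared only within a bucket, near-duplicates that fall on opposite sides of some hyperplane are missed; using fewer hyperplanes or several independent hash tables trades extra comparisons for higher recall.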