[ https://issues.apache.org/jira/browse/ASTERIXDB-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15598398#comment-15598398 ]
Taewoo Kim commented on ASTERIXDB-1704:
---------------------------------------

From [~lwhay]: At first glance, I partially agree with Mike's opinion. In addition, I think a threshold of 0.2 is too small. In our fuzzy join, this query will be compiled into a prefix-based join plus a greater-than selection. In that case, the prefix will cover at least (or almost) 80% of the total tokens. Note that this dataset has many tokens in each record's field, so the prefix filter will consume a lot of memory while providing very low pruning power; in effect, you would almost be doing a nested-loop join with a huge memory requirement. I will try this form of query on a similar dataset and then follow up with a detailed suggestion.

From [~lwhay]: One fact has been missed: the best threshold in fuzzy join is a value larger than 0.6, regardless of the token distribution. Below that, fuzzy join is quite impractical, though even so I suspect it is a better choice than a nested-loop join and comparable to the inverted-index-based join. Let's wait for the result.

From [~lwhay]: Here is an initial result from an environment with 24 partitions in total, along with my rough suggestions. I ran a similar query:

{code}
count(
for $o in dataset DBLP
for $i in dataset CSX
where similarity-jaccard(word-tokens($o.title), word-tokens($i.title)) >= 0.8
and $o.id < $i.id
return {"oid":$o.tid, "iid":$i.tid}
);
{code}

It took 280 seconds for 1.2 million DBLP records and 1.3 million CSX records and returned 1586952 pairs. Actually, I think the result cardinality is quite poor in terms of the "recall/precision" notions of the information-retrieval field: each DBLP record has only about 1.5 similar records in CSX, which is very small. Even when we reduced the threshold to 0.6, it returned only 1742481 results, in 1318 seconds.

1.
We can expect the running time to exceed 10,000 seconds with a threshold of 0.2, since your total token count is at the same scale as mine (my records average about 10 tokens each, while yours are about 40, I think). Furthermore, you gave a threshold of 0.2 and ...

2. As for the normal join query, why give such a small threshold? Notice that the returned cardinalities are not very different even between quite different thresholds. I think most retrieval researchers are interested in record pairs that are "very similar" (which here is almost the same set as merely "similar") in the context of these count-based join queries.

3. If you can't get any result, perhaps memory is insufficient, the number of partitions is too small, or we are simply not patient enough! :)

In short, I suggest we raise the threshold, or pre-prune the fields' tokens using retrieval techniques (such as a field-specific dictionary), or add a highly selective condition below the fuzzy join to reduce the join cardinality. Otherwise we can only wait for a very long time. Alternatively, we can compare it with the nested-loop join and the inverted-index-based join and report some general or empirical results.

> Fuzzy-join query is slow
> ------------------------
>
> Key: ASTERIXDB-1704
> URL: https://issues.apache.org/jira/browse/ASTERIXDB-1704
> Project: Apache AsterixDB
> Issue Type: Bug
> Reporter: Taewoo Kim
>
> I have an issue regarding the prefix-based fuzzy join (non-index-based fuzzy join) on a small dataset. The following query runs forever even for a dataset with 200K records on 9 nodes, so each node holds only about 20,000 records. The record size is also not that big.
> {code}
> count(
> for $o in dataset AmazonReview
> for $i in dataset AmazonReview
> where similarity-jaccard(word-tokens($o.reviewText),
> word-tokens($i.reviewText)) >= 0.2 and $o.id < $i.id
> return {"oid":$o.reviewerID, "iid":$i.reviewerID}
> );
> {code}
> An example record is as follows.
> {code}
> {
>   "reviewerID": "A2SUAM1J3GNN3B",
>   "asin": "0000013714",
>   "reviewerName": "J. McDonald",
>   "helpful": [2, 3],
>   "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
>   "overall": 5.0,
>   "summary": "Heavenly Highway Hymns",
>   "unixReviewTime": 1252800000,
>   "reviewTime": "09 13, 2009"
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
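For reference, the prefix-filter arithmetic [~lwhay] describes can be sketched in a few lines of Python. This is a minimal sketch of the standard prefix-filter bound for Jaccard joins (prefix length = n - ceil(t * n) + 1), not AsterixDB's actual implementation; the helper names are made up for illustration. It shows why a 0.2 threshold forces the prefix to cover most of a ~40-token review text, while 0.8 keeps it small:

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two token lists, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def prefix_length(n_tokens, threshold):
    """Tokens the prefix filter must consider for a record of n_tokens
    at a given Jaccard threshold t: n - ceil(t * n) + 1."""
    return n_tokens - math.ceil(threshold * n_tokens) + 1

# For a ~40-token reviewText (the average suggested in the discussion):
print(prefix_length(40, 0.8))   # 9  -> prefix covers ~22% of tokens
print(prefix_length(40, 0.2))   # 33 -> prefix covers ~82% of tokens

# Two overlapping token lists share 5 of 7 distinct tokens:
print(jaccard("i bought this for my husband".split(),
              "i bought this for my wife".split()))  # 5/7 ~= 0.71
```

With almost every token in the prefix at t = 0.2, nearly all record pairs survive the filter, which matches the observation that the join degenerates toward a nested-loop join.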