[ 
https://issues.apache.org/jira/browse/ASTERIXDB-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15598398#comment-15598398
 ] 

Taewoo Kim commented on ASTERIXDB-1704:
---------------------------------------

From [~lwhay]:
At first glance, I partially agree with Mike's opinion. In addition, I think a
threshold of 0.2 is too small. In our fuzzyjoin, this query is executed as a
prefix-based join plus a greater-than selection.
With this threshold, the prefix covers at least (or almost exactly) 80% of the
total tokens. Looking at the dataset, each record's field contains many tokens,
so the join needs a lot of memory and has very low pruning power, which means
you are almost doing a nested loop join with a huge memory requirement.
I will try this form of query on a similar dataset and then give my
suggestions in detail.
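
To make the 80% figure concrete, here is a minimal Python sketch of the
commonly cited prefix-length bound for Jaccard prefix filtering,
n - ceil(t * n) + 1 for a record with n tokens and threshold t. The function
name is hypothetical, and whether AsterixDB's fuzzyjoin uses exactly this bound
is an assumption on my part.

```python
import math

def jaccard_prefix_length(num_tokens, threshold):
    """Prefix length under the textbook bound n - ceil(t * n) + 1.

    Assumption: this is the standard prefix-filtering bound, not
    necessarily the exact one used inside AsterixDB.
    """
    if num_tokens == 0:
        return 0
    # The small epsilon guards against floating-point noise in t * n.
    return num_tokens - math.ceil(threshold * num_tokens - 1e-9) + 1

# Compare the prefix size for a 40-token record at several thresholds.
for t in (0.2, 0.6, 0.8):
    p = jaccard_prefix_length(40, t)
    print(f"threshold={t}: prefix={p} of 40 tokens ({p / 40:.1%})")
```

Under this bound a 40-token record keeps 33 prefix tokens at threshold 0.2 but
only 9 at threshold 0.8, which is why the low threshold prunes so little.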

From [~lwhay]:
One fact has been missed: the best threshold for fuzzyjoin is a float value
larger than 0.6, regardless of the token distribution. Below that, fuzzyjoin is
quite impractical, but even so, I guess it is still a better choice than a
nested loop join and comparable to the inverted-index-based join.
Let's wait for the results.

From [~lwhay]:
Based on initial results in an environment with 24 partitions in total, my
rough suggestions are as follows.
I ran a similar query:
count(
for $o in dataset DBLP
for $i in dataset CSX
where similarity-jaccard(word-tokens($o.title), word-tokens($i.title)) >= 0.8 
and $o.id < $i.id
return {"oid":$o.tid, "iid":$i.tid}
);
It ran for 280 seconds on 1.2 million DBLP records and 1.3 million CSX records
and returned 1586952.
Actually, I think the result cardinality is quite low in terms of the
so-called "RECALL/PRECISION" from the information retrieval field (it means
each record in DBLP has only about 1.5 similar records in CSX, so small!).
Even when we reduced the threshold to 0.6, it returned only 1742481 results in
1318 seconds.
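
For reference, a minimal Python sketch of what the join predicate computes for
one pair of titles. The two helpers are hypothetical stand-ins: I assume
lowercase whitespace tokenization and set semantics, which may differ from the
actual word-tokens() and similarity-jaccard() in AsterixDB, and the titles are
made up for illustration.

```python
def word_tokens(text):
    # Hypothetical stand-in for word-tokens(): lowercase, split on whitespace.
    return set(text.lower().split())

def similarity_jaccard(a, b):
    # Jaccard similarity of two token sets: |a & b| / |a | b|.
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)

# Made-up titles: 4 shared tokens out of 5 distinct tokens overall.
left = word_tokens("Efficient set similarity joins")
right = word_tokens("Efficient parallel set similarity joins")
score = similarity_jaccard(left, right)
print(score, score >= 0.8)
```

With short titles like these, one extra word is already enough to drop a pair
to the 0.8 boundary, so few pairs qualify at high thresholds.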

1. We can imagine the running time will be more than 10,000 seconds if we
choose a threshold of 0.2, since your total token count is on the same scale as
mine (my records average about 10 tokens each, while yours average about 40, I
think).
Furthermore,
you gave a threshold of 0.2 and ...
2. For an ordinary join query, why use such a small threshold? NOTICE
that the result cardinalities are NOT VERY DIFFERENT even when the thresholds
differ widely. (I think most retrieval researchers are interested in
record pairs that are "VERY SIMILAR", which is almost the same set as the pairs
that are "SIMILAR" in the context of this count-based join paradigm.)
3. If you can't get any result, I think maybe there is not enough memory, or
there are too few partitions, or we are just not PATIENT enough! :)
In short, I suggest we raise the threshold or pre-prune the fields' tokens
using retrieval techniques (such as a field-specific dictionary), OR add a
highly selective condition below the fuzzyjoin to reduce the join
cardinality.
Otherwise, we can ONLY wait for a long time. Hmm, maybe we can compare it with
the nested loop join and the inverted-index-based join and report some general
or empirical results.
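
As an illustration of the pre-pruning idea, here is a minimal Python sketch
that drops tokens found in a field-specific dictionary before any similarity
is computed. The dictionary contents, the helper name, and the tokenization
rule are all hypothetical; note also that pruning changes the similarity
scores, so a threshold would need to be re-tuned afterwards.

```python
import re

# Hypothetical field-specific dictionary: tokens judged too common in
# review text to help distinguish one record from another.
REVIEW_STOP_WORDS = {"i", "this", "for", "my", "who", "the", "a", "is", "to"}

def pruned_tokens(text, stop_words=REVIEW_STOP_WORDS):
    """Lowercase, split into alphanumeric words, and drop dictionary tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in stop_words}

# First sentence of the example record's reviewText.
sample = "I bought this for my husband who plays the piano."
print(sorted(pruned_tokens(sample)))
```

For this sentence the 10 distinct tokens shrink to 4, so each record
contributes fewer tokens to the prefix and to memory.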

> Fuzzy-join query is slow
> ------------------------
>
>                 Key: ASTERIXDB-1704
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1704
>             Project: Apache AsterixDB
>          Issue Type: Bug
>            Reporter: Taewoo Kim
>
> I have an issue regarding the prefix-based fuzzy join (non-index based fuzzy 
> join) on a small dataset. The following query runs forever even for a dataset 
> with 200K records on 9 nodes. So, each node only has 20,000 records. Also, 
> the record size is not that big. 
> {code}
> count(
> for $o in dataset AmazonReview
> for $i in dataset AmazonReview
> where similarity-jaccard(word-tokens($o.reviewText), 
> word-tokens($i.reviewText)) >= 0.2 and $o.id < $i.id
> return {"oid":$o.reviewerID, "iid":$i.reviewerID}
> );
> {code}
> An example record is as follows.  
> {code}
> {
>   "reviewerID": "A2SUAM1J3GNN3B",
>   "asin": "0000013714",
>   "reviewerName": "J. McDonald",
>   "helpful": [2, 3],
>   "reviewText": "I bought this for my husband who plays the piano.  He is 
> having a wonderful time playing these old hymns.  The music  is at times hard 
> to read because we think the book was published for singing from more than 
> playing from.  Great purchase though!",
>   "overall": 5.0,
>   "summary": "Heavenly Highway Hymns",
>   "unixReviewTime": 1252800000,
>   "reviewTime": "09 13, 2009"
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
