[ 
https://issues.apache.org/jira/browse/ASTERIXDB-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599076#comment-15599076
 ] 

Wenhai commented on ASTERIXDB-1704:
-----------------------------------

BTW, I noticed that the fuzzyjoin branch on the latest master is 6 times slower 
than it was in the same environment two months ago. I am trying to figure out 
the reason. My experimental setup is as follows:
A single machine with two NCs, each configured with 12 partitions and a 15gb 
global memory budget, sort memory 320mb, join memory 320. nc.opt is 30gb.
The CPU runs at 2.8GHz.
The run requires almost no network communication (single machine) and almost 
no disk I/O (the machine has 128GB of memory in total).

> Fuzzy-join query is slow
> ------------------------
>
>                 Key: ASTERIXDB-1704
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1704
>             Project: Apache AsterixDB
>          Issue Type: Bug
>            Reporter: Taewoo Kim
>
> I have an issue with the prefix-based fuzzy join (non-index-based fuzzy 
> join) on a small dataset. The following query runs forever even on a dataset 
> with 200K records on 9 nodes, so each node holds only about 20,000 records. 
> Also, the records are not that big. 
> {code}
> count(
> for $o in dataset AmazonReview
> for $i in dataset AmazonReview
> where similarity-jaccard(word-tokens($o.reviewText), 
> word-tokens($i.reviewText)) >= 0.2 and $o.id < $i.id
> return {"oid":$o.reviewerID, "iid":$i.reviewerID}
> );
> {code}
> An example record is as follows.  
> {code}
> {
>   "reviewerID": "A2SUAM1J3GNN3B",
>   "asin": "0000013714",
>   "reviewerName": "J. McDonald",
>   "helpful": [2, 3],
>   "reviewText": "I bought this for my husband who plays the piano.  He is 
> having a wonderful time playing these old hymns.  The music  is at times hard 
> to read because we think the book was published for singing from more than 
> playing from.  Great purchase though!",
>   "overall": 5.0,
>   "summary": "Heavenly Highway Hymns",
>   "unixReviewTime": 1252800000,
>   "reviewTime": "09 13, 2009"
> }
> {code}
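
For readers unfamiliar with prefix-based fuzzy joins, the following is a minimal Python sketch of the general technique, not AsterixDB's implementation. A naive nested loop over 200K records would compare about 2 x 10^10 pairs. Prefix filtering avoids most of them: if the tokens of each record are sorted in one global order (here, rarest first), two records with Jaccard similarity >= t must share at least one token among their first n - ceil(t * n) + 1 tokens. All function names, the tokenizer, and the sample records below are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def word_tokens(text):
    """Lowercased word tokens as a set (a simplified stand-in for word-tokens())."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets are treated as 0.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prefix_length(n, threshold):
    """Number of leading tokens that must be indexed for a set of size n."""
    # The small epsilon guards against float error pushing ceil() one too high.
    return n - math.ceil(threshold * n - 1e-9) + 1


def fuzzy_self_join(records, threshold):
    """Return sorted (id_a, id_b) pairs with id_a < id_b and Jaccard >= threshold.

    `records` maps a hypothetical record id to its review text.
    """
    tokens = {rid: word_tokens(text) for rid, text in records.items()}
    # Global token order: rarest tokens first, ties broken alphabetically.
    freq = Counter(tok for toks in tokens.values() for tok in toks)
    index = {}          # prefix token -> ids of records already processed
    candidates = set()  # pairs that share at least one prefix token
    for rid in sorted(tokens):
        ordered = sorted(tokens[rid], key=lambda t: (freq[t], t))
        for tok in ordered[:prefix_length(len(ordered), threshold)]:
            for other in index.get(tok, []):
                candidates.add((other, rid))
            index.setdefault(tok, []).append(rid)
    # Verification step: compute the exact similarity only for candidates.
    return sorted(
        (a, b) for a, b in candidates
        if jaccard(tokens[a], tokens[b]) >= threshold
    )


if __name__ == "__main__":
    sample = {
        1: "great hymns for piano",
        2: "great hymns for singing",
        3: "terrible blender broke quickly",
    }
    print(fuzzy_self_join(sample, 0.5))
```

With a low threshold such as 0.2 the prefix covers most of each token set (for n = 10 it is 9 tokens), so the filter prunes far fewer pairs than it would at a high threshold. That may contribute to the cost of a query like the one above, although the sketch alone does not explain the reported slowdown.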


