[
https://issues.apache.org/jira/browse/ASTERIXDB-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475341#comment-15475341
]
Taewoo Kim edited comment on ASTERIXDB-1487 at 9/8/16 11:26 PM:
----------------------------------------------------------------
What's the main reason here?
was (Author: wangsaeu):
What's the main issue here?
> Fuzzy select-join on inverted index poses inconsistent results.
> ---------------------------------------------------------------
>
> Key: ASTERIXDB-1487
> URL: https://issues.apache.org/jira/browse/ASTERIXDB-1487
> Project: Apache AsterixDB
> Issue Type: Bug
> Components: AsterixDB
> Environment: MAC 4 cores, 8GB memory. The current master till
> 3/17/2016.
> Reporter: Wenhai
> Assignee: Wenhai
> Priority: Critical
> Attachments: csx-small-multi-id.txt, dblp-small-multi-id.txt
>
>
> As shown in below. After we switching the two "for" branches of the fuzzy
> join over a select, the results are consistent.
> Schema
> {noformat}
> drop dataverse test if exists;
> create dataverse test;
> use dataverse test;
> create type DBLPNestedType as closed {
> id: int64,
> dblpid: string,
> title: string,
> authors: string,
> misc: string
> }
> create type DBLPType as closed {
> nested: DBLPNestedType
> }
> create type CSXNestedType as closed {
> id: int64,
> csxid: string,
> title: string,
> authors: string,
> misc: string
> }
> create type CSXType as closed {
> nested: CSXNestedType
> }
> create dataset DBLPtmp(DBLPNestedType) primary key id;
> create dataset CSXtmp(CSXNestedType) primary key id;
> create dataset DBLP(DBLPType) primary key nested.id;
> create dataset CSX(CSXType) primary key nested.id;
> use dataverse test;
> load dataset DBLPtmp
> using localfs
> (("path"="asterix_nc1://data/dblp-small/dblp-small-multi-id.txt"),("format"="delimited-text"),("delimiter"=":"),("quote"="\u0000"))
> pre-sorted;
> load dataset CSXtmp
> using localfs
> (("path"="asterix_nc1://data/pub-small/csx-small-multi-id.txt"),("format"="delimited-text"),("delimiter"=":"),("quote"="\u0000"));
> insert into dataset DBLP(
> for $x in dataset DBLPtmp
> return {
> "nested": $x
> }
> );
> insert into dataset CSX(
> for $x in dataset CSXtmp
> return {
> "nested": $x
> }
> );
> {noformat}
> Indexes
> {noformat}
> create index keyword_index on DBLP(nested.title) type keyword;
> create index keyword_indexdbauhors on DBLP(nested.authors) type keyword;
> create index keyword_indexcsxauthors on CSX(nested.authors) type keyword;
> {noformat}
> The following query
> {noformat}
> use dataverse test;
> set simthresholds '.1'
> let $s := count(
> for $o in dataset DBLP
> for $t in dataset CSX
> where contains($o.nested.title, "System") and word-tokens($o.nested.authors)
> ~= word-tokens($t.nested.authors)
> return $o
> )
> return $s
> {noformat}
> will return 28, while the query
> {noformat}
> use dataverse test;
> set simthresholds '.1'
> let $s := count(
> for $t in dataset CSX
> for $o in dataset DBLP
> where contains($o.nested.title, "System") and word-tokens($o.nested.authors)
> ~= word-tokens($t.nested.authors)
> return $o
> )
> return $s
> {noformat}
> will return 3 or pose a error in a big dataset.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)