Thanks, Taewoo. Do you think it's easier to apply these changes directly to Wenhai's "fuzzy branch"?

On Thu, Dec 3, 2015 at 5:51 AM, Taewoo Kim <[email protected]> wrote:

> @Wenhai:
>
> Replace NGramUTF8StringBinaryTokenizer.reset() with the following code as a
> quick temporary fix. The general fix is to move this tokenizer to the
> Asterix level so that it can recognize the NULL type tag and skip the
> token generation process.
>
> @Override
> public void reset(byte[] sentenceData, int start, int length) {
>     super.reset(sentenceData, start, length);
>     gramNum = 0;
>
>     int numChars = 0;
>     int pos = byteIndex;
>     int end = pos + sentenceUtf8Length;
>     while (pos < end) {
>         numChars++;
>         pos += UTF8StringUtil.charSize(sentenceData, pos);
>     }
>
>     if (usePrePost) {
>         totalGrams = numChars + gramLength - 1;
>     } else {
>         if (length >= gramLength) {
>             totalGrams = numChars - gramLength + 1;
>         } else {
>             totalGrams = 0;
>         }
>     }
> }
>
> Best,
> Taewoo
>
> On Tue, Dec 1, 2015 at 7:37 PM, Taewoo Kim (JIRA) <[email protected]> wrote:
>
>>
>> [
>> https://issues.apache.org/jira/browse/ASTERIXDB-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035204#comment-15035204
>> ]
>>
>> Taewoo Kim commented on ASTERIXDB-1208:
>> ---------------------------------------
>>
>> This error happens because the current tokenizer always assumes that it
>> sees a UTF8 string. In this case, it sees a NULL value. We need to add
>> logic to bypass tokenization when a NULL value is provided.
>>
>> > ngram tokenizer failure with negative length
>> > --------------------------------------------
>> >
>> >                 Key: ASTERIXDB-1208
>> >                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1208
>> >             Project: Apache AsterixDB
>> >          Issue Type: Bug
>> >          Components: Hyracks Core
>> >            Reporter: Wenhai
>> >            Assignee: Taewoo Kim
>> >
>> > drop dataverse test if exists;
>> > create dataverse test;
>> > use dataverse test;
>> > create type DBLPOpenType as open {
>> >   id: int64,
>> >   dblpid: string,
>> >   authors: string,
>> >   misc: string
>> > }
>> > create dataset DBLPOpen(DBLPOpenType) primary key id;
>> > insert into dataset DBLPOpen { "id": 93, "dblpid": "journals/iandc/IbarraJCR91", "authors": "Some Classes of Languages in NC¹", "misc": "2006-04-25 86-106 Inf. Comput. January 1991 90 1 db/journals/iandc/iandc90.html#IbarraJCR91" }
>> > use dataverse test;
>> > set import-private-functions 'true'
>> > for $d in dataset DBLPOpen
>> > where similarity-jaccard(gram-tokens("",3,false),gram-tokens($d.title,3,false)) >= 0.5
>> > return {"rec": $d}
>>
>>
>>
>> --
>> This message was sent by Atlassian JIRA
>> (v6.3.4#6332)
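
For anyone who wants to check the gram-count arithmetic from the quoted reset() outside of Hyracks, here is a rough standalone sketch. It is only an illustration under a few assumptions: GramCountSketch, charSize and totalGrams are made-up names, the input is plain standard UTF-8 with no length prefix or type tag, and a null array stands in for a NULL field. It also compares the character count (not the byte length) against the gram length, which I assume is the intent; that differs from the quoted snippet.

```java
import java.nio.charset.StandardCharsets;

class GramCountSketch {
    // Hypothetical helper (not the Hyracks UTF8StringUtil): byte width of the
    // standard UTF-8 character whose lead byte is at data[pos].
    static int charSize(byte[] data, int pos) {
        int b = data[pos] & 0xFF;
        if (b < 0x80) return 1;
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        return 4;
    }

    // Number of n-grams for the UTF-8 bytes in data[start, start + length).
    // Assumption: a null array or a negative length models a NULL field,
    // for which no tokens should be generated.
    static int totalGrams(byte[] data, int start, int length,
                          int gramLength, boolean usePrePost) {
        if (data == null || length < 0) {
            return 0;
        }
        // Count characters, not bytes, by stepping over each encoded character.
        int numChars = 0;
        int pos = start;
        int end = start + length;
        while (pos < end) {
            numChars++;
            pos += charSize(data, pos);
        }
        if (usePrePost) {
            // With pre/post padding every character starts or ends a gram.
            return numChars + gramLength - 1;
        }
        // Without padding, strings shorter than the gram length yield nothing.
        return numChars >= gramLength ? numChars - gramLength + 1 : 0;
    }

    public static void main(String[] args) {
        // "NC¹" is 3 characters but 4 bytes in UTF-8.
        byte[] s = "NC\u00B9".getBytes(StandardCharsets.UTF_8);
        System.out.println(totalGrams(s, 0, s.length, 3, false));  // prints 1
        System.out.println(totalGrams(s, 0, s.length, 3, true));   // prints 5
        System.out.println(totalGrams(null, 0, -1, 3, false));     // prints 0
    }
}
```

The early return is the part that corresponds to "bypass when a NULL value is provided"; in the real tokenizer that decision would have to come from the type tag rather than from a null reference.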
