Re: [jira] [Commented] (ASTERIXDB-1208) ngram tokenizer failure with negative length

Chen Li Thu, 03 Dec 2015 08:36:20 -0800

Thanks, Taewoo.  Do you think it's easier to apply these changes
directly to Wenhai's "fuzzy branch"?



On Thu, Dec 3, 2015 at 5:51 AM, Taewoo Kim <[email protected]> wrote:
> @Wenhai:
>
> Replace NGramUTF8StringBinaryTokenizer.reset() with the following code as a
> quick temporary fix. The general fix needs to move this tokenizer to the
> Asterix level so that it can recognize the NULL type tag and skip the
> token generation process.
>
>     @Override
>     public void reset(byte[] sentenceData, int start, int length) {
>         super.reset(sentenceData, start, length);
>         gramNum = 0;
>
>         int numChars = 0;
>         int pos = byteIndex;
>         int end = pos + sentenceUtf8Length;
>         while (pos < end) {
>             numChars++;
>             pos += UTF8StringUtil.charSize(sentenceData, pos);
>         }
>
>         if (usePrePost) {
>             totalGrams = numChars + gramLength - 1;
>         } else {
>             if (length >= gramLength) {
>                 totalGrams = numChars - gramLength + 1;
>             } else {
>                 totalGrams = 0;
>             }
>         }
>     }
>
> Best,
> Taewoo
>
> On Tue, Dec 1, 2015 at 7:37 PM, Taewoo Kim (JIRA) <[email protected]> wrote:
>
>>
>>     [
>> https://issues.apache.org/jira/browse/ASTERIXDB-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035204#comment-15035204
>> ]
>>
>> Taewoo Kim commented on ASTERIXDB-1208:
>> ---------------------------------------
>>
>> This error happens because the current tokenizer always assumes that it
>> sees a UTF8 string. In this case, it sees a NULL value. We need to add
>> logic to bypass tokenization when a NULL value is provided.
>>
>> > ngram tokenizer failure with negative length
>> > --------------------------------------------
>> >
>> >                 Key: ASTERIXDB-1208
>> >                 URL:
>> https://issues.apache.org/jira/browse/ASTERIXDB-1208
>> >             Project: Apache AsterixDB
>> >          Issue Type: Bug
>> >          Components: Hyracks Core
>> >            Reporter: Wenhai
>> >            Assignee: Taewoo Kim
>> >
>> > drop dataverse test if exists;
>> > create dataverse test;
>> > use dataverse test;
>> > create type DBLPOpenType as open {
>> >   id: int64,
>> >   dblpid: string,
>> >   authors: string,
>> >   misc: string
>> > }
>> > create dataset DBLPOpen(DBLPOpenType) primary key id;
>> > insert into dataset DBLPOpen { "id": 93, "dblpid":
>> "journals/iandc/IbarraJCR91", "authors": "Some Classes of Languages in
>> NC¹", "misc": "2006-04-25 86-106 Inf. Comput. January 1991 90 1
>> db/journals/iandc/iandc90.html#IbarraJCR91" }
>> > use dataverse test;
>> > set import-private-functions 'true'
>> > for $d in dataset DBLPOpen
>> > where
>> similarity-jaccard(gram-tokens("",3,false),gram-tokens($d.title,3,false))
>> >= 0.5
>> > return {"rec": $d}
>>
>>
>>
>> --
>> This message was sent by Atlassian JIRA
>> (v6.3.4#6332)
>>
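
For reference, the gram-count arithmetic in the reset() fix quoted above can
be tried in isolation. The following is a minimal sketch, not the Hyracks
code: the class GramCountSketch and the helpers countChars and totalGrams are
hypothetical names, countChars counts UTF-8 code points instead of calling
UTF8StringUtil.charSize, and the non-padded branch guards on the character
count, which is an assumption about the intent of that check.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch; names are illustrative, not from Hyracks.
class GramCountSketch {

    // Count characters in a UTF-8 byte range by skipping continuation
    // bytes (those of the form 10xxxxxx).
    static int countChars(byte[] data, int start, int length) {
        int n = 0;
        for (int i = start; i < start + length; i++) {
            if ((data[i] & 0xC0) != 0x80) {
                n++;
            }
        }
        return n;
    }

    // Number of n-grams for a string of numChars characters.
    // With pre/post padding, every character starts a gram and
    // gramLength - 1 extra grams come from the padding.
    // Without padding, a string shorter than gramLength yields no
    // grams; the count is clamped to 0 instead of going negative.
    static int totalGrams(int numChars, int gramLength, boolean usePrePost) {
        if (usePrePost) {
            return numChars + gramLength - 1;
        }
        return numChars >= gramLength ? numChars - gramLength + 1 : 0;
    }

    public static void main(String[] args) {
        byte[] s = "NC\u00B9".getBytes(StandardCharsets.UTF_8);
        int numChars = countChars(s, 0, s.length);
        System.out.println("bytes=" + s.length + " chars=" + numChars);
        System.out.println("grams(3, no padding)=" + totalGrams(numChars, 3, false));
        System.out.println("grams for empty input=" + totalGrams(0, 3, false));
    }
}
```

With gramLength 3 and no pre/post padding, the 3-character string "NC¹"
(4 bytes in UTF-8) yields 1 gram, and an empty input yields 0 grams.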
