Re: Re: [jira] [Commented] (ASTERIXDB-1208) ngram tokenizer failure with negative length

李文海 Thu, 03 Dec 2015 18:53:15 -0800

> -----Original Message-----
> From: "Taewoo Kim" <[email protected]>
> Sent: Friday, December 4, 2015
> To: [email protected]
> Cc: 
> Subject: Re: [jira] [Commented] (ASTERIXDB-1208) ngram tokenizer failure with negative length
> 
> Yes. All we need to do is change one method. However, this is a temporary
> fix. I will investigate further once I have more time.
> 
> On Thu, Dec 3, 2015 at 08:34 Chen Li <[email protected]> wrote:
> 
> > Thanks, Taewoo.  Do you think it's easier to apply these changes
> > directly to Wenhai's "fuzzy branch"?
> >
> >
> > On Thu, Dec 3, 2015 at 5:51 AM, Taewoo Kim <[email protected]> wrote:
> > > @Wenhai:
> > >
> > > Replace NGramUTF8StringBinaryTokenizer.reset() with the following code as a
> > > quick temporary fix. The general fix needs to move this tokenizer to the
> > > Asterix level so that it can properly recognize the NULL type tag and
> > > skip the token generation process.
> > >
> > >     @Override
> > >     public void reset(byte[] sentenceData, int start, int length) {
> > >         super.reset(sentenceData, start, length);
> > >         gramNum = 0;
> > >
> > >         // Count the UTF-8 characters (not bytes) in the string.
> > >         int numChars = 0;
> > >         int pos = byteIndex;
> > >         int end = pos + sentenceUtf8Length;
> > >         while (pos < end) {
> > >             numChars++;
> > >             pos += UTF8StringUtil.charSize(sentenceData, pos);
> > >         }
> > >
> > >         if (usePrePost) {
> > >             totalGrams = numChars + gramLength - 1;
> > >         } else {
> > >             // Compare the character count, not the byte length, with
> > >             // gramLength so that totalGrams can never become negative.
> > >             if (numChars >= gramLength) {
> > >                 totalGrams = numChars - gramLength + 1;
> > >             } else {
> > >                 totalGrams = 0;
> > >             }
> > >         }
> > >     }
> > >
> > > Best,
> > > Taewoo
> > >
> > > On Tue, Dec 1, 2015 at 7:37 PM, Taewoo Kim (JIRA) <[email protected]> wrote:
> > >
> > >>
> > >>     [ https://issues.apache.org/jira/browse/ASTERIXDB-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035204#comment-15035204 ]
> > >>
> > >> Taewoo Kim commented on ASTERIXDB-1208:
> > >> ---------------------------------------
> > >>
> > >> This error happens because the current tokenizer always assumes that it
> > >> sees a UTF8 string. In this case, it sees a NULL value. We need to add
> > >> logic to bypass tokenization when a NULL value is provided.
> > >>
> > >> > ngram tokenizer failure with negative length
> > >> > --------------------------------------------
> > >> >
> > >> >                 Key: ASTERIXDB-1208
> > >> >                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1208
> > >> >             Project: Apache AsterixDB
> > >> >          Issue Type: Bug
> > >> >          Components: Hyracks Core
> > >> >            Reporter: Wenhai
> > >> >            Assignee: Taewoo Kim
> > >> >
> > >> > drop dataverse test if exists;
> > >> > create dataverse test;
> > >> > use dataverse test;
> > >> > create type DBLPOpenType as open {
> > >> >   id: int64,
> > >> >   dblpid: string,
> > >> >   authors: string,
> > >> >   misc: string
> > >> > }
> > >> > create dataset DBLPOpen(DBLPOpenType) primary key id;
> > >> > insert into dataset DBLPOpen { "id": 93, "dblpid": "journals/iandc/IbarraJCR91", "authors": "Some Classes of Languages in NC¹", "misc": "2006-04-25 86-106 Inf. Comput. January 1991 90 1 db/journals/iandc/iandc90.html#IbarraJCR91" }
> > >> > use dataverse test;
> > >> > set import-private-functions 'true'
> > >> > for $d in dataset DBLPOpen
> > >> > where similarity-jaccard(gram-tokens("",3,false),gram-tokens($d.title,3,false)) >= 0.5
> > >> > return {"rec": $d}
> > >>
> > >>
> > >>
> > >> --
> > >> This message was sent by Atlassian JIRA
> > >> (v6.3.4#6332)
> > >>
> >
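
For readers following the thread, here is a minimal standalone sketch of the gram-count arithmetic used in the quoted reset(). It is not the AsterixDB implementation: the class GramCountSketch and its helpers utf8CharSize, countChars and totalGrams are hypothetical names made up for illustration, the input is a plain Java String encoded as standard UTF-8 instead of a Hyracks byte frame, and the guard compares the character count (not the byte length) with the gram length.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names exist in AsterixDB.
final class GramCountSketch {

    // Size in bytes of the UTF-8 sequence starting with this lead byte.
    // Assumes well-formed input and does no validation.
    static int utf8CharSize(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) {
            return 1; // 0xxxxxxx: ASCII
        }
        if ((b & 0xE0) == 0xC0) {
            return 2; // 110xxxxx
        }
        if ((b & 0xF0) == 0xE0) {
            return 3; // 1110xxxx
        }
        return 4;     // 11110xxx
    }

    // Walks the byte range and counts encoded characters rather than bytes.
    static int countChars(byte[] data, int start, int byteLength) {
        int numChars = 0;
        int pos = start;
        int end = start + byteLength;
        while (pos < end) {
            numChars++;
            pos += utf8CharSize(data[pos]);
        }
        return numChars;
    }

    // Number of n-grams for the text, mirroring the two branches in the thread.
    static int totalGrams(String text, int gramLength, boolean usePrePost) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        int numChars = countChars(utf8, 0, utf8.length);
        if (usePrePost) {
            // Padding on both sides adds gramLength - 1 extra grams.
            return numChars + gramLength - 1;
        }
        // Without the guard, numChars - gramLength + 1 is negative for
        // strings shorter than gramLength - 1 characters (for example, empty ones).
        return numChars >= gramLength ? numChars - gramLength + 1 : 0;
    }

    public static void main(String[] args) {
        System.out.println(totalGrams("", 3, false));         // 0
        System.out.println(totalGrams("abcde", 3, false));    // 3
        System.out.println(totalGrams("abcde", 3, true));     // 7
        System.out.println(totalGrams("\u00e9", 3, false));   // 0 (one 2-byte char)
        System.out.println(totalGrams("NC\u00b9", 3, false)); // 1 (three chars, four bytes)
    }
}
```

The empty-string case corresponds to the gram-tokens("",3,false) call in the reported query: the unguarded expression would evaluate to 0 - 3 + 1 = -2, while the guarded version yields 0. This sketch does not cover the NULL-value case, which the thread says needs a type-tag check at the Asterix level.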