> -----Original Message-----
> From: "Taewoo Kim" <[email protected]>
> Sent: Friday, December 4, 2015
> To: [email protected]
> Cc:
> Subject: Re: [jira] [Commented] (ASTERIXDB-1208) ngram tokenizer failure with
> negative length
>
> Yes. All we need to do is change one method. However, this is a temporary
> fix. I will investigate further once I have more time.
>
> On Thu, Dec 3, 2015 at 08:34 Chen Li <[email protected]> wrote:
>
> > Thanks, Taewoo. Do you think it's easier to apply these changes
> > directly to Wenhai's "fuzzy branch"?
> >
> >
> > On Thu, Dec 3, 2015 at 5:51 AM, Taewoo Kim <[email protected]> wrote:
> > > @Wenhai:
> > >
> > > Replace NGramUTF8StringBinaryTokenizer.reset() with the following code as a
> > > quick temporary fix. The general fix is to move this tokenizer to the
> > > Asterix level, where it can recognize the NULL type tag and skip
> > > token generation.
> > >
> > > @Override
> > > public void reset(byte[] sentenceData, int start, int length) {
> > >     super.reset(sentenceData, start, length);
> > >     gramNum = 0;
> > >
> > >     // Count the characters (not bytes) in the UTF-8 string.
> > >     int numChars = 0;
> > >     int pos = byteIndex;
> > >     int end = pos + sentenceUtf8Length;
> > >     while (pos < end) {
> > >         numChars++;
> > >         pos += UTF8StringUtil.charSize(sentenceData, pos);
> > >     }
> > >
> > >     if (usePrePost) {
> > >         totalGrams = numChars + gramLength - 1;
> > >     } else {
> > >         // Compare the character count, not the byte length, so that
> > >         // totalGrams can never become negative.
> > >         if (numChars >= gramLength) {
> > >             totalGrams = numChars - gramLength + 1;
> > >         } else {
> > >             totalGrams = 0;
> > >         }
> > >     }
> > > }
> > >
> > > Best,
> > > Taewoo
> > >
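The gram-count arithmetic in the reset() fix above can be tried outside Hyracks with a small standalone sketch. Everything here is an assumption made for illustration: the class name GramCountSketch, the local charSize() helper (a stand-in for UTF8StringUtil.charSize that handles standard UTF-8 lead bytes only), and the use of a plain Java String with no length prefix. It guards on the character count, so short inputs yield zero grams rather than a negative number.

```java
import java.nio.charset.StandardCharsets;

public class GramCountSketch {
    /** Byte size of the UTF-8 character whose lead byte is data[pos]. */
    static int charSize(byte[] data, int pos) {
        int b = data[pos] & 0xFF;
        if (b < 0x80) return 1;           // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        return 4;                         // 11110xxx
    }

    /** Walks the byte range once and counts characters, as the loop in reset() does. */
    static int countChars(byte[] data, int start, int utf8Length) {
        int numChars = 0;
        int pos = start;
        int end = start + utf8Length;
        while (pos < end) {
            numChars++;
            pos += charSize(data, pos);
        }
        return numChars;
    }

    /** Number of n-grams for s, with or without pre/post padding. */
    static int totalGrams(String s, int gramLength, boolean usePrePost) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int numChars = countChars(bytes, 0, bytes.length);
        if (usePrePost) {
            return numChars + gramLength - 1;
        }
        // Guard on characters, not bytes, so the result is never negative.
        return numChars >= gramLength ? numChars - gramLength + 1 : 0;
    }

    public static void main(String[] args) {
        System.out.println(totalGrams("", 3, false));    // 0
        System.out.println(totalGrams("ab", 3, false));  // 0
        System.out.println(totalGrams("NC¹", 3, false)); // 1 (3 characters, 4 bytes)
        System.out.println(totalGrams("abc", 3, true));  // 5
    }
}
```

"NC¹" is the interesting case: it is three characters but four bytes, so a byte-based comparison and a character-based one can disagree.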
> > > On Tue, Dec 1, 2015 at 7:37 PM, Taewoo Kim (JIRA) <[email protected]> wrote:
> > >
> > >>
> > >> [ https://issues.apache.org/jira/browse/ASTERIXDB-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035204#comment-15035204 ]
> > >>
> > >> Taewoo Kim commented on ASTERIXDB-1208:
> > >> ---------------------------------------
> > >>
> > >> This error happens because the current tokenizer always assumes that it
> > >> sees a UTF8 string. In this case, it sees a NULL value. We need to add
> > >> logic that bypasses tokenization when a NULL value is provided.
> > >>
> > >> > ngram tokenizer failure with negative length
> > >> > --------------------------------------------
> > >> >
> > >> > Key: ASTERIXDB-1208
> > >> > URL: https://issues.apache.org/jira/browse/ASTERIXDB-1208
> > >> > Project: Apache AsterixDB
> > >> > Issue Type: Bug
> > >> > Components: Hyracks Core
> > >> > Reporter: Wenhai
> > >> > Assignee: Taewoo Kim
> > >> >
> > >> > drop dataverse test if exists;
> > >> > create dataverse test;
> > >> > use dataverse test;
> > >> > create type DBLPOpenType as open {
> > >> > id: int64,
> > >> > dblpid: string,
> > >> > authors: string,
> > >> > misc: string
> > >> > }
> > >> > create dataset DBLPOpen(DBLPOpenType) primary key id;
> > >> > insert into dataset DBLPOpen { "id": 93,
> > >> >   "dblpid": "journals/iandc/IbarraJCR91",
> > >> >   "authors": "Some Classes of Languages in NC¹",
> > >> >   "misc": "2006-04-25 86-106 Inf. Comput. January 1991 90 1 db/journals/iandc/iandc90.html#IbarraJCR91" }
> > >> > use dataverse test;
> > >> > set import-private-functions 'true'
> > >> > for $d in dataset DBLPOpen
> > >> > where similarity-jaccard(gram-tokens("",3,false),gram-tokens($d.title,3,false)) >= 0.5
> > >> > return {"rec": $d}
> > >>
> > >>
> > >>
> > >>
> >