RE: Using POS payloads for chunking

Markus Jelsma Wed, 14 Jun 2017 14:33:20 -0700

Hello Erick, no worries, I recognize you two.

I will take a look at your references tomorrow. Although I am still fine with 
eight bits, I have only one bit left to spare. If Lucene allows us to pass 
longer bitsets in the BytesRef, that would be awesome and easy to encode.
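
A minimal sketch of what a wider bitset could look like, assuming the flags are packed into a plain byte array with java.util.BitSet. The class name and the bit assignments (NOUN, COMPOUND, STEMMED) are hypothetical and not from this thread. In Lucene the resulting array would presumably be wrapped in a BytesRef and set on the token's PayloadAttribute; that step is left out so the example runs on the JDK alone.

```java
import java.util.BitSet;

final class WidePayloadBits {
    // Hypothetical bit assignments, chosen only for illustration.
    static final int NOUN = 0;      // fits in the first byte
    static final int COMPOUND = 8;  // first bit of a second byte
    static final int STEMMED = 9;

    /** Packs the given bit indexes into a little-endian byte array. */
    static byte[] encode(int... setBits) {
        BitSet bits = new BitSet();
        for (int bit : setBits) {
            bits.set(bit);
        }
        // toByteArray() omits trailing all-zero bytes, so the payload stays short.
        return bits.toByteArray();
    }

    /** Tests a single bit in a payload produced by encode(). */
    static boolean isSet(byte[] payload, int bit) {
        return BitSet.valueOf(payload).get(bit);
    }
}
```

With this layout, a token that only needs flags from the first eight bits still costs a single byte, and a second byte appears only when a higher bit is set.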


Thanks!
Markus
 
-----Original message-----
> From:Erick Erickson <[email protected]>
> Sent: Wednesday 14th June 2017 23:29
> To: java-user <[email protected]>
> Subject: Re: Using POS payloads for chunking
> 
> Markus:
> 
> I don't believe that payloads are limited in size at all. LUCENE-7705
> was done in part because there _was_ a hard-coded 256 limit for some
> of the tokenizers. A payload (at least in recent versions) is just
> some bytes attached to a token, and (with LUCENE-7705) can be
> arbitrarily long.
> 
> Of course if you put anything other than a number in there you have to
> provide your own decoders and the like to make sense of your
> payload....
> 
> Best,
> Erick (Erickson, not Hatcher)
> 
> On Wed, Jun 14, 2017 at 2:22 PM, Markus Jelsma
> <[email protected]> wrote:
> > Hello Erik,
> >
> > Using Solr, although most of the parts are actually Lucene, we have a 
> > CharFilter that adds treebank tags to whitespace-delimited words using a 
> > delimiter. Further down the chain, a TokenFilter receives these tokens 
> > with the delimiter and the POS tag. This won't work with some Tokenizers, 
> > and the filter has to come before WDF, which will otherwise split the 
> > token, as you know. The TokenFilter is configured with a tab-delimited 
> > mapping file containing <POS-tag>\t<bitset>, and there the bitset is 
> > encoded as the payload.
> >
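A rough sketch of reading a tab-delimited <POS-tag>\t<bitset> mapping like the one described above. The thread does not say how the bitset column is written, so this assumes a string of 0s and 1s that fits in one byte; the class and method names are hypothetical, and the real filter's configuration format may differ.

```java
import java.util.HashMap;
import java.util.Map;

final class PosMappingSketch {
    /** Parses lines of the form TAG<TAB>BITS, e.g. "NN\t00000001". */
    static Map<String, Byte> parse(String config) {
        Map<String, Byte> mapping = new HashMap<>();
        for (String line : config.split("\n")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected TAG<TAB>BITS: " + line);
            }
            // Assumption: the bitset column is binary text, at most 8 bits wide.
            int value = Integer.parseInt(parts[1].trim(), 2);
            if (value > 0xFF) {
                throw new IllegalArgumentException("More than 8 bits: " + line);
            }
            mapping.put(parts[0].trim(), (byte) value);
        }
        return mapping;
    }
}
```

A token filter could then look up the tag it finds after the delimiter and attach the mapped byte as that token's payload.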
> > Our edismax extension rewrites queries to payload-supporting equivalents. 
> > This is quite trivial, except for all the API changes in Lucene you have 
> > to put up with. Finally, there is a BM25 extension that has, among other 
> > things, a mapping of bitset to score. Nouns get a bonus; prepositions and 
> > other less useful parts of speech get a penalty, etc.
> >
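The bitset-to-score mapping mentioned above could be sketched as follows. The flag values and boost factors here are invented for illustration and are not the values used in the setup described; in a real similarity extension the payload byte would come from the postings rather than a method argument.

```java
final class PayloadBoostSketch {
    // Hypothetical flag bits within a one-byte payload.
    static final int NOUN = 0x01;
    static final int PREPOSITION = 0x02;
    static final int STEMMED = 0x40;

    /** Returns a multiplicative score factor for a one-byte payload. */
    static float boost(byte payload) {
        int bits = payload & 0xFF; // treat the byte as unsigned
        float factor = 1.0f;
        if ((bits & NOUN) != 0) {
            factor *= 1.5f; // nouns get a bonus
        }
        if ((bits & PREPOSITION) != 0) {
            factor *= 0.5f; // prepositions get a penalty
        }
        if ((bits & STEMMED) != 0) {
            factor *= 0.8f; // stemmed forms count slightly less than exact forms
        }
        return factor;
    }
}
```

Because the factors multiply, independent flags such as "noun" and "stemmed" combine without needing an entry for every possible bit pattern.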
> > Payloads are really great things to use! We also use them to distinguish 
> > between compounds and their subwords, since we serve, among others, Dutch- 
> > and German-speaking countries, and between stemmed and non-stemmed words. 
> > Although the latter also benefit from IDF statistics, payloads help to 
> > control boosting more precisely, regardless of your corpus.
> >
> > I still need to take a look at your recent payload QParsers for Solr and 
> > see how different, and probably better, they are compared to our older 
> > implementations. Although we don't use a PayloadTermQParser equivalent for 
> > regular search, we do use one for scoring recommendations via delimited 
> > multi-valued fields. Payloads are versatile!
> >
> > The downside of payloads is that they are limited to 8 bits. Although we 
> > can easily fit our reduced treebank in there, we also use single bits to 
> > signal compound/subword, stemmed/unstemmed, and some other distinctions.
> >
> > Hope this helps.
> >
> > Regards,
> > Markus
> >
> > -----Original message-----
> >> From:Erik Hatcher <[email protected]>
> >> Sent: Wednesday 14th June 2017 23:03
> >> To: [email protected]
> >> Subject: Re: Using POS payloads for chunking
> >>
> >> Markus - how are you encoding payloads as bitsets and using them for 
> >> scoring? Curious to see how folks are leveraging them.
> >>
> >>       Erik
> >>
> >> > On Jun 14, 2017, at 4:45 PM, Markus Jelsma <[email protected]> 
> >> > wrote:
> >> >
> >> > Hello,
> >> >
> >> > We use POS-tagging too, and encode the tags as payload bitsets for 
> >> > scoring, which is, as far as I know, the only possibility with payloads.
> >> >
> >> > So, instead of encoding them as payloads, why not index your treebank 
> >> > POS-tags as tokens on the same position, like synonyms? If you do that, 
> >> > you can use spans and phrase queries to find chunks of multiple POS-tags.
> >> >
> >> > This would be the first approach I can think of. Treating them as 
> >> > regular tokens enables you to use regular search for them.
> >> >
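To make the "tokens on the same position" idea concrete, here is a small in-memory sketch that does not use Lucene at all. It stores each word and its POS tag at the same position in a toy postings map and then runs a naive phrase match over positions. The "pos_" prefix, the class, and both method names are assumptions for illustration; in Lucene the stacking would be done by emitting the tag token with a position increment of 0, and the matching by phrase or span queries.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StackedPosTokens {
    // Hypothetical prefix that keeps tag terms apart from ordinary words.
    static final String TAG_PREFIX = "pos_";

    /** Indexes word i and its tag i at the same position i. */
    static Map<String, List<Integer>> index(String[] words, String[] tags) {
        Map<String, List<Integer>> postings = new HashMap<>();
        for (int pos = 0; pos < words.length; pos++) {
            postings.computeIfAbsent(words[pos], k -> new ArrayList<>()).add(pos);
            postings.computeIfAbsent(TAG_PREFIX + tags[pos], k -> new ArrayList<>()).add(pos);
        }
        return postings;
    }

    /** Returns start positions where terms[k] occurs at position start + k. */
    static List<Integer> phrase(Map<String, List<Integer>> postings, String... terms) {
        List<Integer> starts = new ArrayList<>();
        if (terms.length == 0) {
            return starts;
        }
        for (int start : postings.getOrDefault(terms[0], Collections.<Integer>emptyList())) {
            boolean match = true;
            for (int k = 1; k < terms.length && match; k++) {
                match = postings.getOrDefault(terms[k], Collections.<Integer>emptyList())
                        .contains(start + k);
            }
            if (match) {
                starts.add(start);
            }
        }
        return starts;
    }
}
```

Since words and tags share positions, a phrase can mix them freely, for example an adjective tag followed by a specific word.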
> >> > Regards,
> >> > Markus
> >> >
> >> >
> >> >
> >> > -----Original message-----
> >> >> From:José Tomás Atria <[email protected]>
> >> >> Sent: Wednesday 14th June 2017 22:29
> >> >> To: [email protected]
> >> >> Subject: Using POS payloads for chunking
> >> >>
> >> >> Hello!
> >> >>
> >> >> I'm not particularly familiar with Lucene's search API (as I've been
> >> >> using the library mostly as a dumb index rather than a search engine),
> >> >> but I am almost certain that, using its payload capabilities, it would
> >> >> be trivial to implement a regular chunker to look for patterns in
> >> >> sequences of payloads.
> >> >>
> >> >> (trying not to be too pedantic, a regular chunker looks for 'chunks'
> >> >> based on part-of-speech tags, e.g. noun phrases can be searched for
> >> >> with patterns like "(DT)?(JJ)*(NN|NP)+", that is, an optional
> >> >> determiner and zero or more adjectives preceding a bunch of nouns, etc)
> >> >>
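For reference, the pattern quoted above can be tried on a plain sequence of tags with java.util.regex, outside of any index. This is only a sketch of what a regular chunker computes, not a Lucene query: the class name is hypothetical, tags are assumed to contain no spaces, and each tag is written with a trailing space so that a match can only start and end on tag boundaries.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexChunkerSketch {
    // (DT)?(JJ)*(NN|NP)+ over space-terminated tags; the lookbehind
    // keeps a match from starting in the middle of a longer tag.
    private static final Pattern NOUN_PHRASE =
            Pattern.compile("(?<![^ ])(?:DT )?(?:JJ )*(?:(?:NN|NP) )+");

    /** Returns [start, end) token index pairs for each noun-phrase chunk. */
    static List<int[]> chunks(List<String> tags) {
        StringBuilder text = new StringBuilder();
        Map<Integer, Integer> tokenAtOffset = new HashMap<>();
        for (int i = 0; i < tags.size(); i++) {
            tokenAtOffset.put(text.length(), i); // char offset where tag i starts
            text.append(tags.get(i)).append(' ');
        }
        tokenAtOffset.put(text.length(), tags.size()); // offset just past the last tag

        List<int[]> spans = new ArrayList<>();
        Matcher m = NOUN_PHRASE.matcher(text);
        while (m.find()) {
            spans.add(new int[] {tokenAtOffset.get(m.start()), tokenAtOffset.get(m.end())});
        }
        return spans;
    }
}
```

For the tag sequence DT JJ NN VBZ IN NP NP this yields two chunks, covering tokens 0 to 2 and tokens 5 to 6.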
> >> >> Assuming my index has POS tags encoded as payloads for each position,
> >> >> how would one search for such patterns, irrespective of terms? I
> >> >> started studying the spans search API, as this seemed like the natural
> >> >> place to start, but I quickly got lost.
> >> >>
> >> >> Any tips would be extremely appreciated. (or references to this kind of
> >> >> thing, I'm sure someone must have tried something similar before...)
> >> >>
> >> >> thanks!
> >> >> ~jta
> >> >> --
> >> >>
> >> >> sent from a phone. please excuse terseness and tpyos.
> >> >>
> >> >> enviado desde un teléfono. por favor disculpe la parquedad y los 
> >> >> erroers.
> >> >>
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: [email protected]
> >> > For additional commands, e-mail: [email protected]
> >> >
