Markus: I don't believe that payloads are limited in size at all. LUCENE-7705 was done in part because there _was_ a hard-coded 256 limit for some of the tokenizers. A payload (at least in recent versions) is just some bytes attached to a position, and (with LUCENE-7705) it can be arbitrarily long.
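A minimal sketch of reading those raw bytes back out, against the Lucene 6.x API current at the time of this thread; the field name and term are illustrative only, not anything from the thread:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;

    // Walk every posting of one term and read the payload stored at each position.
    static void dumpPayloads(IndexReader reader) throws IOException {
      Terms terms = MultiFields.getTerms(reader, "body");   // "body" is illustrative
      if (terms == null) return;
      TermsEnum te = terms.iterator();
      if (te.seekExact(new BytesRef("example"))) {          // so is the term
        PostingsEnum pe = te.postings(null, PostingsEnum.PAYLOADS);
        while (pe.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
          for (int i = 0; i < pe.freq(); i++) {
            pe.nextPosition();
            BytesRef payload = pe.getPayload();   // raw bytes, any length
            // decode however you like; Lucene assigns no meaning to these bytes
          }
        }
      }
    }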
Of course, if you put anything other than a number in there, you have to provide your own decoders and the like to make sense of your payload...

Best,
Erick (Erickson, not Hatcher)

On Wed, Jun 14, 2017 at 2:22 PM, Markus Jelsma <[email protected]> wrote:
> Hello Erik,
>
> Using Solr, though most of the moving parts are Lucene, we have a CharFilter
> that adds treebank tags to each whitespace-delimited word using a delimiter;
> further on, a TokenFilter receives these tokens with the delimiter and the
> POS-tag. It won't work with some Tokenizers, and if you put it before WDF,
> it'll split, as you know. That TokenFilter is configured with a tab-delimited
> mapping config containing <POS-tag>\t<bitset>, and the bitset is what gets
> encoded as the payload.
>
> Our edismax extension rewrites queries to payload-supported equivalents,
> which is quite trivial, except for all those API changes in Lucene you have
> to put up with. Finally, there is a BM25 extension that has, amongst other
> things, a mapping of bitset to score. Nouns get a bonus; prepositions and
> other useless pieces get a punishment, etc.
>
> Payloads are really great things to use! We also use them to distinguish
> between compounds and their subwords (among others, we serve Dutch- and
> German-speaking countries), and between stemmed and non-stemmed words.
> Although the latter also benefit from IDF statistics, payloads just help to
> control boosting more precisely, regardless of your corpus.
>
> I still need to take a look at your recent payload QParsers for Solr and see
> how different, and probably better, they are compared to our older
> implementations. Although we don't use a PayloadTermQParser equivalent for
> regular search, we do use it for scoring recommendations via delimited
> multi-valued fields. Payloads are versatile!
>
> The downside of payloads is that they are limited to 8 bits. Although we can
> easily fit our reduced treebank in there, we also use single bits to signal
> compound/subword, stemmed/unstemmed, and some others.
>
> Hope this helps.
>
> Regards,
> Markus
>
> -----Original message-----
>> From: Erik Hatcher <[email protected]>
>> Sent: Wednesday 14th June 2017 23:03
>> To: [email protected]
>> Subject: Re: Using POS payloads for chunking
>>
>> Markus - how are you encoding payloads as bitsets and using them for
>> scoring? Curious to see how folks are leveraging them.
>>
>> Erik
>>
>> > On Jun 14, 2017, at 4:45 PM, Markus Jelsma <[email protected]>
>> > wrote:
>> >
>> > Hello,
>> >
>> > We use POS-tagging too, and encode the tags as payload bitsets for
>> > scoring, which is, as far as I know, the only possibility with payloads.
>> >
>> > So, instead of encoding them as payloads, why not index your treebank's
>> > POS-tags as tokens at the same position, like synonyms? If you do that,
>> > you can use span and phrase queries to find chunks of multiple POS-tags.
>> >
>> > This would be the first approach I can think of. Treating them as regular
>> > tokens enables you to use regular search for them.
>> >
>> > Regards,
>> > Markus
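A minimal sketch of the span approach Markus suggests, assuming the POS tags have been indexed at the same positions as the words; the field name and tag set are illustrative:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // "JJ NN": an adjective position immediately followed by a noun position.
    // Because the tag tokens share positions with the word tokens, this matches
    // any adjective-noun chunk regardless of the underlying words.
    SpanQuery jj = new SpanTermQuery(new Term("body", "JJ"));
    SpanQuery nn = new SpanTermQuery(new Term("body", "NN"));
    SpanQuery chunk = new SpanNearQuery(new SpanQuery[] { jj, nn }, 0, true);

Note that spans have no Kleene closure, so a pattern element like (JJ)* would have to be expanded into a SpanOrQuery over a bounded number of repetitions; full regular-expression chunking over tags is not something the span API gives you for free.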
>> >
>> > -----Original message-----
>> >> From: José Tomás Atria <[email protected]>
>> >> Sent: Wednesday 14th June 2017 22:29
>> >> To: [email protected]
>> >> Subject: Using POS payloads for chunking
>> >>
>> >> Hello!
>> >>
>> >> I'm not particularly familiar with Lucene's search API (I've been using
>> >> the library mostly as a dumb index rather than a search engine), but I
>> >> am almost certain that, using its payload capabilities, it would be
>> >> trivial to implement a regular chunker that looks for patterns in
>> >> sequences of payloads.
>> >>
>> >> (Trying not to be too pedantic: a regular chunker looks for 'chunks'
>> >> based on part-of-speech tags. E.g., noun phrases can be searched for
>> >> with patterns like "(DT)?(JJ)*(NN|NP)+", that is, an optional determiner
>> >> and zero or more adjectives preceding a bunch of nouns, etc.)
>> >>
>> >> Assuming my index has POS tags encoded as payloads for each position,
>> >> how would one search for such patterns, irrespective of terms? I started
>> >> studying the span search API, as this seemed like the natural place to
>> >> start, but I quickly got lost.
>> >>
>> >> Any tips would be extremely appreciated. (Or references to this kind of
>> >> thing; I'm sure someone must have tried something similar before...)
>> >>
>> >> thanks!
>> >> ~jta
>> >> --
>> >> sent from a phone. please excuse terseness and typos.
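José's question assumes the tags are already sitting in the payloads; the writing half of that, along the lines Markus describes, would look roughly like the TokenFilter below. The '/' delimiter, the one-byte bitset, and all names are assumptions for illustration, not Markus's actual implementation:

    import java.io.IOException;
    import java.util.Map;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.util.BytesRef;

    // Strips a trailing "/TAG" off each token and attaches the mapped bitset
    // as a one-byte payload, leaving the bare word as the indexed term.
    public final class PosPayloadFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);
      private final Map<String, Byte> tagToBits;  // from the tab-delimited config

      public PosPayloadFilter(TokenStream in, Map<String, Byte> tagToBits) {
        super(in);
        this.tagToBits = tagToBits;
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        String term = termAtt.toString();
        int slash = term.lastIndexOf('/');        // the delimiter is an assumption
        if (slash >= 0) {
          Byte bits = tagToBits.get(term.substring(slash + 1));
          if (bits != null) {
            payAtt.setPayload(new BytesRef(new byte[] { bits }));
          }
          termAtt.setLength(slash);               // drop the tag from the term
        }
        return true;
      }
    }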

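And the reading half: Markus's BM25 extension maps the bitset back to a score. A sketch of such a mapping, with an invented bit layout and invented boosts:

    import org.apache.lucene.util.BytesRef;

    // Invented one-byte layout: bit 0 = noun, bit 2 = preposition,
    // bit 6 = subword, bit 7 = stemmed. The boost values are equally invented.
    static float payloadBoost(BytesRef payload) {
      if (payload == null || payload.length == 0) return 1f;  // untagged: neutral
      byte bits = payload.bytes[payload.offset];
      float boost = 1f;
      if ((bits & 0x01) != 0) boost *= 1.5f;  // nouns get a bonus
      if ((bits & 0x04) != 0) boost *= 0.5f;  // prepositions get a punishment
      if ((bits & 0x40) != 0) boost *= 0.8f;  // subwords count less than compounds
      return boost;
    }

In the Lucene of that era, a function like this would typically be wired in through a custom PayloadFunction on a PayloadScoreQuery, or through a Similarity that consults the payload during scoring.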