Markus: I don't believe that payloads are limited in size at all. LUCENE-7705 was done in part because there _was_ a hard-coded 256 limit for some of the tokenizers. A payload (at least in recent versions) is just some bytes attached to a position, and (with LUCENE-7705) it can be arbitrarily long.
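A minimal sketch of reading those raw bytes back out, against the Lucene 6.x API current at the time of this thread; the field name and term are illustrative only, not anything from the thread:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;

    // Walk every posting of one term and read the payload stored at each position.
    static void dumpPayloads(IndexReader reader) throws IOException {
      Terms terms = MultiFields.getTerms(reader, "body");   // "body" is illustrative
      if (terms == null) return;
      TermsEnum te = terms.iterator();
      if (te.seekExact(new BytesRef("example"))) {          // so is the term
        PostingsEnum pe = te.postings(null, PostingsEnum.PAYLOADS);
        while (pe.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
          for (int i = 0; i < pe.freq(); i++) {
            pe.nextPosition();
            BytesRef payload = pe.getPayload();   // raw bytes, any length
            // decode however you like; Lucene assigns no meaning to these bytes
          }
        }
      }
    }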
Of course, if you put anything other than a number in there, you have to provide your own decoders and the like to make sense of your payload...

Best,
Erick (Erickson, not Hatcher)

On Wed, Jun 14, 2017 at 2:22 PM, Markus Jelsma <[email protected]> wrote:
> Hello Erik,
>
> Using Solr, though most of the moving parts are Lucene, we have a CharFilter
> that adds treebank tags to each whitespace-delimited word using a delimiter;
> further on, a TokenFilter receives these tokens with the delimiter and the
> POS-tag. It won't work with some Tokenizers, and if you put it before WDF,
> it'll split, as you know. That TokenFilter is configured with a tab-delimited
> mapping config containing <POS-tag>\t<bitset>, and the bitset is what gets
> encoded as the payload.
>
> Our edismax extension rewrites queries to payload-supported equivalents,
> which is quite trivial, except for all those API changes in Lucene you have
> to put up with. Finally, there is a BM25 extension that has, amongst other
> things, a mapping of bitset to score. Nouns get a bonus; prepositions and
> other useless pieces get a punishment, etc.
>
> Payloads are really great things to use! We also use them to distinguish
> between compounds and their subwords (among others, we serve Dutch- and
> German-speaking countries), and between stemmed and non-stemmed words.
> Although the latter also benefit from IDF statistics, payloads just help to
> control boosting more precisely, regardless of your corpus.
>
> I still need to take a look at your recent payload QParsers for Solr and see
> how different, and probably better, they are compared to our older
> implementations. Although we don't use a PayloadTermQParser equivalent for
> regular search, we do use it for scoring recommendations via delimited
> multi-valued fields. Payloads are versatile!
>
> The downside of payloads is that they are limited to 8 bits. Although we can
> easily fit our reduced treebank in there, we also use single bits to signal
> compound/subword, stemmed/unstemmed, and some others.
>
> Hope this helps.
>
> Regards,
> Markus
>
> -----Original message-----
>> From: Erik Hatcher <[email protected]>
>> Sent: Wednesday 14th June 2017 23:03
>> To: [email protected]
>> Subject: Re: Using POS payloads for chunking
>>
>> Markus - how are you encoding payloads as bitsets and using them for
>> scoring? Curious to see how folks are leveraging them.
>>
>> Erik
>>
>> > On Jun 14, 2017, at 4:45 PM, Markus Jelsma <[email protected]>
>> > wrote:
>> >
>> > Hello,
>> >
>> > We use POS-tagging too, and encode the tags as payload bitsets for
>> > scoring, which is, as far as I know, the only possibility with payloads.
>> >
>> > So, instead of encoding them as payloads, why not index your treebank's
>> > POS-tags as tokens at the same position, like synonyms? If you do that,
>> > you can use span and phrase queries to find chunks of multiple POS-tags.
>> >
>> > This would be the first approach I can think of. Treating them as regular
>> > tokens enables you to use regular search for them.
>> >
>> > Regards,
>> > Markus
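A minimal sketch of the span approach Markus suggests, assuming the POS tags have been indexed at the same positions as the words; the field name and tag set are illustrative:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // "JJ NN": an adjective position immediately followed by a noun position.
    // Because the tag tokens share positions with the word tokens, this matches
    // any adjective-noun chunk regardless of the underlying words.
    SpanQuery jj = new SpanTermQuery(new Term("body", "JJ"));
    SpanQuery nn = new SpanTermQuery(new Term("body", "NN"));
    SpanQuery chunk = new SpanNearQuery(new SpanQuery[] { jj, nn }, 0, true);

Note that spans have no Kleene closure, so a pattern element like (JJ)* would have to be expanded into a SpanOrQuery over a bounded number of repetitions; full regular-expression chunking over tags is not something the span API gives you for free.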
>> >
>> > -----Original message-----
>> >> From: José Tomás Atria <[email protected]>
>> >> Sent: Wednesday 14th June 2017 22:29
>> >> To: [email protected]
>> >> Subject: Using POS payloads for chunking
>> >>
>> >> Hello!
>> >>
>> >> I'm not particularly familiar with Lucene's search API (I've been using
>> >> the library mostly as a dumb index rather than a search engine), but I
>> >> am almost certain that, using its payload capabilities, it would be
>> >> trivial to implement a regular chunker that looks for patterns in
>> >> sequences of payloads.
>> >>
>> >> (Trying not to be too pedantic: a regular chunker looks for 'chunks'
>> >> based on part-of-speech tags. E.g., noun phrases can be searched for
>> >> with patterns like "(DT)?(JJ)*(NN|NP)+", that is, an optional determiner
>> >> and zero or more adjectives preceding a bunch of nouns, etc.)
>> >>
>> >> Assuming my index has POS tags encoded as payloads for each position,
>> >> how would one search for such patterns, irrespective of terms? I started
>> >> studying the span search API, as this seemed like the natural place to
>> >> start, but I quickly got lost.
>> >>
>> >> Any tips would be extremely appreciated. (Or references to this kind of
>> >> thing; I'm sure someone must have tried something similar before...)
>> >>
>> >> thanks!
>> >> ~jta
>> >> --
>> >> sent from a phone. please excuse terseness and typos.
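José's question assumes the tags are already sitting in the payloads; the writing half of that, along the lines Markus describes, would look roughly like the TokenFilter below. The '/' delimiter, the one-byte bitset, and all names are assumptions for illustration, not Markus's actual implementation:

    import java.io.IOException;
    import java.util.Map;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.util.BytesRef;

    // Strips a trailing "/TAG" off each token and attaches the mapped bitset
    // as a one-byte payload, leaving the bare word as the indexed term.
    public final class PosPayloadFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);
      private final Map<String, Byte> tagToBits;  // from the tab-delimited config

      public PosPayloadFilter(TokenStream in, Map<String, Byte> tagToBits) {
        super(in);
        this.tagToBits = tagToBits;
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        String term = termAtt.toString();
        int slash = term.lastIndexOf('/');        // the delimiter is an assumption
        if (slash >= 0) {
          Byte bits = tagToBits.get(term.substring(slash + 1));
          if (bits != null) {
            payAtt.setPayload(new BytesRef(new byte[] { bits }));
          }
          termAtt.setLength(slash);               // drop the tag from the term
        }
        return true;
      }
    }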

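And the reading half: Markus's BM25 extension maps the bitset back to a score. A sketch of such a mapping, with an invented bit layout and invented boosts:

    import org.apache.lucene.util.BytesRef;

    // Invented one-byte layout: bit 0 = noun, bit 2 = preposition,
    // bit 6 = subword, bit 7 = stemmed. The boost values are equally invented.
    static float payloadBoost(BytesRef payload) {
      if (payload == null || payload.length == 0) return 1f;  // untagged: neutral
      byte bits = payload.bytes[payload.offset];
      float boost = 1f;
      if ((bits & 0x01) != 0) boost *= 1.5f;  // nouns get a bonus
      if ((bits & 0x04) != 0) boost *= 0.5f;  // prepositions get a punishment
      if ((bits & 0x40) != 0) boost *= 0.8f;  // subwords count less than compounds
      return boost;
    }

In the Lucene of that era, a function like this would typically be wired in through a custom PayloadFunction on a PayloadScoreQuery, or through a Similarity that consults the payload during scoring.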