Re: Part of speech search with lucene

David Villarejo Tue, 03 Mar 2015 13:42:57 -0800

What you propose is good if you want to index only the pos of a token. But
I want to index some extra info, such as the "lemma" of a token, its phonetic
encoding, etc. Sorry, I was not general enough in my previous post.
Imagine you want to ask this:


an adj whose lemma is "quick" followed by "brown" followed by a noun whose
phonetic encoding is "fots".

So the main problem is that you cannot ask whether several "synonyms" exist
at the same position.
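
To make the requirement concrete, here is a minimal sketch in plain Python
(not Lucene, and not a proposed solution). It assumes a toy positional index
in which every attribute of a token is stored as its own term at the token's
position, so that one query slot can demand several terms at the same
position. The names build_index, slot_positions and match_sequence are
hypothetical, and the attribute names "word" and "phon" are only labels chosen
for this sketch.

```python
from collections import defaultdict

# Toy data: one dict of attributes per token position.
# The "phon" value is the one used in the example query above.
TOKENS = [
    {"word": "the", "pos": "DT", "lemma": "the"},
    {"word": "quick", "pos": "JJ", "lemma": "quick"},
    {"word": "brown", "pos": "JJ", "lemma": "brown"},
    {"word": "fox", "pos": "NN", "lemma": "fox", "phon": "fots"},
]


def build_index(tokens):
    """Map each (attribute, value) term to the set of positions it occurs at."""
    index = defaultdict(set)
    for position, attrs in enumerate(tokens):
        for name, value in attrs.items():
            index[(name, value)].add(position)
    return index


def slot_positions(index, slot):
    """Positions where ALL (attribute, value) terms of one query slot co-occur."""
    position_sets = [index.get(term, set()) for term in slot.items()]
    # An empty slot matches nothing in this sketch.
    return set.intersection(*position_sets) if position_sets else set()


def match_sequence(index, slots):
    """Start positions where the slots match at consecutive positions."""
    if not slots:
        return []
    starts = []
    for start in sorted(slot_positions(index, slots[0])):
        if all(start + offset in slot_positions(index, slot)
               for offset, slot in enumerate(slots)):
            starts.append(start)
    return starts


index = build_index(TOKENS)
# "an adj whose lemma is quick" + "brown" + "a noun whose phonetic encoding is fots"
query = [
    {"pos": "JJ", "lemma": "quick"},
    {"word": "brown"},
    {"pos": "NN", "phon": "fots"},
]
print(match_sequence(index, query))  # [1]
```

The point of the sketch is the set intersection in slot_positions: that is the
"several synonyms at the same position" test I do not know how to express.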

Thank you Michael for your answer.

2015-03-03 20:52 GMT+01:00 Michael Sokolov <msoko...@safaribooksonline.com>:

> What if you indexed every word with two synonyms: the plain unadorned word
> and a token formed by concatenating the pos and the word with some unusual
> separator character?
>
> For example, "the quick brown fox" would be:
>
> { the | article:the } {quick | adj:quick } { brown | adj:brown } { fox |
> noun:fox }
>
> with punctuation to suggest the token graph
>
> -Mike
>
>
> On 03/03/2015 01:21 PM, David Villarejo wrote:
>
>> After many Google searches I decided to post my problem here, hoping that
>> someone can help me. What I want to achieve is to perform queries like the
>> following (don't worry about the query format):
>>
>> q1: (adjective) "jumps" (preposition) // any adj followed by "jumps"
>> followed by any prep.
>> q2: (adjective:brown) "jumps" (preposition) // brown as adj. followed by
>> "jumps" followed by any prep.
>> q3: (adjective:brown) (verb:jumps) (preposition) // brown as adj followed
>> by jumps as verb followed by any preposition.
>>
>> In a more general form, what I want is
>> (POS[:specific_word]) (POS[:specific_word]) (POS[:specific_word])
>>
>> For that, I have the text tagged as follows:
>>
>> the|[pos:DT][lemma:the] quick|[pos:JJ][lemma:quick]
>> brown|[pos:JJ][lemma:brown] fox|[pos:NN][lemma:fox]
>> jumps|[pos:NNS][lemma:jump] over|[pos:IN][lemma:over]
>> the|[pos:DT][lemma:the] lazy|[pos:JJ][lemma:lazy] dog|[pos:NN][lemma:dog]
>>
>> The first thing I thought of was to index the extra info of each term as a
>> payload and then use PayloadNearQuery to access the payload of each span.
>> The problem is that PayloadNearQuery matches the terms first and only then
>> accesses their payloads, so none of the 3 queries above will work. (Correct
>> me if I'm wrong.)
>>
>> The second thing I thought of was to index the extra info as synonyms of the
>> term but, this way, the second query won't work, since I can't ask whether
>> the first term is an adj and the specific word "brown" simultaneously.
>>
>> Any way to address this problem, suggestions, etc. will be appreciated.
>>
>>
>> David.
>>
>>
>
