Lucene for a linguistic corpus

Igor Shalyminov Sat, 05 Jan 2013 04:37:30 -0800

Hello!

I'm considering Lucene as an engine for linguistic corpus search.


One feature of this search is that each word is treated as ambiguous, i.e., it 
has multiple sets of grammatical annotations (parses), with a fixed maximum of 
8 parses per word.
For example, in the phrase "A man saw an elephant", "saw" has the following 
annotations (let us also say that its position in the index is 1234):

{lemma: see, pos: verb, tense: past}, {lemma: saw, pos: noun, number: singular}

Normally, we index each annotation as an independent feature (i.e., there will 
be posting lists for "lemma", "pos", "number", etc.). The problem is that for 
the query "pos = verb AND number = singular" we DON'T want to find position 
1234, because the two values occur in different parses.
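
To make the problem concrete, here is a minimal sketch in plain Python (not 
Lucene code; the dictionary layout and the function name are made up for 
illustration). It builds one posting list per attribute value and shows that a 
plain intersection of positions returns 1234 even though no single parse 
matches:

```python
from collections import defaultdict

# Toy corpus: position -> list of parses (illustrative data from the example above).
parses_at = {
    1234: [
        {"lemma": "see", "pos": "verb", "tense": "past"},
        {"lemma": "saw", "pos": "noun", "number": "singular"},
    ],
}

def build_naive_index(parses_at):
    """Map (attribute, value) -> set of positions, ignoring which parse it came from."""
    index = defaultdict(set)
    for position, parses in parses_at.items():
        for parse in parses:
            for attr, value in parse.items():
                index[(attr, value)].add(position)
    return index

naive_index = build_naive_index(parses_at)

# "pos = verb AND number = singular" as a plain intersection of positions.
hits = naive_index[("pos", "verb")] & naive_index[("number", "singular")]
print(sorted(hits))  # [1234], an unwanted hit: the values come from different parses
```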

As a solution one may consider indexing all annotation subsets (this would 
increase index size and query complexity), searching with regexps (but the 
search would be dead slow), or indexing parses rather than words (but queries 
with a given distance between words would break). None of these solutions is 
acceptable.
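
As a rough illustration of the index growth in the first option, the sketch 
below (plain Python; the "attr=value&attr=value" key format is only an 
assumption for the example) enumerates every non-empty subset of one parse. A 
parse with k attributes produces 2^k - 1 index terms:

```python
from itertools import combinations

# One parse of "saw" from the example above.
parse = {"lemma": "see", "pos": "verb", "tense": "past"}

def subset_keys(parse):
    """Return one index term per non-empty subset of the parse's attributes."""
    items = sorted(parse.items())
    keys = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            keys.append("&".join(f"{attr}={value}" for attr, value in combo))
    return keys

keys = subset_keys(parse)
print(len(keys))  # 7 terms (2**3 - 1) for a single parse with 3 attributes
```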

I think it would be more effective to store the parse index in each attribute's 
posting list entry as a payload and use it at the intersection stage. E.g., we 
have a posting list for 'pos = verb' like ...|...|1.1234|...|..., and a posting 
list for 'number = singular' like ...|...|2.1234|...|... While processing a 
query like 'pos = verb AND number = singular', the 'x.1234' entries are 
accepted at every stage of posting list processing until the intersection 
stage, where they are rejected because their parse indexes do not match.
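
The idea can be sketched outside of Lucene as follows. This is plain Python 
with hypothetical names (build_index, search_and), not Lucene API; each 
posting keeps the set of parse indexes next to the position, and the AND step 
accepts a position only if some parse index is shared by all terms:

```python
from collections import defaultdict

# Toy corpus: position -> list of parses (parse indexes are 1-based, as in "1.1234").
parses_at = {
    1234: [
        {"lemma": "see", "pos": "verb", "tense": "past"},
        {"lemma": "saw", "pos": "noun", "number": "singular"},
    ],
}

def build_index(parses_at):
    """Map (attribute, value) -> {position: set of parse indexes}."""
    index = defaultdict(lambda: defaultdict(set))
    for position, parses in parses_at.items():
        for parse_no, parse in enumerate(parses, start=1):
            for attr, value in parse.items():
                index[(attr, value)][position].add(parse_no)
    return index

def search_and(index, *terms):
    """Return positions where at least one single parse satisfies every term."""
    postings = [index.get(term, {}) for term in terms]
    if not postings:
        return []
    # Step 1: ordinary intersection on positions.
    common_positions = set(postings[0]).intersection(*postings[1:])
    hits = []
    for position in sorted(common_positions):
        # Step 2: keep the position only if the parse indexes overlap.
        shared = set.intersection(*(p[position] for p in postings))
        if shared:
            hits.append(position)
    return hits

index = build_index(parses_at)
print(search_and(index, ("pos", "verb"), ("number", "singular")))  # []
print(search_and(index, ("pos", "verb"), ("tense", "past")))       # [1234]
```

Since a word has at most 8 parses, the set of parse indexes for one posting 
would fit into a single byte used as a bitmask, and the overlap check would 
then be a bitwise AND.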

I am new to Lucene, so could you please tell me whether this idea is 
implementable in Lucene, and how much effort the implementation would take?


-- 
Best Regards,
Igor Shalyminov
