It is English. I am using Lucene's StandardAnalyzer, and it indexes the words at the correct positions. Can we map the token positions from OpenNLP to Lucene's positions?
Tri.

On Sun, Nov 6, 2011 at 7:28 AM, James Kosin <[email protected]> wrote:

> Tri,
>
> Unfortunately, it depends on the input language. Only thing I've found
> is it may be better to find the tokens that are punctuation. A hint is
> most tokens that are punctuation are a single character wide. But,
> again that may not be the case depending on the encoding and the
> punctuation. Words are usually a bit longer.
>
> James
>
> On 11/5/2011 2:14 PM, Tri Nguyen wrote:
> > Thank you James,
> > I don't count the token having pattern ".*[A-Za-z0-9]+.*" and check some
> > cases it works.
> > The token is not satisfied that pattern can be a punctuation. Is that
> > pattern enough to cover a keyword?
> > Can we incorporate Lucene and OpenNLP so that the keyword position and
> > Named Entity position are compatible?
> >
> > On Sun, Nov 6, 2011 at 12:22 AM, James Kosin <[email protected]> wrote:
> >
> >> Tri,
> >>
> >> You could just subtract the number of punctuation tokens from the
> >> offsets you get.
> >>
> >> On 11/5/2011 1:08 PM, Tri Nguyen wrote:
> >>> On Sat, Nov 5, 2011 at 11:30 PM, Jörn Kottmann <[email protected]> wrote:
> >>>> On 11/5/11 4:53 PM, Tri Nguyen wrote:
> >>>>
> >>>>> Obama is correct, but Bill Gates. Since the NameFinderME return the token
> >>>>> index (position in the token array) not the keyword position (the keyword
> >>>>> position in the text). I want to cooperate with keyword position in
> >>>>> Lucene.
> >>>>>
> >>>> What is a keyword position?
> >>>>
> >>> It is the order of the word in the text.
> >>> Ex:
> >>> Barack: 0
> >>> Obama: 1
> >>> president: 3
> >>> US: 5
> >>> he: 6
> >>> 1961: 11
> >>> Bill: 12
> >>>
> >>>> Jörn
> >>>>
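
The idea discussed in the thread (skip punctuation-only tokens when counting, using the pattern ".*[A-Za-z0-9]+.*") could be sketched roughly as below. This is only an illustration in plain Java, not OpenNLP or Lucene code: the class name TokenPositionMapper and the method wordPositions are hypothetical, and the token array stands in for whatever the OpenNLP tokenizer returned. It also assumes that Lucene splits the text into the same words as the OpenNLP tokenizer and that removed stop words still occupy a position, which may not hold for every analyzer configuration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP or Lucene.
class TokenPositionMapper {
    // Pattern from the thread: a token counts as a word if it
    // contains at least one ASCII letter or digit.
    private static final Pattern WORD = Pattern.compile(".*[A-Za-z0-9]+.*");

    /**
     * For each token index, returns the word position obtained by
     * counting only word tokens, or -1 for punctuation-only tokens.
     */
    static int[] wordPositions(String[] tokens) {
        int[] positions = new int[tokens.length];
        int next = 0;
        for (int i = 0; i < tokens.length; i++) {
            if (WORD.matcher(tokens[i]).matches()) {
                positions[i] = next++;   // word token: assign next position
            } else {
                positions[i] = -1;       // punctuation: no word position
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // Stand-in for the token array produced by a tokenizer.
        String[] tokens = {"Barack", "Obama", "is", "president", ",", "Bill", "Gates", "."};
        // prints [0, 1, 2, 3, -1, 4, 5, -1]
        System.out.println(Arrays.toString(wordPositions(tokens)));
    }
}
```

With this mapping, a name span reported as token indices 5 to 6 ("Bill Gates") would correspond to word positions 4 to 5. If the two tokenizers split words differently, comparing character offsets instead of counting tokens is probably more robust, since both libraries can report start and end offsets for each token.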
