Tri,

Unfortunately, it depends on the input language. The only thing I've found is that it may be better to find the tokens that are punctuation. A hint is that most punctuation tokens are a single character wide, while words are usually a bit longer. But, again, that may not be the case, depending on the encoding and the punctuation.
James

On 11/5/2011 2:14 PM, Tri Nguyen wrote:
> Thank you James,
> I don't count the token having pattern ".*[A-Za-z0-9]+.*" and check some
> cases it works.
> The token is not satisfied that pattern can be a punctuation. Is that
> pattern enough to cover a keyword?
> Can we incorporate Lucene and OpenNLP so that the keyword position and
> Named Entity position are compatible?
>
>
> On Sun, Nov 6, 2011 at 12:22 AM, James Kosin <[email protected]> wrote:
>
>> Tri,
>>
>> You could just subtract the number of punctuation tokens from the
>> offsets you get.
>>
>> On 11/5/2011 1:08 PM, Tri Nguyen wrote:
>>> On Sat, Nov 5, 2011 at 11:30 PM, Jörn Kottmann <[email protected]> wrote:
>>>> On 11/5/11 4:53 PM, Tri Nguyen wrote:
>>>>
>>>>> Obama is correct, but Bill Gates. Since the NameFinderME return the token
>>>>> index (position in the token array) not the keyword position (the keyword
>>>>> position in the text). I want to cooperate with keyword position in
>>>>> Lucene.
>>>>>
>>>> What is a keyword position?
>>>>
>>> It is the order of the word in the text.
>>> Ex:
>>> Barack: 0
>>> Obama: 1
>>> president: 3
>>> US: 5
>>> he: 6
>>> 1961: 11
>>> Bill: 12
>>>
>>>> Jörn
>>>>
>>
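
For what it's worth, the "subtract the punctuation tokens" idea from the thread can be sketched in a few lines. This is only a rough illustration in Python, not OpenNLP or Lucene code: the function names and the sample tokens are made up, the span is assumed to be a half-open range of token indexes, and the word test is the same ASCII-only pattern quoted above, so it will misclassify tokens in languages that do not use those characters. It also does not account for any position gaps an analyzer might add, for example when stop words are removed.

```python
import re

# Same test as the pattern ".*[A-Za-z0-9]+.*" quoted in the thread:
# a token counts as a word if it contains at least one ASCII letter or digit.
WORD_RE = re.compile(r"[A-Za-z0-9]")


def word_positions(tokens):
    """Map each token index to a word position, or None for punctuation."""
    positions = []
    next_pos = 0
    for tok in tokens:
        if WORD_RE.search(tok):
            positions.append(next_pos)
            next_pos += 1
        else:
            # Punctuation-only token: it does not consume a word position.
            positions.append(None)
    return positions


def span_to_word_positions(start, end, positions):
    """Convert a token span [start, end) to the word positions it covers."""
    return [p for p in positions[start:end] if p is not None]


# Hypothetical tokenizer output, not the text from the thread.
tokens = ["Bill", "Gates", ",", "the", "founder", ",", "met",
          "Barack", "Obama", "."]
positions = word_positions(tokens)

# A name span covering token indexes 7 and 8 ("Barack", "Obama").
print(span_to_word_positions(7, 9, positions))
```

Building the whole index-to-position table once per sentence keeps the lookup for each name span cheap, and the None entries make it easy to see which tokens were treated as punctuation.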
