I was able to do it a few days ago.

Here is how I fixed the problem:

It is simple once you understand how the NER spans work, which is different
from the sentence and token spans: those return the actual character indices
into the string.

The Spans for sentences are the actual string positions, while the Spans for
the NERs are ordinal token numbers, so [3...5[ means tokens 3 and 4 only.

I then calculated the character positions by taking the tokenizer spans,
which give the positions of the actual tokens (words) in the string. From
those I selected span[3] and span[4], took the start from the lowest token
and the end from the highest. That is how I did it, in case someone needs to
do the same.
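
For anyone who wants to see the idea in code, here is a minimal,
self-contained sketch. It does not use OpenNLP itself: the Range class is a
hypothetical stand-in for opennlp.tools.util.Span, and the whitespace
tokenizePos method below only imitates what Tokenizer.tokenizePos(String)
gives you (character offsets per token). Only the toCharRange step is the
actual technique described above.

```java
import java.util.ArrayList;
import java.util.List;

public class EntityOffsets {

    /** Half-open range [start, end); hypothetical stand-in for OpenNLP's Span. */
    public static final class Range {
        public final int start;
        public final int end;

        public Range(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Simple whitespace tokenizer that returns the character offsets of each
     * token. It only imitates the role of Tokenizer.tokenizePos in OpenNLP.
     */
    public static List<Range> tokenizePos(String text) {
        List<Range> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip any run of whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                spans.add(new Range(start, i));
            }
        }
        return spans;
    }

    /**
     * Converts an entity range expressed in token indices (end exclusive)
     * into a character range: start of the first token, end of the last one.
     */
    public static Range toCharRange(List<Range> tokenSpans, Range entity) {
        int start = tokenSpans.get(entity.start).start;
        int end = tokenSpans.get(entity.end - 1).end;
        return new Range(start, end);
    }

    public static void main(String[] args) {
        // Note the two spaces between "Nancy" and "Reagan".
        String sentence = "Former first lady Nancy  Reagan";
        List<Range> tokenSpans = tokenizePos(sentence);

        // Token-index range as the name finder would report it: [3, 5[.
        Range entity = new Range(3, 5);
        Range chars = toCharRange(tokenSpans, entity);

        System.out.println(chars.start + ".." + chars.end + " -> '"
                + sentence.substring(chars.start, chars.end) + "'");
    }
}
```

Because the character range is taken from the token offsets rather than
rebuilt by joining tokens with a single space, extra whitespace between
the tokens does not shift the result.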


On Wed, May 21, 2014 at 8:12 AM, Michael Lieberman <
[email protected]> wrote:

> Hi Carlos, you should be able to get the substring of the text covered by
> the start position of the first span, to the end position of the last span.
> This will include whatever number of spaces between the spans.
>
>
> On Tue, May 20, 2014 at 5:08 AM, Carlos Scheidecker <[email protected]
> >wrote:
>
> > Hello all,
> >
> > I have a question here and documentation was not very helpful.
> >
> > I want to extract the position of an entity relative to the sentence.
> >
> > If I run the name finder I will get tokens and Spans.
> >
> > The thing is, I thought the getStart and getEnd functions on the spans
> > were the actual character begin and character end.
> >
> > But what it looks like is that it is the beginning token number and end
> > token number instead.
> >
> > So it seems that if you have a set of tokens of your sentence such as:
> >
> > [Former, first, lady, Nancy, Reagan]
> >
> > Then the span for the name entity Nancy Reagan would be Start = 3,
> > End = 5.
> >
> > That means Start 3 for the Span is the 4th token on that Array which is
> > Nancy.
> >
> > Then the End value is 5, which means it covers [3,5[ or [3,4], where 5
> > is exclusive.
> >
> > Therefore, if I were to calculate the position of Nancy Reagan I would
> > need to get the text covered by the Span, that is, its content, which in
> > this case would be covered by 2 Spans.
> >
> > So if I do :
> >
> > StringBuilder cb = new StringBuilder();
> > for (int ti = sentenceAnnotations.get(si).getSpan().getStart(); ti <
> >  sentenceAnnotations.get(si).getSpan().getEnd(); ti++) {
> > cb.append(tokens[ti]).append(" ");
> > }
> >
> > I can get the value "Nancy Reagan" on the cb variable.
> >
> > I could do a string search but it would fail big time if, for some odd
> > instance, there are 2 spaces between the tokens Nancy and Reagan.
> >
> > Therefore, how can I get the start and end characters for this Entity
> > "Nancy Reagan" in this sentence if there are situations where there
> > might be more than one space between tokens?
> >
> > What is a good approach to mark the position of an entity on it?
> >
> > Thanks,
> >
> > Carlos.
> >
>
