Hi Pajolma,
You may want to adjust what you measure. Both the annotate and candidates
endpoints encompass spotting, so it is entirely expected that
spot+candidates takes longer than annotate alone. In the (old) IR-based
implementation (from the paper you cite) you may be able to make sense
of this by comparing the timings of spot+disambiguate with annotate.
Their total completion times should be (roughly) equivalent if I
understand correctly. I'm not sure whether the same holds for the newer
statistical version, but it can at least be verified this way.
The difference between annotate and candidates is merely that annotate
selects the candidate with the highest disambiguation score that passes
the confidence threshold. That should make for an insignificant
difference in runtime. Someone please correct me if I'm wrong.
David's suggestion to use SpotlightInterface to measure the timings of
the various pipelines seems like the way to go if you want to do these
measurements cleanly from your Java code. I'm not a Java dev, however,
so for API usage tips many of the other subscribers to this list would
have a better chance of helping with that.
For more background on how the faster statistical implementation (which
you are most likely using) builds its models, and what it needs to do at
runtime for spotting and disambiguation, please see Joachim Daiber et
al. "Improving efficiency and accuracy in multilingual entity
extraction". I've also written about this for the ERD'14 challenge:
http://www.e.humanities.uva.nl/publications/2014/olie:enti14.pdf
There is at least one relevant difference between language-independent and
language-dependent spotting, which is configurable. The former,
lexicon-based Aho-Corasick spotting, should be significantly faster than
OpenNLP spotting. Intuitively, I would say that disambiguation should
take longer than spotting, but Jo et al. did an exceptional job of
fitting the models in memory and speeding this up. So, I'm also very
interested in what you will discover!
Best of luck,
Alex
On 9-6-2015 9:57, Pajolma Rupi wrote:
Hi David,
Yes, my objective was to test the running time for each endpoint, so
that I have an idea of which phase takes longer during the
annotation process.
I ran a few tests with small text files and it seems like the phrase
spotting phase (spot endpoint + candidates endpoint) takes longer in
comparison to the disambiguation one (annotate endpoint). My
explanation would be that during the disambiguation phase only the
contextual score is taken into account (if I understood the paper
*DBpedia Spotlight: Shedding Light on the Web of Documents* correctly,
the resource with the highest contextual score is chosen), and this
score is already calculated during phrase spotting (more precisely,
during the candidate generation sub-phase). Given this, the
disambiguation phase consists of just choosing the resource with the
highest contextual score, and so takes much less time than the phrase
spotting one. Please let me know if you have a different opinion on
the matter.
Best,
Pajolma
------------------------------------------------------------------------
*From: *"David Przybilla" <[email protected]>
*To: *"Pajolma Rupi" <[email protected]>
*Cc: *[email protected]
*Sent: *Friday, June 5, 2015 10:19:07 AM
*Subject: *Re: [Dbp-spotlight-users] Time performance for each phase
Hi Pajolma,
Sorry, I misunderstood "performance" :) and I thought we were
talking about the quality of the extractions.
If it is about benchmarking time, then I guess yes, you could call the
given endpoints and subtract the times.
Another possibility is to take a look at SpotlightInterface,
which encodes all the pipelines for `candidates`, `annotate` and
`spot`, then isolate the calls, passing some test set that you
could provide.
On Thu, Jun 4, 2015 at 4:30 PM, Pajolma Rupi
<[email protected]> wrote:
Hi David,
I managed to find the kore50 corpus but not the milne-witten
one. Do you know if it's still publicly available?
In order to test the time performance of each phase, I was
thinking of using the available endpoints:
1-spot
2-candidates
3-disambiguate
4-annotate
Because the *disambiguate* endpoint requires me to provide NE
annotations in my call, I was thinking of using the *annotate*
endpoint instead and subtracting the time consumed by the
*candidates* endpoint, in order to get the time consumed by the
disambiguation phase. Would such logic be correct with respect to
the implementation? Is there any other phase in the pipeline
(between disambiguation and annotation) which might affect this
logic? If I understood it correctly, the pipeline consists of the
processing done by each of the endpoints in the order I've listed
them above. Please let me know if that is not the case.
Thank you in advance,
Pajolma
------------------------------------------------------------------------
*From: *"David Przybilla" <[email protected]
<mailto:[email protected]>>
*To: *"Pajolma Rupi" <[email protected]
<mailto:[email protected]>>
*Cc: *[email protected]
<mailto:[email protected]>
*Sent: *Tuesday, June 2, 2015 6:45:19 PM
*Subject: *Re: [Dbp-spotlight-users] Time performance for
each phase
Hi Pajolma,
As far as I know there are no separate evaluations out of
the box, but you could use the milne-witten corpus to
evaluate the spotter and disambiguation separately.
In my experience problems are usually related to spotting:
surface forms which are not in the models, or surface forms
whose probability is too low.
There is also a specific corpus for evaluating
disambiguation (kore50).
On Tue, Jun 2, 2015 at 1:58 PM, Pajolma Rupi
<[email protected]> wrote:
Dear all,
I was not able to find any information regarding the
time performance of the Spotlight service for each of the
phases (separately): phrase spotting (candidate
generation, candidate selection), disambiguation, and
indexing. There are some numbers in the paper
"*Improving efficiency and accuracy in multilingual
entity extraction*", but they are calculated for the
whole annotation process, whereas I'm interested in
knowing during which specific phase the service performs
better and during which it performs worse.
Could you please let me know if such information
exists already?
I would also be interested in knowing if I can produce
such information by running my own local instance of
Spotlight (I'm using Java to annotate text).
Thank you in advance,
Pajolma
------------------------------------------------------------------------------
_______________________________________________
Dbp-spotlight-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users