Hi Pajolma,

Taking a closer look, it may be that the easiest approach is to measure
the time in the rest module. If you take a look at [1], you can see all
the classes that match the endpoints. If you wrap one of those methods
with time-measuring code and then make the right request to Spotlight,
depending on the method you wrapped (i.e. POST, GET, html, xml, etc.),
it should be straightforward :)

[1] https://github.com/dbpedia-spotlight/dbpedia-spotlight/tree/master/rest/src/main/java/org/dbpedia/spotlight/web/rest/resources
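Something along these lines might do it. This is just a rough sketch
from me: the Timed helper below is hypothetical, not part of the
codebase.

    import java.util.concurrent.Callable;

    // Hypothetical helper (not in Spotlight): wraps a piece of work
    // and prints how long it took.
    public class Timed {
        public static <T> T time(String label, Callable<T> work) throws Exception {
            long start = System.nanoTime();
            try {
                return work.call();
            } finally {
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.err.println(label + " took " + elapsedMs + " ms");
            }
        }
    }

In whichever resource method you pick, you would then wrap the existing
body in an anonymous Callable (or a lambda on Java 8), e.g.
Timed.time("annotate/json", ...), and fire the matching request at your
local server.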
On Thu, Jun 11, 2015 at 9:23 AM, Pajolma Rupi <[email protected]> wrote:

> Hello David,
>
> Can I ask you for a few more hints on how to isolate the calls for
> `candidates`, `annotate` and `spot` while using the SpotlightInterface?
>
> Here is what I tried (still using the jar file via Java):
>
> SpotlightInterface si = new SpotlightInterface("annotate/");
> String text = "Sherlock Holmes plot took place in 1891, in London.";
> String inUrl = "http://localhost:2222/rest/annotate/";
> double confidence = 0.0;
> int support = 10;
> String dbpediaTypesString = "";
> String sparqlQuery = "";
> String policy = "whitelist";
> boolean coreferenceResolution = true;
> String clientIp = "";
> String spotterName = "CoOccurrenceBasedSelector";
> String disambiguator = "DefaultDisambiguator";
>
> String result = si.getJSON(text, inUrl, confidence, support,
>         dbpediaTypesString, sparqlQuery, policy, coreferenceResolution,
>         clientIp, spotterName, disambiguator);
>
> But I get the following error:
>
> *Exception in thread "main" org.dbpedia.spotlight.exceptions.InputException:
> No spotters were loaded. Please add one of []....*
>
> Could you please let me know which spotter and disambiguator I should
> use in order to get correct results? (I'm trying to guess the next
> problem I might have with the disambiguator name :) )
>
> @Alex
> Thanks a lot for your comments! My concern is that I'm not sure the
> `candidates` endpoint and the `disambiguate` one are fully separated
> from each other: I have the impression that part of the disambiguation
> logic may already be performed during candidate generation (the
> contextual score calculation), which makes me doubt the significance of
> simply comparing the time performance of the different endpoints. Let
> me know if you see it differently.
> You're right, I am using the statistical version, but the paper you
> point to ("Improving efficiency and accuracy in multilingual entity
> extraction") doesn't give me enough helpful information.
>
> Thank you in advance,
> Pajolma
>
> ------------------------------
>
> *From: *"Alex Olieman" <[email protected]>
> *To: *[email protected]
> *Sent: *Tuesday, June 9, 2015 12:51:41 PM
> *Subject: *Re: [Dbp-spotlight-users] Time performance for each phase
>
> Hi Pajolma,
>
> You may want to adjust what you measure. Both the annotate and
> candidates endpoints encompass spotting, so it is entirely expected
> that spot+candidates takes longer than annotate alone. In the (old)
> IR-based implementation (from the paper you cite) you may be able to
> make sense of this by comparing the timings of spot+disambiguate with
> annotate. Their total times of completion should be (roughly)
> equivalent, if I understand correctly. I'm not sure whether this is the
> same for the newer statistical version, but it can at least be verified
> this way.
>
> The difference between annotate and candidates is merely that annotate
> selects the candidate with the highest disambiguation score that passes
> the confidence threshold. That should make for an insignificant
> difference in runtime.
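> In other words, roughly (an illustrative pseudo-Java sketch only, not
> Spotlight's actual code):
>
>     // Illustrative only: candidates returns every entry; annotate
>     // keeps just the highest-scoring one above the confidence threshold.
>     public class AnnotateVsCandidates {
>         static class Candidate {
>             final String uri;
>             final double score;
>             Candidate(String uri, double score) { this.uri = uri; this.score = score; }
>         }
>
>         static Candidate annotate(Iterable<Candidate> candidates, double confidence) {
>             Candidate best = null;
>             for (Candidate c : candidates) {
>                 if (c.score >= confidence && (best == null || c.score > best.score)) {
>                     best = c;
>                 }
>             }
>             return best;
>         }
>     }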
> Someone please correct me if I'm wrong.
>
> David's suggestion to use SpotlightInterface to measure the timings of
> the various pipelines seems the way to go if you want to do these
> measurements cleanly from your Java code. I'm not a Java dev, however,
> so for API usage tips many of the other subscribers to this list would
> have a better chance of helping you.
>
> For more background on how the faster statistical implementation (which
> you are most likely using) builds its models, and what it needs to do
> at runtime for spotting and disambiguation, please see Joachim Daiber
> et al., "Improving efficiency and accuracy in multilingual entity
> extraction". I've also written about this for the ERD'14 challenge:
> http://www.e.humanities.uva.nl/publications/2014/olie:enti14.pdf
>
> There is at least one relevant difference between language-independent
> and language-dependent spotting, which is configurable. The first,
> lexicon-based Aho-Corasick spotting, should be significantly faster
> than OpenNLP spotting. Intuitively, I would say that disambiguation
> should take longer than spotting, but Jo et al. did an exceptional job
> of fitting the models in memory and speeding this up. So I'm also very
> interested in what you will discover!
>
> Best of luck,
>
> Alex
>
> On 9-6-2015 9:57, Pajolma Rupi wrote:
>
> Hi David,
>
> Yes, my objective was to test the running time for each endpoint, so
> that I get an idea of which phase takes longest during the annotation
> process. I ran a few tests with small text files, and it seems that the
> phrase spotting phase (spot endpoint + candidates endpoint) takes
> longer than the disambiguation one (annotate endpoint). My explanation
> would be that during the disambiguation phase only the contextual score
> is taken into account (if I understood the paper *DBpedia Spotlight:
> Shedding Light on the Web of Documents* correctly, the resource with
> the biggest contextual score is chosen), and this score is already
> calculated during phrase spotting (more precisely, during the candidate
> generation sub-phase). Given this, disambiguation consists of just
> choosing the resource with the biggest contextual score, and so takes
> much less time than phrase spotting. Please let me know if you have a
> different opinion on the matter.
>
> Best,
> Pajolma
>
> ------------------------------
>
> *From: *"David Przybilla" <[email protected]>
> *To: *"Pajolma Rupi" <[email protected]>
> *Cc: *[email protected]
> *Sent: *Friday, June 5, 2015 10:19:07 AM
> *Subject: *Re: [Dbp-spotlight-users] Time performance for each phase
>
> Hi Pajolma,
>
> Sorry, I misunderstood "performance" :) and I thought we were talking
> about the quality of the extractions.
>
> If it is benchmarking time, then I guess yes, you could call the given
> endpoints and subtract the times.
>
> Another possibility is to take a look at SpotlightInterface, which
> encodes all the pipelines for `candidates`, `annotate` and `spot`, then
> isolate the calls, passing some test set that you provide.
>
> On Thu, Jun 4, 2015 at 4:30 PM, Pajolma Rupi <[email protected]>
> wrote:
>
>> Hi David,
>>
>> I managed to find the kore50 corpus but not the milne-witten one. Do
>> you know if it's still publicly available?
>>
>> In order to test the time performance of each phase, I was thinking
>> of using the available endpoints:
>>
>> 1-spot
>> 2-candidates
>> 3-disambiguate
>> 4-annotate
>>
>> Because using the *disambiguate* endpoint would require me to provide
>> NE annotations in my call, I was thinking of using the *annotate*
>> endpoint instead and subtracting the time consumed by the
>> *candidates* endpoint, in order to get the time consumed by the
>> disambiguation phase.
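>> Concretely, I had something like this in mind (a rough sketch; the
>> parameter defaults are guesses and error handling is omitted):
>>
>> import java.io.InputStream;
>> import java.net.HttpURLConnection;
>> import java.net.URL;
>> import java.net.URLEncoder;
>>
>> public class EndpointTimer {
>>
>>     // Time one GET request to a local Spotlight endpoint, in ms.
>>     static long timeGet(String endpoint, String text) throws Exception {
>>         String query = "?text=" + URLEncoder.encode(text, "UTF-8")
>>                 + "&confidence=0.0&support=10";
>>         URL url = new URL("http://localhost:2222/rest/" + endpoint + query);
>>         long start = System.nanoTime();
>>         HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>>         conn.setRequestProperty("Accept", "application/json");
>>         InputStream in = conn.getInputStream();
>>         while (in.read() != -1) { /* drain the response */ }
>>         in.close();
>>         return (System.nanoTime() - start) / 1_000_000;
>>     }
>>
>>     public static void main(String[] args) throws Exception {
>>         String text = "Sherlock Holmes plot took place in 1891, in London.";
>>         long annotate = timeGet("annotate", text);
>>         long candidates = timeGet("candidates", text);
>>         System.out.println("annotate:        " + annotate + " ms");
>>         System.out.println("candidates:      " + candidates + " ms");
>>         System.out.println("disambiguation ~ " + (annotate - candidates) + " ms");
>>     }
>> }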
>> Would such logic be correct with respect to the implementation? Is
>> there any other phase in the pipeline (between disambiguation and
>> annotation) which might affect this logic? If I understood it well,
>> the pipeline consists of the processing done by each of the endpoints
>> in the order that I've listed them above. Please let me know if that
>> is not the case.
>>
>> Thank you in advance,
>> Pajolma
>>
>> ------------------------------
>>
>> *From: *"David Przybilla" <[email protected]>
>> *To: *"Pajolma Rupi" <[email protected]>
>> *Cc: *[email protected]
>> *Sent: *Tuesday, June 2, 2015 6:45:19 PM
>> *Subject: *Re: [Dbp-spotlight-users] Time performance for each phase
>>
>> Hi Pajolma,
>>
>> As far as I know there are no separate evaluations out of the box, but
>> you could use the milne-witten corpus to evaluate the spotter and the
>> disambiguation separately.
>>
>> In my experience, problems are usually related to spotting: surface
>> forms which are not in the models, or surface forms without enough
>> probability.
>>
>> There is also a specific corpus for evaluating disambiguation (kore50).
>>
>> On Tue, Jun 2, 2015 at 1:58 PM, Pajolma Rupi <[email protected]>
>> wrote:
>>
>>> Dear all,
>>>
>>> I was not able to find information regarding the time performance of
>>> the Spotlight service for each of the phases separately: phrase
>>> spotting (candidate generation, candidate selection), disambiguation,
>>> indexing. There are some numbers in the paper "*Improving efficiency
>>> and accuracy in multilingual entity extraction*", but they are
>>> calculated over the whole annotation process, whereas I'm interested
>>> in knowing during which specific phase the service performs better
>>> and during which phase it performs worse.
>>>
>>> Could you please let me know if such information already exists?
>>> I would also be interested in knowing whether I can produce such
>>> information by running my own local instance of Spotlight (I'm using
>>> Java to annotate text).
>>>
>>> Thank you in advance,
>>> Pajolma
