Hi Pajolma,

You may want to adjust what you measure. Both the annotate and candidates endpoints run spotting themselves, so adding the spot time to the candidates time counts spotting twice, and it is entirely expected that spot+candidates takes longer than annotate alone. In the (old) IR-based implementation (from the paper you cite) you could instead compare the timing of spot+disambiguate with that of annotate: their total completion times should be roughly equivalent, if I understand correctly. I'm not sure whether the same holds for the newer statistical version, but it can at least be verified this way.

The difference between annotate and candidates is merely that annotate selects the candidate with the highest disambiguation score that passes the confidence threshold. That should make for an insignificant difference in runtime. Someone please correct me if I'm wrong.
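
If you want to sanity-check the endpoint timings over HTTP before digging into the Java internals, something along these lines could work. This is only a rough sketch: I'm assuming a local statistical server on localhost:2222 with the usual rest/spot, rest/candidates and rest/annotate paths and a confidence of 0.5; adjust host, port, paths and parameters to your own setup.

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class EndpointTimer {

    // Assumed local Spotlight instance; change to match your server.
    private static final String BASE = "http://localhost:2222/rest/";

    // Time a single GET request against one endpoint, draining the response
    // so the full round trip is measured. Returns milliseconds.
    static long timeEndpoint(String endpoint, String text) throws IOException {
        String query = "text=" + URLEncoder.encode(text, "UTF-8") + "&confidence=0.5";
        URL url = new URL(BASE + endpoint + "?" + query);
        long start = System.nanoTime();
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        try (InputStream in = conn.getInputStream()) {
            byte[] buffer = new byte[4096];
            while (in.read(buffer) != -1) { /* drain */ }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws IOException {
        String text = "Berlin is the capital of Germany.";
        long spot = timeEndpoint("spot", text);
        long candidates = timeEndpoint("candidates", text);
        long annotate = timeEndpoint("annotate", text);
        System.out.printf("spot: %d ms, candidates: %d ms, annotate: %d ms%n",
                spot, candidates, annotate);
        // annotate and candidates both include spotting and candidate scoring,
        // so annotate - candidates should be small; candidates - spot roughly
        // shows what candidate generation/scoring adds on top of spotting.
        System.out.printf("candidates - spot: %d ms, annotate - candidates: %d ms%n",
                candidates - spot, annotate - candidates);
    }
}

A single small document will of course be dominated by HTTP and JSON overhead, so longer texts and repeated runs give a fairer picture.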

David's suggestion to use SpotlightInterface to measure the timings of the various pipelines seems like the way to go if you want to do these measurements cleanly from your Java code. I'm not a Java dev, however, so for API usage tips the other subscribers on this list will be better placed to help.

For more background on how the faster statistical implementation (which you are most likely using) builds its models, and what it needs to do at runtime for spotting and disambiguation, please see Joachim Daiber et al. "Improving efficiency and accuracy in multilingual entity extraction". I've also written about this for the ERD'14 challenge: http://www.e.humanities.uva.nl/publications/2014/olie:enti14.pdf

One relevant difference, which is configurable, is between language-independent and language-dependent spotting: the former, lexicon-based Aho-Corasick spotting, should be significantly faster than OpenNLP spotting. Intuitively, I would say that disambiguation should take longer than spotting, but Jo et al. did an exceptional job of fitting the models in memory and speeding this up. So I'm also very interested in what you will discover!
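
One practical note if you do measure over HTTP: the first request against a freshly started server pays JIT and cache warm-up costs, so doing a warm-up call and averaging over several runs gives more stable numbers. A rough sketch, reusing the timeEndpoint helper from the snippet above:

    // Assumes the timeEndpoint helper from the earlier sketch.
    static long averageMillis(String endpoint, String text, int runs) throws IOException {
        timeEndpoint(endpoint, text);        // warm-up call, not counted
        long total = 0;
        for (int i = 0; i < runs; i++) {
            total += timeEndpoint(endpoint, text);
        }
        return total / runs;                 // mean latency in milliseconds
    }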

Best of luck,

Alex

On 9-6-2015 9:57, Pajolma Rupi wrote:
Hi David,

Yes, my objective was to test the running time for each endpoint, so that I have an idea of which phase takes longer during the annotation process. I ran a few tests with small text files and it seems like the phrase spotting phase (spot endpoint + candidates endpoint) takes longer than the disambiguation one (annotate endpoint). My explanation would be that during the disambiguation phase only the contextual score is taken into account (if I understood the paper *DBpedia Spotlight: Shedding Light on the Web of Documents* correctly, the resource with the highest contextual score is chosen), and this score is already calculated during phrase spotting (more precisely, during the candidate generation sub-phase). Given this, disambiguation consists of just choosing the resource with the highest contextual score and takes much less time than phrase spotting. Please let me know if you have a different opinion on the matter.

Best,
Pajolma


------------------------------------------------------------------------

    *From: *"David Przybilla" <[email protected]>
    *To: *"Pajolma Rupi" <[email protected]>
    *Cc: *[email protected]
    *Sent: *Friday, June 5, 2015 10:19:07 AM
    *Subject: *Re: [Dbp-spotlight-users] Time performance for each phase

    Hi Pajolma,

    Sorry, I misunderstood "performance" :) and thought we were
    talking about the quality of the extractions.

    If it is about benchmarking time, then yes, I guess you could call
    the given endpoints and subtract the times.

    Another possibility is to take a look at SpotlightInterface,
    which encodes all the pipelines for `candidates`, `annotate` and
    `spot`; you could then isolate the calls, passing some test set
    of your own.




    On Thu, Jun 4, 2015 at 4:30 PM, Pajolma Rupi
    <[email protected] <mailto:[email protected]>> wrote:

        Hi David,

        I managed to find the kore50 corpus but not the milne-witten
        one. Do you know if it's still publicly available?

        In order to test the time performance of each phase, I was
        thinking of using the available endpoints:

        1-spot
        2-candidates
        3-disambiguate
        4-annotate

        Because using the *disambiguate* endpoint would require me to
        provide NE annotations in my call, I was thinking of using the
        *annotate* endpoint instead and subtracting the time consumed
        by the *candidates* endpoint, in order to get the time consumed
        by the disambiguation phase. Would such logic be correct with
        respect to the implementation? Is there any other phase in the
        pipeline (between disambiguation and annotation) which might
        affect this logic? If I understand it correctly, the pipeline
        consists of the processing done by each of the endpoints in
        the order I've listed them above. Please let me know if that
        is not the case.

        Thank you in advance,
        Pajolma

        ------------------------------------------------------------------------

            *From: *"David Przybilla" <[email protected]
            <mailto:[email protected]>>
            *To: *"Pajolma Rupi" <[email protected]
            <mailto:[email protected]>>
            *Cc: *[email protected]
            <mailto:[email protected]>
            *Sent: *Tuesday, June 2, 2015 6:45:19 PM
            *Subject: *Re: [Dbp-spotlight-users] Time performance for
            each phase


            Hi Pajolma,

            As far as I know there are no separate evaluations out of
            the box, but you could use the milne-witten corpus to
            evaluate the spotter and the disambiguation separately.

            In my experience, problems are usually related to spotting:
            surface forms that are not in the models, or surface forms
            without enough probability.

            There is also a specific corpus for evaluating
            disambiguation (kore50).


            On Tue, Jun 2, 2015 at 1:58 PM, Pajolma Rupi
            <[email protected] <mailto:[email protected]>> wrote:

                Dear all,

                I was not able to find any information regarding the
                time performance of the Spotlight service for each of
                the phases (separately): phrase spotting (candidate
                generation, candidate selection), disambiguation,
                indexing. There are some numbers in the paper
                "*Improving efficiency and accuracy in multilingual
                entity extraction*", but they are calculated for the
                whole annotation process, whereas I'm interested in
                knowing during which specific phase the service
                performs better and during which it performs worse.

                Could you please let me know if such information
                exists already?
                I would also be interested in knowing if I can produce
                such information by running my own local instance of
                Spotlight (I'm using Java in order to annotate text).

                Thank you in advance,
                Pajolma

                








------------------------------------------------------------------------------
_______________________________________________
Dbp-spotlight-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users
