Hi Pajolma,
You may want to adjust what you measure. Both the annotate and candidates
endpoints encompass spotting, so it is entirely expected that
spot+candidates takes longer than annotate alone. In the (old) IR-based
implementation (from the paper you cite) you may be able to make sense
of this by comparing the timings of spot+disambiguate with annotate.
Their total completion times should be (roughly) equivalent if I
understand correctly. I'm not sure whether the same holds for the newer
statistical version, but it can at least be verified this way.
The difference between annotate and candidates is merely that annotate
selects the candidate with the highest disambiguation score that passes
the confidence threshold. That should make for an insignificant
difference in runtime. Someone please correct me if I'm wrong.
David's suggestion to use SpotlightInterface to measure the timings of
the various pipelines seems like the way to go if you want to do these
measurements cleanly from your Java code. I'm not a Java dev, however,
so for API usage tips many of the other subscribers to this list would
have a better chance of helping with that.
For more background on how the faster statistical implementation (which
you are most likely using) builds its models, and what it needs to do at
runtime for spotting and disambiguation, please see Joachim Daiber et
al. "Improving efficiency and accuracy in multilingual entity
extraction". I've also written about this for the ERD'14 challenge:
http://www.e.humanities.uva.nl/publications/2014/olie:enti14.pdf
There is at least one relevant difference between language-independent and
language-dependent spotting, which is configurable. The former,
lexicon-based Aho-Corasick spotting, should be significantly faster than
OpenNLP spotting. Intuitively, I would say that disambiguation should
take longer than spotting, but Jo et al. did an exceptional job of
fitting the models in memory and speeding this up. So, I'm also very
interested in what you will discover!
Best of luck,
Alex
On 9-6-2015 9:57, Pajolma Rupi wrote:
Hi David,
Yes, my objective was to test the running time for each endpoint, so
that I have an idea of which phase takes longer during the
annotation process.
I ran a few tests with small text files and it seems like the phrase
spotting phase (spot endpoint + candidates endpoint) takes longer in
comparison to the disambiguation one (annotate endpoint). My
explanation would be that during the disambiguation phase only the
contextual score is taken into account (if I understood the paper
*DBpedia Spotlight: Shedding Light on the Web of Documents* correctly,
the resource with the highest contextual score is chosen), and this
score is already calculated during phrase spotting (more precisely,
during the candidate generation sub-phase). Given this, the
disambiguation phase consists of just choosing the resource with the
highest contextual score, and so takes much less time than the phrase
spotting one. Please let me know if you have a different opinion on
the matter.
Best,
Pajolma
------------------------------------------------------------------------
*From: *"David Przybilla" <[email protected]>
*To: *"Pajolma Rupi" <[email protected]>
*Cc: *[email protected]
*Sent: *Friday, June 5, 2015 10:19:07 AM
*Subject: *Re: [Dbp-spotlight-users] Time performance for each phase
Hi Pajolma,
Sorry, I misunderstood "performance" :) and I thought we were
talking about the quality of the extractions.
If it is about benchmarking time, then I guess yes, you could call the
given endpoints and subtract the times.
Another possibility is to take a look at SpotlightInterface,
which encodes all the pipelines for `candidates`, `annotate` and
`spot`, then isolate the calls, passing some test set that you
could provide.
On Thu, Jun 4, 2015 at 4:30 PM, Pajolma Rupi
<[email protected]> wrote:
Hi David,
I managed to find the kore50 corpus but not the milne-witten
one. Do you know if it's still publicly available?
In order to test the time performance of each phase, I was
thinking of using the available endpoints:
1-spot
2-candidates
3-disambiguate
4-annotate
Because the *disambiguate* endpoint requires me to provide NE
annotations in my call, I was thinking of using the *annotate*
endpoint instead and subtracting the time consumed by the
*candidates* endpoint, in order to get the time consumed by the
disambiguation phase. Would such logic be correct with respect to
the implementation? Is there any other phase in the pipeline
(between disambiguation and annotation) which might affect this
logic? If I understood it correctly, the pipeline consists of the
processing done by each of the endpoints in the order I've listed
them above. Please let me know if that is not the case.
Thank you in advance,
Pajolma
------------------------------------------------------------------------
*From: *"David Przybilla" <[email protected]
<mailto:[email protected]>>
*To: *"Pajolma Rupi" <[email protected]
<mailto:[email protected]>>
*Cc: *[email protected]
<mailto:[email protected]>
*Sent: *Tuesday, June 2, 2015 6:45:19 PM
*Subject: *Re: [Dbp-spotlight-users] Time performance for
each phase
Hi Pajolma,
As far as I know there are no separate evaluations out of
the box, but you could use the milne-witten corpus to
evaluate the spotter and disambiguation separately.
In my experience problems are usually related to spotting:
surface forms which are not in the models, or surface forms
whose probability is too low.
There is also a specific corpus for evaluating
disambiguation (kore50).
On Tue, Jun 2, 2015 at 1:58 PM, Pajolma Rupi
<[email protected]> wrote:
Dear all,
I was not able to find any information regarding the
time performance of the Spotlight service for each of the
phases (separately): phrase spotting (candidate
generation, candidate selection), disambiguation, and
indexing. There are some numbers in the paper
"*Improving efficiency and accuracy in multilingual
entity extraction*", but they are calculated for the
whole annotation process, whereas I'm interested in
knowing during which specific phase the service performs
better and during which it performs worse.
Could you please let me know if such information
exists already?
I would also be interested in knowing if I can produce
such information by running my own local instance of
Spotlight (I'm using Java to annotate text).
Thank you in advance,
Pajolma
------------------------------------------------------------------------------
_______________________________________________
Dbp-spotlight-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users