Hi Pajolma,

The Python code that parses XML, lets DBp Spotlight annotate the contents of particular tags, and stores the resulting annotations in the XML can be found in dbp_spotlight_xml.py <https://bitbucket.org/aolieman/pm_el_tools/src/5472985f061ea646a82d974dba69e412e0b534b9/pm_el_tools/evaluation/dbp_spotlight_xml.py?at=master>. I have made no particular effort to make this function usable for XML documents in general, but it is a pretty clean example of how XML (or HTML) processing can be done. In my document collection I'm only interested in annotating <p> tags, but you could easily enumerate several tags that are of interest to you, or select a top-level tag that contains the interesting text. The text content of all children of the selected tags is taken into account automatically, but the child tags themselves and their attributes are not.
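
To give a rough idea of the approach without opening the repository, here is a minimal sketch using only the Python standard library. It is not the code from dbp_spotlight_xml.py: the `annotate_tags` function, the `annotate` callable, the stand-in `fake_annotate`, and the `annotation` element with its attributes are all hypothetical names chosen for illustration. In a real setup, `annotate` would send the text to a Spotlight instance.

```python
import xml.etree.ElementTree as ET


def annotate_tags(xml_string, annotate, tags=("p",)):
    """Annotate the text of selected tags and store the results in the XML.

    `annotate` is a hypothetical callable: it takes a string and returns a
    list of (offset, surface_form, uri) tuples, e.g. by calling Spotlight.
    """
    root = ET.fromstring(xml_string)
    # Collect the targets first, so that appending children does not
    # interfere with the iteration over the tree.
    targets = [el for el in root.iter() if el.tag in tags]
    for el in targets:
        # itertext() yields the text of the element and of all its
        # descendants, skipping the tags and attributes themselves.
        text = "".join(el.itertext())
        for offset, surface_form, uri in annotate(text):
            # Hypothetical output schema: one child element per annotation.
            ET.SubElement(el, "annotation", {
                "offset": str(offset),
                "surfaceForm": surface_form,
                "uri": uri,
            })
    return root


def fake_annotate(text):
    """Stand-in annotator that only recognizes the string 'Berlin'."""
    start = text.find("Berlin")
    if start < 0:
        return []
    return [(start, "Berlin", "http://dbpedia.org/resource/Berlin")]


doc = "<doc><p>She moved to <b>Berlin</b> in 2010.</p><note>Berlin</note></doc>"
annotated = annotate_tags(doc, fake_annotate)
print(ET.tostring(annotated, encoding="unicode"))
```

Note that the offsets in this sketch are relative to the concatenated text of each selected element, not to the raw XML file, and that the <note> element is left untouched because it is not in the list of selected tags.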

For a bit more context on how I use this code in my project, you might want to see pool_annotations.py <https://bitbucket.org/aolieman/pm_el_tools/src/5472985f061ea646a82d974dba69e412e0b534b9/pm_el_tools/evaluation/pool_annotations.py?at=master>. I hope this can serve as inspiration for your (Java) project.

If any other DBp Spotlight users are facing similar issues with annotating XML (i.e. if this is a common use-case), please speak up. I might refactor my current code into a general-purpose tool with a few simple configuration options (e.g. to select XML/HTML tags of interest).

Kind regards,
Alex

On 17-6-2015 14:25, Pajolma Rupi wrote:
Dear all,

I am reopening this discussion thread because, after some investigation, I realized that when an XML file is given as input via the "url" parameter, the service returns a reasonably good result (at least for the few examples I used for testing). Here is an example of my call (the only difference is that I'm using my local instance instead, i.e. http://localhost:2222/rest/...):

http://spotlight.dbpedia.org/rest/annotate/?url=http://raweb.inria.fr/rapportsactivite/RA2014/wimmics/wimmics.xml&confidence=0.3&support=100
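
For anyone who wants to reproduce a call like this from a script, the request URL can be assembled roughly as follows. This is only a sketch: the parameter names ("url", "confidence", "support") and the endpoint are taken from the call above, while the function name `build_annotate_url` is a hypothetical helper. The sketch only builds the URL and does not contact the service.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_annotate_url(document_url, confidence=0.3, support=100,
                       endpoint="http://spotlight.dbpedia.org/rest/annotate/"):
    """Build an annotate request URL for a remote document.

    urlencode() percent-encodes the document URL, so characters such as
    '&' or '?' inside it cannot be confused with the service parameters.
    """
    query = urlencode({
        "url": document_url,
        "confidence": confidence,
        "support": support,
    })
    return endpoint + "?" + query


request_url = build_annotate_url(
    "http://raweb.inria.fr/rapportsactivite/RA2014/wimmics/wimmics.xml"
)
print(request_url)

# Decode the query string again to check that the parameters survive intact.
params = parse_qs(urlsplit(request_url).query)
print(params)
```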

Still, I can see that not all the XML nodes containing text were taken into consideration: the output text returned by the service doesn't include all the text content of the XML elements/attributes. From a first check, it seems that only tags that also exist in HTML are taken into consideration (<p> is, but <firstName> is not), but I'd like to be sure about it. Does anybody know the logic behind this processing? I would like to know what to expect from Spotlight when it is run on an XML file.

@Alex
Please let me know if your Python script dealing with the XML input format is already available.

Thank you in advance,
Pajolma
------------------------------------------------------------------------

    *From: *"Pajolma Rupi" <[email protected]>
    *To: *"Alex Olieman" <[email protected]>
    *Cc: *[email protected]
    *Sent: *Tuesday, June 2, 2015 11:13:44 AM
    *Subject: *Re: [Dbp-spotlight-users] XML input format

    Hi Alex,
    Thank you for sharing your experience.
    I did try to annotate raw XML files, but the number of entities
    annotated in the raw file differs considerably from the number
    annotated in the text-content version, so I am interested in the
    approach you followed. I will have a look at your code when it is
    available. Thank you for mentioning its release.

    Best,
    Pajolma

    ------------------------------------------------------------------------

        *From: *"Alex Olieman" <[email protected]>
        *To: *[email protected]
        *Cc: *"pajolma rupi" <[email protected]>
        *Sent: *Monday, June 1, 2015 2:33:47 PM
        *Subject: *Re: [Dbp-spotlight-users]  XML input format

        Hi Pajolma,

        Yes, I have been in a similar situation. I'm not sure if there
        is a more convenient solution (from the Java/Scala code), but
        I ended up parsing, annotating, and rewriting the XML. If you
        already intend to make annotations a part of your XML schema,
        neatly annotating each element with correct offsets is quite
        trivial.

        See the attached XML for an example of what my output looks
        like.  It includes annotations from multiple systems, so to
        check out only those generated by DBp Spotlight, just search
        the file for "Spotlight". The original XML source (without
        annotations; for comparison) can be found here:
        http://resolver.politicalmashup.nl/nl.proc.ob.d.h-tk-20042005-5970-5973.xml

        I'm currently cleaning the code I use to do this, and will
        release a (partly documented) version within two weeks. It's
        in Python, but may be useful as reference implementation if
        you'd like to do the same in Java.

        If this approach is too much work: have you tried just
        annotating your raw XML files, without removing any markup?
        I've done this before with HTML and XML and could get a pretty
        decent result by ignoring a few entities that correspond to
        common tag and attribute names.

        Cheers,
        Alex

        On 28-5-2015 13:41, Pajolma Rupi wrote:

            Dear all,

            I am interested in running Spotlight with an XML input
            file format with the objective of enriching the content
            with semantic information.
            From what I've experienced until now it seems like such
            format is not supported and that only a plain text format
            is supported. Am I correct? (I'm using the code here for
            processing text files:
            https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/eval/src/main/java/org/dbpedia/spotlight/evaluation/external/DBpediaSpotlightClient.java#L90)
            Has anybody run into such a problem already?

            I can of course get the text content out of the XML file
            (say it will produce a new plain text file) and pass this
            text content to Spotlight, but then:
            1- the offsets I would get from running Spotlight won't
            be the same as the offsets in the original XML file
            2- the enriching process will get more complicated due to
            the different offsets (XML file vs. plain text file)

            Thank you in advance,
            Pajolma

            */Pajolma RUPI/*

            Research and Development Engineer

            Service de l'e-Information Scientifique et Multimédia (SEISM)
            Research Centre INRIA Grenoble - Rhône-Alpes

            /655 Avenue de l'Europe/

            /38330 Montbonnot-Saint-Martin/

            /France/



            
------------------------------------------------------------------------------



            _______________________________________________
            Dbp-spotlight-users mailing list
            [email protected]
            https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users




    


