Hi Pajolma,
The Python code that parses XML, lets DBp Spotlight annotate the
contents of particular tags, and stores the resulting annotations in the
XML can be found in dbp_spotlight_xml.py
<https://bitbucket.org/aolieman/pm_el_tools/src/5472985f061ea646a82d974dba69e412e0b534b9/pm_el_tools/evaluation/dbp_spotlight_xml.py?at=master>.
I have made no particular effort to make this function usable for XML
documents in general, but it is a pretty clean example of how XML (or
HTML) processing can be done. In my document collection I'm only
interested in annotating <p> tags, but you could easily enumerate
several tags that are of interest to you, or select a top-level tag that
contains the interesting text. The text content of all children of
the selected tags is taken into account automatically, but the child
tags themselves and their attributes are not.
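To illustrate the general idea, here is a minimal sketch using only the
standard library. It is not the code from dbp_spotlight_xml.py: the
function names, the <annotations>/<annotation> output elements, and the
fake_annotate stand-in for the Spotlight call are all assumptions made
for this example.

```python
import xml.etree.ElementTree as ET


def element_text(elem):
    """Concatenate the text of elem and all its descendants, dropping markup."""
    return "".join(elem.itertext())


def annotate_tags(root, tags, annotate):
    """Annotate the text of every element whose tag is in `tags`.

    `annotate` is a hypothetical callable: text -> list of dicts with
    'uri', 'surface_form' and 'offset' keys (offset relative to the text).
    """
    # Collect the targets first, because the tree is modified below.
    targets = [e for e in root.iter() if e.tag in tags]
    for elem in targets:
        results = annotate(element_text(elem))
        if not results:
            continue
        # The added elements carry attributes only, so the text content
        # of the annotated element stays the same.
        container = ET.SubElement(elem, "annotations")
        for ann in results:
            ET.SubElement(container, "annotation", {
                "uri": ann["uri"],
                "surface-form": ann["surface_form"],
                "offset": str(ann["offset"]),
            })
    return root


def fake_annotate(text):
    """Stand-in for a DBp Spotlight call; it only finds the word 'Berlin'."""
    pos = text.find("Berlin")
    if pos < 0:
        return []
    return [{"uri": "http://dbpedia.org/resource/Berlin",
             "surface_form": "Berlin", "offset": pos}]


doc = ET.fromstring(
    "<doc><title>Berlin</title>"
    "<p>She moved to <b>Berlin</b> in 2010.</p></doc>"
)
annotate_tags(doc, {"p"}, fake_annotate)
print(ET.tostring(doc, encoding="unicode"))
```

Only the <p> element is annotated here; the <title> is left alone, and
the <b> child contributes its text but not its markup. Offsets are
relative to the text of the annotated element, not to the XML file.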
For a bit more context on how I use this code in my project, you might
want to see pool_annotations.py
<https://bitbucket.org/aolieman/pm_el_tools/src/5472985f061ea646a82d974dba69e412e0b534b9/pm_el_tools/evaluation/pool_annotations.py?at=master>.
I hope this can serve as inspiration for your (Java) project.
If any other DBp Spotlight users are facing similar issues with
annotating XML (i.e. if this is a common use-case), please speak up. I
might refactor my current code into a general-purpose tool with a few
simple configuration options (e.g. to select XML/HTML tags of interest).
Kind regards,
Alex
On 17-6-2015 14:25, Pajolma Rupi wrote:
Dear all,
I am reopening this discussion thread because, after some
investigation, I realized that when I pass an XML file via the "url"
parameter, the service returns a reasonably good result (at least for
the few examples I used for testing).
Here is an example of my call (the only difference is that I'm using
my local instance instead, i.e. http://localhost:2222/rest/...):
http://spotlight.dbpedia.org/rest/annotate/?url=http://raweb.inria.fr/rapportsactivite/RA2014/wimmics/wimmics.xml&confidence=0.3&support=100
Still, I can see that not all the XML nodes containing text were taken
into consideration, meaning that the output text returned by the
service doesn't include all the text content contained in the XML
elements/attributes. From a first check it seems like only the tags
existing in HTML are taken into consideration (<p> is taken into
consideration but not <firstName>) but I'd like to be sure about it.
Does anybody know the logic behind this processing? I would like to
know what I should expect as a result from Spotlight when it is run
with an XML file.
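For reference, the request above can be assembled programmatically.
This is only a sketch: it builds the URL with the "url", "confidence"
and "support" parameters from the example call and does not send
anything. The function name build_annotate_url is made up for this
example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_annotate_url(endpoint, document_url, confidence=0.3, support=100):
    """Return an annotate URL that asks the service to fetch document_url."""
    query = urlencode({
        "url": document_url,  # percent-encoded by urlencode
        "confidence": confidence,
        "support": support,
    })
    return f"{endpoint}?{query}"


request_url = build_annotate_url(
    "http://spotlight.dbpedia.org/rest/annotate/",
    "http://raweb.inria.fr/rapportsactivite/RA2014/wimmics/wimmics.xml",
)

# Round trip: decoding the query string gives back the original values.
decoded = parse_qs(urlsplit(request_url).query)
print(request_url)
print(decoded)
```

Encoding the document URL avoids ambiguity if it ever contains its own
query string.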
@Alex
Please let me know if your Python script dealing with the XML input
format is already available.
Thank you in advance,
Pajolma
------------------------------------------------------------------------
*From: *"Pajolma Rupi" <[email protected]>
*To: *"Alex Olieman" <[email protected]>
*Cc: *[email protected]
*Sent: *Tuesday, June 2, 2015 11:13:44 AM
*Subject: *Re: [Dbp-spotlight-users] XML input format
Hi Alex,
Thank you for sharing your experience.
I did try to annotate raw XML files but there is a considerable
difference regarding the number of entities annotated in this raw
file with respect to the text content version so I might be
interested in the approach you followed. I will have a look at
your code once it is available. Thank you for mentioning its
release.
Best,
Pajolma
------------------------------------------------------------------------
*From: *"Alex Olieman" <[email protected]>
*To: *[email protected]
*Cc: *"pajolma rupi" <[email protected]>
*Sent: *Monday, June 1, 2015 2:33:47 PM
*Subject: *Re: [Dbp-spotlight-users] XML input format
Hi Pajolma,
Yes, I have been in a similar situation. I'm not sure if there
is a more convenient solution (from the Java/Scala code), but
I ended up parsing, annotating, and rewriting the XML. If you
already intend to make annotations a part of your XML schema,
neatly annotating each element with correct offsets is quite
trivial.
See the attached XML for an example of what my output looks
like. It includes annotations from multiple systems, so to
check out only those generated by DBp Spotlight, just search
the file for "Spotlight". The original XML source (without
annotations; for comparison) can be found here:
http://resolver.politicalmashup.nl/nl.proc.ob.d.h-tk-20042005-5970-5973.xml
I'm currently cleaning the code I use to do this, and will
release a (partly documented) version within two weeks. It's
in Python, but may be useful as reference implementation if
you'd like to do the same in Java.
If this approach is too much work: have you tried just
annotating your raw XML files, without removing any markup?
I've done this before with HTML and XML and could get a pretty
decent result by ignoring a few entities that correspond to
common tag and attribute names.
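A rough sketch of that filtering step, under some assumptions: the
annotations are represented as dicts with a 'surface_form' key (a
made-up structure for this example), and the names to ignore are
collected from the document itself rather than chosen by hand.

```python
import xml.etree.ElementTree as ET


def local_name(name):
    """Strip a '{namespace}' prefix as used by ElementTree, if present."""
    return name.rsplit("}", 1)[-1]


def markup_vocabulary(xml_source):
    """Collect the lower-cased tag and attribute names used in the XML."""
    names = set()
    for elem in ET.fromstring(xml_source).iter():
        names.add(local_name(elem.tag).lower())
        names.update(local_name(attr).lower() for attr in elem.attrib)
    return names


def drop_markup_hits(annotations, vocabulary):
    """Keep only annotations whose surface form is not a markup name."""
    return [a for a in annotations
            if a["surface_form"].lower() not in vocabulary]


xml_source = (
    '<speech speaker="Smith"><title lang="en">Budget</title>'
    "<p>The debate took place in Amsterdam.</p></speech>"
)
# Hypothetical annotations, as if the raw XML had been annotated.
annotations = [
    {"surface_form": "speech", "uri": "http://dbpedia.org/resource/Speech"},
    {"surface_form": "Amsterdam", "uri": "http://dbpedia.org/resource/Amsterdam"},
]
kept = drop_markup_hits(annotations, markup_vocabulary(xml_source))
print([a["surface_form"] for a in kept])
```

Note that this also drops a genuine mention in the text if it happens
to coincide with a tag or attribute name, so it is a trade-off.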
Cheers,
Alex
On 28-5-2015 13:41, Pajolma Rupi wrote:
Dear all,
I am interested in running Spotlight with an XML input
file format with the objective of enriching the content
with semantic information.
From what I've seen so far, it seems that this format is not
supported and that only plain text is supported. Am I correct?
(I'm using the code here for
processing text files:
https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/eval/src/main/java/org/dbpedia/spotlight/evaluation/external/DBpediaSpotlightClient.java#L90
)
Has anybody run into such a problem already?
I can of course extract the text content from the XML file
(say, into a new plain text file) and pass this text
content to Spotlight, but then:
1- the offsets returned by Spotlight won't match the
offsets in the original XML file
2- the enrichment process gets more complicated because of
the differing offsets (XML file vs plain text file)
Thank you in advance,
Pajolma
*/Pajolma RUPI/*
Research and Development Engineer
Service de l'e-Information Scientifique et Multimédia (SEISM)
Research Centre INRIA Grenoble - Rhône-Alpes
/655 Avenue de l'Europe/
/38330 Montbonnot-Saint-Martin/
/France/
------------------------------------------------------------------------------
_______________________________________________
Dbp-spotlight-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users