Hi Pajolma,
Yes, I have been in a similar situation. I'm not sure if there is a more
convenient solution (from the Java/Scala code), but I ended up parsing,
annotating, and rewriting the XML. If you already intend to make
annotations a part of your XML schema, neatly annotating each element
with correct offsets is quite trivial.
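
To illustrate the offset bookkeeping, here is a minimal sketch in Python. It is not the code I use; the function names (strip_tags_with_offsets, to_xml_span) and the sample XML are made up for this example. It assumes tags can be matched with a simple regular expression, and it does not handle entity references such as &amp;, CDATA sections, or comments containing ">". The idea is to remember, for every character of the extracted plain text, where it came from in the XML, so that offsets returned by an annotator can be mapped back.

```python
import re

# Naive tag matcher: an assumption that holds for simple, well-formed markup only.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_with_offsets(xml):
    """Return (plain_text, offset_map).

    offset_map[i] is the index in `xml` of the character plain_text[i].
    """
    chars = []
    offset_map = []
    pos = 0
    for match in TAG_RE.finditer(xml):
        # Copy the text between the previous tag and this one.
        for i in range(pos, match.start()):
            chars.append(xml[i])
            offset_map.append(i)
        pos = match.end()
    # Copy any text after the last tag.
    for i in range(pos, len(xml)):
        chars.append(xml[i])
        offset_map.append(i)
    return "".join(chars), offset_map


def to_xml_span(offset_map, start, length):
    """Map a (start, length) span in the plain text to (start, end) in the XML."""
    xml_start = offset_map[start]
    xml_end = offset_map[start + length - 1] + 1
    return xml_start, xml_end


xml = "<p>Visit <b>Berlin</b> today</p>"
text, offset_map = strip_tags_with_offsets(xml)
# Pretend an annotator reported the surface form "Berlin" at its plain-text offset.
start = text.index("Berlin")
xml_start, xml_end = to_xml_span(offset_map, start, len("Berlin"))
```

With the span in XML coordinates you can insert or attach the annotation in the original document. A real implementation would more likely walk the parsed tree than use a regular expression.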
See this XML
<https://www.dropbox.com/s/4uxl7zw6ffxnp88/nl.proc.ob.d.h-tk-20042005-5970-5973.xml?dl=0>
for an example of what my output looks like. It includes annotations
from multiple systems, so to see only those generated by DBpedia
Spotlight, just search the file for "Spotlight". The original XML source
(without annotations; for comparison) can be found here:
http://resolver.politicalmashup.nl/nl.proc.ob.d.h-tk-20042005-5970-5973.xml
I'm currently cleaning up the code I use to do this, and will release a
(partly documented) version within two weeks. It's in Python, but may be
useful as a reference implementation if you'd like to do the same in Java.
If this approach is too much work: have you tried just annotating your
raw XML files, without removing any markup? I've done this before with
HTML and XML and could get a pretty decent result by ignoring a few
entities that correspond to common tag and attribute names.
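
A rough sketch of that filtering step, again in Python and again only illustrative: the function name, the (offset, surface form) tuple layout, and the ignore list are assumptions for this example, not Spotlight's output format. It drops annotations whose surface form is on an ignore list, and, as an extra variant, any annotation that overlaps a tag in the raw XML.

```python
import re

# Naive tag matcher: assumes no ">" inside attribute values or comments.
TAG_RE = re.compile(r"<[^>]+>")


def filter_markup_annotations(xml, annotations, ignored_forms=frozenset()):
    """Keep only annotations that are not on the ignore list and do not overlap markup.

    `annotations` is a list of (offset, surface_form) tuples, with offsets
    relative to the raw XML string. `ignored_forms` holds lowercase surface
    forms to discard (for example common tag or attribute names).
    """
    tag_spans = [m.span() for m in TAG_RE.finditer(xml)]
    kept = []
    for offset, surface in annotations:
        end = offset + len(surface)
        if surface.lower() in ignored_forms:
            continue
        # Standard interval-overlap test against every tag span.
        if any(offset < t_end and t_start < end for t_start, t_end in tag_spans):
            continue
        kept.append((offset, surface))
    return kept


xml = '<speech party="VVD">Den Haag is mooi</speech>'
annotations = [(1, "speech"), (20, "Den Haag")]
kept = filter_markup_annotations(xml, annotations)
```

Because the annotator ran on the raw XML, the offsets of the remaining annotations already refer to the original file and need no remapping.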
Cheers,
Alex
On 28-5-2015 13:41, Pajolma Rupi wrote:
Dear all,
I am interested in running Spotlight with an XML input file format
with the objective of enriching the content with semantic information.
From what I've seen so far, this format does not seem to be supported;
only plain text is. Am I correct? (I'm using the code here for
processing text files:
https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/eval/src/main/java/org/dbpedia/spotlight/evaluation/external/DBpediaSpotlightClient.java#L90 )
Has anybody run into such a problem already?
I can of course extract the text content from the XML file (producing
a new plain text file) and pass that to Spotlight, but then I would
run into two problems:
1- the offsets returned by Spotlight won't match the offsets in the
original XML file
2- the enrichment process gets more complicated because of the
different offsets (XML file vs plain text file)
Thank you in advance,
Pajolma
Pajolma RUPI
Research and Development Engineer
Service de l'e-Information Scientifique et Multimédia (SEISM)
Research Centre INRIA Grenoble - Rhône-Alpes
655 Avenue de l'Europe
38330 Montbonnot-Saint-Martin
France
------------------------------------------------------------------------------
_______________________________________________
Dbp-spotlight-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users