Hi Pajolma,

Yes, I have been in a similar situation. I'm not sure whether there is a more convenient solution from the Java/Scala side, but I ended up parsing, annotating, and rewriting the XML. If you already intend to make annotations part of your XML schema, annotating each element with the correct offsets is fairly straightforward.
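
To illustrate the parse, annotate, rewrite idea, here is a minimal Python sketch. It is not the code mentioned in this thread, only an illustration under assumptions: fake_annotate is a hypothetical stand-in for a real Spotlight call, the <annotation> element and its attribute names are made up, and the schema is assumed to allow such child elements. Offsets are relative to the text of the enclosing element, so they stay valid however the surrounding markup changes.

```python
import xml.etree.ElementTree as ET


def fake_annotate(text):
    """Hypothetical stand-in for a Spotlight call.

    Returns (surface form, offset, URI) tuples, sorted by offset.
    """
    known = {"Berlin": "http://dbpedia.org/resource/Berlin"}
    hits = []
    for surface, uri in known.items():
        start = text.find(surface)
        while start != -1:
            hits.append((surface, start, uri))
            start = text.find(surface, start + 1)
    return sorted(hits, key=lambda hit: hit[1])


def annotate_elements(xml_string, annotate=fake_annotate):
    """Annotate the leading text of every element and return the new XML.

    Only elem.text is handled here; tail text (text that follows a child
    element) would need the same treatment in a fuller version.
    """
    root = ET.fromstring(xml_string)
    # Snapshot the element list first, because we add children while looping.
    for elem in list(root.iter()):
        text = elem.text
        if not text or not text.strip():
            continue
        for surface, offset, uri in annotate(text):
            # Hypothetical element and attribute names.
            ann = ET.SubElement(elem, "annotation")
            ann.set("surface", surface)
            ann.set("offset", str(offset))
            ann.set("uri", uri)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "<doc><p>Berlin is big. I like Berlin.</p><p>Nothing here.</p></doc>"
    print(annotate_elements(sample))
```

Because each element's text is sent to the annotator on its own, no global offset mapping between the XML file and a plain-text copy is needed.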

See this XML <https://www.dropbox.com/s/4uxl7zw6ffxnp88/nl.proc.ob.d.h-tk-20042005-5970-5973.xml?dl=0> for an example of what my output looks like. It includes annotations from multiple systems, so to see only those generated by DBpedia Spotlight, search the file for "Spotlight". The original XML source (without annotations, for comparison) can be found here: http://resolver.politicalmashup.nl/nl.proc.ob.d.h-tk-20042005-5970-5973.xml

I'm currently cleaning up the code I use to do this and will release a (partly documented) version within two weeks. It's in Python, but it may be useful as a reference implementation if you'd like to do the same in Java.

If this approach is too much work: have you tried just annotating your raw XML files, without removing any markup? I've done this before with HTML and XML and got pretty decent results by ignoring a few entities that correspond to common tag and attribute names.
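
A possible variation on the raw-XML route, sketched in Python: instead of a blocklist of entity names, drop every annotation whose character span overlaps a tag. This is a different technique from the one described above, and it rests on assumptions: annotations are (offset, surface form) pairs with offsets into the raw XML string, and the tag regex is deliberately naive (it does not handle comments or CDATA sections that contain ">").

```python
import re

# Naive tag matcher: anything from "<" up to the next ">".
TAG_RE = re.compile(r"<[^>]*>")


def outside_markup(annotations, raw_xml):
    """Keep only annotations that do not overlap any tag in raw_xml.

    annotations: iterable of (offset, surface form) pairs, where offset is
    a character offset into raw_xml (an assumption about the input shape).
    """
    tag_spans = [match.span() for match in TAG_RE.finditer(raw_xml)]
    kept = []
    for offset, surface in annotations:
        end = offset + len(surface)
        # Two half-open spans overlap when each starts before the other ends.
        if any(start < end and offset < stop for start, stop in tag_spans):
            continue
        kept.append((offset, surface))
    return kept


if __name__ == "__main__":
    raw = '<speech party="Labour">Labour won the vote.</speech>'
    in_attribute = raw.find("Labour")
    in_text = raw.find("Labour", in_attribute + 1)
    hits = [(in_attribute, "Labour"), (in_text, "Labour")]
    print(outside_markup(hits, raw))
```

The surviving offsets still point into the original XML file, so they can be used directly when writing the annotations back.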

Cheers,
Alex

On 28-5-2015 13:41, Pajolma Rupi wrote:
Dear all,

I am interested in running Spotlight on XML input files, with the objective of enriching the content with semantic information. From what I've experienced so far, it seems that this format is not supported and that only plain text is. Am I correct? (I'm using the code here for processing text files: https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/eval/src/main/java/org/dbpedia/spotlight/evaluation/external/DBpediaSpotlightClient.java#L90 )
Has anybody run into such a problem already?

I can of course extract the text content from the XML file (producing a new plain text file) and pass that text to Spotlight, but then:
1- the offsets returned by Spotlight won't match the offsets in the original XML file
2- the enrichment process becomes more complicated because of the differing offsets (XML file vs plain text file)

Thank you in advance,
Pajolma

Pajolma RUPI
Research and Development Engineer
Service de l'e-Information Scientifique et Multimédia (SEISM)
Research Centre INRIA Grenoble - Rhône-Alpes
655 Avenue de l'Europe
38330 Montbonnot-Saint-Martin
France



------------------------------------------------------------------------------


_______________________________________________
Dbp-spotlight-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users
