Hi Alex,

Thank you for sharing your experience. I did try annotating the raw XML files, but the number of entities annotated in the raw file differs considerably from the number annotated in the text-content version, so I would be interested in the approach you followed. I will have a look at your code when it becomes available. Thank you for mentioning its release.

Best,
Pajolma

----- Original Message -----
> From: "Alex Olieman" <[email protected]>
> To: [email protected]
> Cc: "pajolma rupi" <[email protected]>
> Sent: Monday, June 1, 2015 2:33:47 PM
> Subject: Re: [Dbp-spotlight-users] XML input format
>
> Hi Pajolma,
>
> Yes, I have been in a similar situation. I'm not sure if there is a more
> convenient solution (from the Java/Scala code), but I ended up parsing,
> annotating, and rewriting the XML. If you already intend to make annotations
> a part of your XML schema, neatly annotating each element with correct
> offsets is quite trivial.
>
> See the attached XML for an example of what my output looks like. It includes
> annotations from multiple systems, so to check out only those generated by
> DBp Spotlight, just search the file for "Spotlight". The original XML source
> (without annotations; for comparison) can be found here:
> http://resolver.politicalmashup.nl/nl.proc.ob.d.h-tk-20042005-5970-5973.xml
>
> I'm currently cleaning the code I use to do this, and will release a (partly
> documented) version within two weeks. It's in Python, but may be useful as
> reference implementation if you'd like to do the same in Java.
>
> If this approach is too much work: have you tried just annotating your raw
> XML files, without removing any markup? I've done this before with HTML and
> XML and could get a pretty decent result by ignoring a few entities that
> correspond to common tag and attribute names.
>
> Cheers,
> Alex
>
> On 28-5-2015 13:41, Pajolma Rupi wrote:
> > Dear all,
>
> > I am interested in running Spotlight with an XML input file format with the
> > objective of enriching the content with semantic information.
>
> > From what I've experienced until now it seems like such format is not
> > supported and that only a plain text format is supported. Am I correct?
> > (I'm using the code here for processing text files:
> > https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/eval/src/main/java/org/dbpedia/spotlight/evaluation/external/DBpediaSpotlightClient.java#L90
> > )
>
> > Has anybody run into such a problem already?
>
> > I can of course get the text content out of the XML file (say it will
> > produce a new plain text file) and pass this text content to Spotlight but
> > then I would have that:
>
> > 1- the offset I would get from running the Spotlight won't be the same as
> > the offset in the original XML file
>
> > 2- the enriching process will get more complicated due to the different
> > offsets (XML file vs plain text file)
>
> > Thank you in advance,
>
> > Pajolma
>
> > Pajolma RUPI
> > Research and Development Engineer
> > Service de l'e-Information Scientifique et Multimédia (SEISM)
> > Research Centre INRIA Grenoble - Rhône-Alpes
> > 655 Avenue de l'Europe
> > 38330 Montbonnot-Saint-Martin
> > France
>
> > ------------------------------------------------------------------------------
> > _______________________________________________
> > Dbp-spotlight-users mailing list [email protected]
> > https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users
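
A note on the offset problem raised in the thread (offsets from the plain-text version no longer match the original XML): one way to handle it is to record, while stripping markup, where each kept character sat in the XML string. The following is only a minimal Python sketch of that idea and is not the code Alex refers to. The helper names `strip_markup` and `to_xml_span` are hypothetical, the regex-based tag removal assumes the input has no comments, CDATA sections, or entity references, and the annotation is assumed to arrive as a character offset plus a surface form.

```python
import re

# Assumption: every "<...>" run is markup; comments, CDATA and
# entity references (e.g. &amp;) are not handled in this sketch.
TAG = re.compile(r"<[^>]*>")


def strip_markup(xml):
    """Return (text, offsets), where offsets[i] is the index in xml of text[i]."""
    chars = []
    offsets = []
    pos = 0
    for match in TAG.finditer(xml):
        # Keep the text between the previous tag and this one.
        for i in range(pos, match.start()):
            chars.append(xml[i])
            offsets.append(i)
        pos = match.end()
    # Keep any text after the last tag.
    for i in range(pos, len(xml)):
        chars.append(xml[i])
        offsets.append(i)
    return "".join(chars), offsets


def to_xml_span(offsets, start, length):
    """Map a plain-text span to a (start, end) span in the XML; end is exclusive."""
    return offsets[start], offsets[start + length - 1] + 1


xml = "<p>Visit <b>Berlin</b> today.</p>"
text, offsets = strip_markup(xml)

# Hypothetical annotation: surface form "Berlin" at its plain-text offset.
surface = "Berlin"
start = text.index(surface)
xml_start, xml_end = to_xml_span(offsets, start, len(surface))
print(xml[xml_start:xml_end])  # prints "Berlin"
```

A mention that crosses an element boundary maps to an XML span that contains markup, so such cases would need separate handling before writing annotations back into the file.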
