Hi Alex, 
Thank you for sharing your experience. 
I did try annotating the raw XML files, but the number of entities annotated 
in the raw file differs considerably from the number in the text-content 
version, so I am interested in the approach you followed. I will have a look 
at your code when it becomes available. Thank you for mentioning its release. 
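
For reference, a minimal sketch of the offset bookkeeping discussed in this
thread: extract the text content while remembering where each text chunk
starts in the raw XML, then map an offset in the plain text back to the XML.
This is not the code Alex refers to. The function names are made up for
illustration, and the sketch assumes ASCII input, so that the byte offsets
reported by the parser equal character offsets.

```python
import bisect
import xml.parsers.expat


def extract_text_with_offsets(xml_bytes):
    """Return (plain_text, chunks).

    chunks is a list of (text_start, xml_byte_start) pairs, one per
    character-data event reported by the parser.
    """
    parser = xml.parsers.expat.ParserCreate()
    pieces = []
    chunks = []
    text_len = 0

    def on_text(data):
        nonlocal text_len
        # CurrentByteIndex is the byte position of the current event
        # in the parser input.
        chunks.append((text_len, parser.CurrentByteIndex))
        pieces.append(data)
        text_len += len(data)

    parser.CharacterDataHandler = on_text
    parser.Parse(xml_bytes, True)
    return "".join(pieces), chunks


def text_to_xml_offset(chunks, text_offset):
    """Map an offset in the plain text to an offset in the raw XML."""
    starts = [text_start for text_start, _ in chunks]
    i = bisect.bisect_right(starts, text_offset) - 1
    text_start, xml_start = chunks[i]
    return xml_start + (text_offset - text_start)


# Hypothetical input, not taken from the thread.
xml_bytes = b"<doc><p>Berlin is in <b>Germany</b>.</p></doc>"
text, chunks = extract_text_with_offsets(xml_bytes)
start = text_to_xml_offset(chunks, text.index("Germany"))
assert xml_bytes[start:start + len("Germany")] == b"Germany"
```

To map the end of an annotation, it is safer to map its last character and
add one, because an exclusive end offset that falls on a chunk boundary
would otherwise land after the following tag. Entity references such as
&amp;amp; are longer in the XML than in the text, so a span containing one
needs its start and end mapped separately.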

Best, 
Pajolma 

----- Original Message -----

> From: "Alex Olieman" <[email protected]>
> To: [email protected]
> Cc: "pajolma rupi" <[email protected]>
> Sent: Monday, June 1, 2015 2:33:47 PM
> Subject: Re: [Dbp-spotlight-users] XML input format

> Hi Pajolma,

> Yes, I have been in a similar situation. I'm not sure if there is a more
> convenient solution (from the Java/Scala code), but I ended up parsing,
> annotating, and rewriting the XML. If you already intend to make annotations
> a part of your XML schema, neatly annotating each element with correct
> offsets is quite trivial.

> See the attached XML for an example of what my output looks like. It includes
> annotations from multiple systems, so to check out only those generated by
> DBp Spotlight, just search the file for "Spotlight". The original XML source
> (without annotations; for comparison) can be found here:
> http://resolver.politicalmashup.nl/nl.proc.ob.d.h-tk-20042005-5970-5973.xml

> I'm currently cleaning up the code I use to do this, and will release a
> (partly documented) version within two weeks. It's in Python, but may be
> useful as a reference implementation if you'd like to do the same in Java.

> If this approach is too much work: have you tried just annotating your raw
> XML files, without removing any markup? I've done this before with HTML and
> XML and could get a pretty decent result by ignoring a few entities that
> correspond to common tag and attribute names.
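
The filtering idea above can be sketched roughly as follows. It collects the
tag and attribute names used in the XML and drops annotations whose surface
form matches one of them. The function names are hypothetical, and the
"@surfaceForm" key is an assumption about how the annotations are
represented (adapt it to whatever your client returns).

```python
import xml.etree.ElementTree as ET


def markup_names(xml_text):
    """Collect lower-cased tag and attribute names used in the document."""
    root = ET.fromstring(xml_text)
    names = set()
    for elem in root.iter():
        # Strip a "{namespace}" prefix if the tag has one.
        names.add(elem.tag.rsplit("}", 1)[-1].lower())
        names.update(attr.rsplit("}", 1)[-1].lower() for attr in elem.attrib)
    return names


def drop_markup_annotations(annotations, names):
    """Keep only annotations whose surface form is not a markup name."""
    return [a for a in annotations if a["@surfaceForm"].lower() not in names]


# Hypothetical document and annotations, for illustration only.
xml_text = (
    '<speech speaker="Smith"><title>Budget</title>'
    "<p>Speech about Amsterdam</p></speech>"
)
annotations = [
    {"@surfaceForm": "speech"},
    {"@surfaceForm": "Amsterdam"},
    {"@surfaceForm": "title"},
]
kept = drop_markup_annotations(annotations, markup_names(xml_text))
```

This is a blunt filter: it also drops genuine mentions in the text that
happen to share a name with a tag or attribute, so it only suits cases
where those names are unlikely to be entities of interest.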

> Cheers,
> Alex

> On 28-5-2015 13:41, Pajolma Rupi wrote:

> > Dear all,
> 

> > I am interested in running Spotlight with an XML input file format with the
> > objective of enriching the content with semantic information.
> 
> > From what I've seen so far, it seems that this format is not supported
> > and that only plain text is supported. Am I correct? (I'm using the code
> > here for processing text files:
> > https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/eval/src/main/java/org/dbpedia/spotlight/evaluation/external/DBpediaSpotlightClient.java#L90
> > )
> 
> > Has anybody run into such a problem already?
> 

> > I can of course extract the text content from the XML file (say, into a
> > new plain text file) and pass it to Spotlight, but then:
> 
> > 1- the offsets returned by Spotlight won't match the offsets in the
> > original XML file
> 
> > 2- the enrichment process will get more complicated because of the
> > differing offsets (XML file vs plain text file)
> 

> > Thank you in advance,
> 
> > Pajolma
> 

> > Pajolma RUPI
> > Research and Development Engineer
> > Service de l'e-Information Scientifique et Multimédia (SEISM)
> > Research Centre INRIA Grenoble - Rhône-Alpes
> > 655 Avenue de l'Europe
> > 38330 Montbonnot-Saint-Martin
> > France
> 

> > ------------------------------------------------------------------------------
> 

> > _______________________________________________
> 
> > Dbp-spotlight-users mailing list [email protected]
> > https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users
> 