Re: [Dbp-spotlight-users] XML input format

Pajolma Rupi Tue, 07 Jul 2015 02:38:00 -0700

Hi Alex, 
Thank you for making the code available. 

My use case is the same as the one you might be working on (the general-purpose
one): I need to pass Spotlight the text found at a specific path in the XML
document.
I will keep an eye on how your code evolves.
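
A minimal sketch of that use case in Python, assuming the standard-library xml.etree.ElementTree module and its limited XPath subset. The function name text_at_path and the sample document are made up for illustration and do not come from Alex's code:

```python
import xml.etree.ElementTree as ET


def text_at_path(xml_source, path):
    """Return the text of every element matching `path`, child text included."""
    root = ET.fromstring(xml_source)
    # itertext() walks the element and all its descendants in document order,
    # so inline markup such as <b> does not cut the sentence apart.
    return ["".join(el.itertext()).strip() for el in root.findall(path)]


# Hypothetical sample document, not taken from the thread.
sample = (
    "<report>"
    "<person><firstName>Ada</firstName></person>"
    "<body><p>Hello <b>world</b>.</p></body>"
    "</report>"
)
texts = text_at_path(sample, "./body/p")
```

Each returned string could then be sent to Spotlight as plain text, one request per matched element.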


Thanks once more and good luck, 
Pajolma 

P.S. Sorry for the late reply; I was out of the office.

----- Original Message -----

> From: "Alex Olieman" <[email protected]>
> To: "Pajolma Rupi" <[email protected]>,
> [email protected]
> Sent: Monday, June 29, 2015 11:38:31 AM
> Subject: Re: [Dbp-spotlight-users] XML input format

> Hi Pajolma,

> The Python code that parses XML, lets DBp Spotlight annotate the contents of
> particular tags, and stores the resulting annotations in the XML can be
> found in dbp_spotlight_xml.py. I have made no particular effort to make
> this function usable for XML documents in general, but it is a pretty clean
> example of how XML (or HTML) processing can be done. In my document
> collection I'm only interested in annotating <p> tags, but you could easily
> enumerate several tags that are of interest to you, or select a top-level
> tag that contains the interesting text. The text content of all children
> of the selected tags is taken into account automatically, but the child tags
> themselves and their attributes are not.
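
The general shape of such a function might look like the sketch below. It is not the code in dbp_spotlight_xml.py: the <annotation> element, its attributes, the result dictionaries and the fake_annotate stand-in are all assumptions chosen for illustration, and a real version would call a Spotlight server instead.

```python
import xml.etree.ElementTree as ET


def annotate_tags(root, tags, annotate):
    """Annotate the text of every element whose tag is in `tags`.

    `annotate` is any callable taking a string and returning a list of dicts
    with "uri", "surface" and "offset" keys (a hypothetical result shape).
    """
    # Snapshot the elements first, because the loop adds new children.
    for el in list(root.iter()):
        if el.tag not in tags:
            continue
        # Text of the element and all its descendants; tags and attributes
        # are left out, so offsets are relative to this string.
        text = "".join(el.itertext())
        for hit in annotate(text):
            ET.SubElement(el, "annotation", {
                "uri": hit["uri"],
                "surface": hit["surface"],
                "offset": str(hit["offset"]),
            })
    return root


def fake_annotate(text):
    """Stand-in for a Spotlight call: it only recognises one word."""
    word = "Amsterdam"
    pos = text.find(word)
    if pos < 0:
        return []
    return [{"uri": "http://dbpedia.org/resource/Amsterdam",
             "surface": word, "offset": pos}]


doc = ET.fromstring(
    "<doc><p>Visit <i>Amsterdam</i> today.</p><note>Amsterdam</note></doc>"
)
annotate_tags(doc, {"p"}, fake_annotate)
result = ET.tostring(doc, encoding="unicode")
```

Passing a set such as {"p", "title"} enumerates several tags of interest, while passing the name of a top-level tag annotates everything below it in one go.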

> For a bit more context on how I use this code in my project, you might want
> to see pool_annotations.py. I hope this can serve as inspiration for your
> (Java) project.

> If any other DBp Spotlight users are facing similar issues with annotating
> XML (i.e. if this is a common use-case), please speak up. I might refactor
> my current code into a general-purpose tool with a few simple configuration
> options (e.g. to select XML/HTML tags of interest).

> Kind regards,
> Alex

> On 17-6-2015 14:25, Pajolma Rupi wrote:

> > Dear all,

> > I am reopening this discussion thread because, after some investigation,
> > I realized that, at least when using the "url" parameter with an XML file
> > as input, the service returns a fairly good result (judging from the few
> > examples I tested).

> > Here is an example of my call (the only difference is that I'm using my
> > local instance instead, i.e. http://localhost:2222/rest/ ...):
> > http://spotlight.dbpedia.org/rest/annotate/?url=http://raweb.inria.fr/rapportsactivite/RA2014/wimmics/wimmics.xml&confidence=0.3&support=100
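
A call like this one can be assembled in Python as sketched below. The parameter names (url, confidence, support) and the endpoint are the ones visible in the call; the function name build_annotate_url is made up, and the sketch only builds the URL without checking what the service accepts or returns.

```python
from urllib.parse import urlencode

# Public endpoint as it appears in the call quoted in this thread.
ENDPOINT = "http://spotlight.dbpedia.org/rest/annotate/"


def build_annotate_url(document_url, confidence=0.3, support=100,
                       endpoint=ENDPOINT):
    """Build a GET URL asking the service to fetch and annotate `document_url`."""
    # urlencode percent-encodes the nested URL so its own ':' and '/'
    # characters cannot be confused with the outer query string.
    query = urlencode({
        "url": document_url,
        "confidence": confidence,
        "support": support,
    })
    return endpoint + "?" + query


request_url = build_annotate_url(
    "http://raweb.inria.fr/rapportsactivite/RA2014/wimmics/wimmics.xml"
)
```

For a local instance, the endpoint argument would be replaced with the corresponding local address.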

> > Still, I can see that not all the XML nodes containing text were taken into
> > consideration, meaning that the output text returned by the service doesn't
> > include all the text content contained in the XML elements/attributes. From
> > a first check it seems like only the tags existing in HTML are taken into
> > consideration (<p> is taken into consideration but not <firstName>) but I'd
> > like to be sure about it.

> > Does anybody know the logic behind this processing? I would like to know
> > what I should expect as a result from Spotlight when it is run on an XML
> > file.

> > @Alex

> > Please let me know if your Python script dealing with the XML input format
> > is already available.

> > Thank you in advance,
> > Pajolma

> > ----- Original Message -----

> > > From: "Pajolma Rupi" <[email protected]>
> > > To: "Alex Olieman" <[email protected]>
> > > Cc: [email protected]
> > > Sent: Tuesday, June 2, 2015 11:13:44 AM
> > > Subject: Re: [Dbp-spotlight-users] XML input format

> > > Hi Alex,

> > > Thank you for sharing your experience.

> > > I did try to annotate raw XML files, but the number of entities annotated
> > > in the raw file differs considerably from the number annotated in the
> > > text-content version, so I might be interested in the approach you
> > > followed. I will have a look at your code when it is available. Thank
> > > you for mentioning its release.

> > > Best,
> > > Pajolma

> > > ----- Original Message -----

> > > > From: "Alex Olieman" <[email protected]>
> > > > To: [email protected]
> > > > Cc: "pajolma rupi" <[email protected]>
> > > > Sent: Monday, June 1, 2015 2:33:47 PM
> > > > Subject: Re: [Dbp-spotlight-users] XML input format

> > > > Hi Pajolma,

> > > > Yes, I have been in a similar situation. I'm not sure if there is a more
> > > > convenient solution (from the Java/Scala code), but I ended up parsing,
> > > > annotating, and rewriting the XML. If you already intend to make
> > > > annotations a part of your XML schema, neatly annotating each element
> > > > with correct offsets is quite trivial.

> > > > See the attached XML for an example of what my output looks like. It
> > > > includes annotations from multiple systems, so to check out only those
> > > > generated by DBp Spotlight, just search the file for "Spotlight". The
> > > > original XML source (without annotations; for comparison) can be found
> > > > here:
> > > > http://resolver.politicalmashup.nl/nl.proc.ob.d.h-tk-20042005-5970-5973.xml

> > > > I'm currently cleaning up the code I use to do this, and will release a
> > > > (partly documented) version within two weeks. It's in Python, but may be
> > > > useful as a reference implementation if you'd like to do the same in
> > > > Java.

> > > > If this approach is too much work: have you tried just annotating your
> > > > raw XML files, without removing any markup? I've done this before with
> > > > HTML and XML and could get a pretty decent result by ignoring a few
> > > > entities that correspond to common tag and attribute names.
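
One way to approximate that filtering step is sketched below. It is a variation on the idea, not Alex's method: it drops annotations by surface form, using the tag and attribute names found in the document itself, and it assumes each annotation is a dict with a "surface" key (a hypothetical shape).

```python
import xml.etree.ElementTree as ET


def markup_vocabulary(xml_source):
    """Collect the lower-cased tag and attribute names used in a document."""
    names = set()
    for el in ET.fromstring(xml_source).iter():
        # Namespaced names come back as "{uri}local" and would need extra
        # handling; this sketch assumes plain, unqualified names.
        names.add(el.tag.lower())
        names.update(attr.lower() for attr in el.attrib)
    return names


def drop_markup_hits(annotations, vocabulary):
    """Keep only annotations whose surface form is not a markup name."""
    return [a for a in annotations if a["surface"].lower() not in vocabulary]


# Hypothetical document and annotation list for illustration.
sample = ('<speech speaker="Jansen"><p class="intro">'
          'The minister visited Paris.</p></speech>')
vocabulary = markup_vocabulary(sample)
hits = [
    {"surface": "speech", "uri": "http://dbpedia.org/resource/Speech"},
    {"surface": "Paris", "uri": "http://dbpedia.org/resource/Paris"},
]
kept = drop_markup_hits(hits, vocabulary)
```

The trade-off is that a genuine mention of a word that also happens to be a tag name is discarded too, so a short hand-picked ignore list may work better than the full vocabulary.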

> > > > Cheers,
> > > > Alex

> > > > On 28-5-2015 13:41, Pajolma Rupi wrote:

> > > > > Dear all,

> > > > > I am interested in running Spotlight on XML input files, with the
> > > > > objective of enriching the content with semantic information.

> > > > > From what I've experienced so far, it seems that this format is not
> > > > > supported and that only plain text is supported. Am I correct? (I'm
> > > > > using the code here for processing text files:
> > > > > https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/eval/src/main/java/org/dbpedia/spotlight/evaluation/external/DBpediaSpotlightClient.java#L90
> > > > > )

> > > > > Has anybody run into such a problem already?

> > > > > I can of course extract the text content from the XML file (say, into
> > > > > a new plain text file) and pass this text content to Spotlight, but
> > > > > then:

> > > > > 1- the offsets I would get from running Spotlight won't be the same as
> > > > > the offsets in the original XML file
> > > > > 2- the enriching process will get more complicated due to the different
> > > > > offsets (XML file vs plain text file)
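
The two offset problems listed here can be eased by recording, while the markup is stripped, where each plain-text character came from. The sketch below assumes simple markup: the regular expression does not handle comments, CDATA sections or a ">" inside attribute values, and character entities such as &amp; are left undecoded. The function name is made up.

```python
import re

# Matches one tag; see the limitations named in the text above.
TAG = re.compile(r"<[^>]*>")


def strip_tags_with_offsets(xml_source):
    """Return (plain_text, offsets) with offsets[i] the index in
    xml_source of the character plain_text[i]."""
    chars, offsets = [], []
    pos = 0
    for match in TAG.finditer(xml_source):
        # Copy the text between the previous tag and this one.
        for i in range(pos, match.start()):
            chars.append(xml_source[i])
            offsets.append(i)
        pos = match.end()
    # Copy whatever follows the last tag.
    for i in range(pos, len(xml_source)):
        chars.append(xml_source[i])
        offsets.append(i)
    return "".join(chars), offsets


source = "<doc><p>Visit <i>Amsterdam</i> today.</p></doc>"
plain, offsets = strip_tags_with_offsets(source)

# An annotation offset reported against the plain text...
start = plain.find("Amsterdam")
# ...translated back to a position in the original XML.
xml_start = offsets[start]
```

With such a table, an offset that Spotlight reports for the plain text can be looked up directly to find where the mention starts in the XML file.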

> > > > > Thank you in advance,
> > > > > Pajolma

> > > > > Pajolma RUPI
> > > > > Research and Development Engineer
> > > > > Service de l'e-Information Scientifique et Multimédia (SEISM)
> > > > > Research Centre INRIA Grenoble - Rhône-Alpes
> > > > > 655 Avenue de l'Europe
> > > > > 38330 Montbonnot-Saint-Martin
> > > > > France

> > > > > ------------------------------------------------------------------------------
> > > > > _______________________________________________
> > > > > Dbp-spotlight-users mailing list
> > > > > [email protected]
> > > > > https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users



