Re: [Dbpedia-discussion] Dbpedia 3.8 extraction. Problem

Batica Dzonic Wed, 29 Aug 2012 00:41:55 -0700

yeeeeeaa finally :) this work around works :) thanks Omri

--- On Tue, 8/28/12, Omri Oren <[email protected]> wrote:

From: Omri Oren <[email protected]>
Subject: Re: [Dbpedia-discussion] Dbpedia 3.8 extraction. Problem
To: "Batica Dzonic" <[email protected]>
Cc: [email protected]
Date: Tuesday, August 28, 2012, 12:14 AM

Yes, that's the error I was talking about.
I fixed it with the following lines (marked) added to WikipediaDumpParser.java :

      else if (isStartElement(CONTRIBUTOR_ELEM))
      {
        // Check if this is an empty (deleted) contributor tag (i.e. <contributor deleted="deleted" /> )
        // which has no explicit </contributor> end element. If it is - skip it.
        String deleted = _reader.getAttributeValue(null, "deleted");
        if (deleted != null && deleted.equals("deleted")) {
          nextTag();
        } else {
          // now at <contributor>, move to next tag
          nextTag();
          // now should have ip / (author & id), when ip is present we don't have author / id
          // TODO Create a getElementName function to make this cleaner
          if (isStartElement(CONTRIBUTOR_IP)) {
            contributorID = "0";
            contributorName = readString(CONTRIBUTOR_IP, false);
          }
          else
          {
            // usually we have contributor name first but we have to check
            if (isStartElement(CONTRIBUTOR_NAME))
            {
              contributorName = readString(CONTRIBUTOR_NAME, false);
              nextTag();
              if (isStartElement(CONTRIBUTOR_ID))
                contributorID = readString(CONTRIBUTOR_ID, false);
            }
            else
            {
              // when contributor ID is first
              if (isStartElement(CONTRIBUTOR_ID))
              {
                contributorID = readString(CONTRIBUTOR_ID, false);
                nextTag();
                if (isStartElement(CONTRIBUTOR_NAME))
                  contributorName = readString(CONTRIBUTOR_NAME, false);
              }
            }
          }
          nextTag();
          requireEndElement(CONTRIBUTOR_ELEM);
        }
      }
I'm sure there's a more general / elegant way to identify an XML tag with no 
matching closing tag (i.e. <contributor deleted="deleted" /> ) but this 
solution worked for me with the new Wiki dumps and I couldn't find anything 
better in the existing interface.
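
A more general way to handle this, sketched here under assumptions rather than taken from the DBpedia code base: a StAX reader reports a self-closing element such as <contributor deleted="deleted" /> as a START_ELEMENT immediately followed by an END_ELEMENT. Looping over child elements until the end tag is reached therefore covers the empty case without inspecting the deleted attribute. The class EmptyContributorDemo, its methods, and the literal element names "username" and "ip" are illustrative assumptions; the real parser uses its own constants (CONTRIBUTOR_NAME, CONTRIBUTOR_IP) and helpers.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of the DBpedia extraction framework.
final class EmptyContributorDemo {

    /**
     * Reads one <contributor> element and returns the user name or IP it
     * holds, or null when the element has no such child (e.g. it is empty).
     * The reader must be on the <contributor> start tag; on return it is on
     * the matching end tag.
     */
    static String readContributor(XMLStreamReader reader) throws XMLStreamException {
        reader.require(XMLStreamConstants.START_ELEMENT, null, "contributor");
        String name = null;
        // nextTag() skips whitespace and stops at the next start or end tag.
        // For an empty element it returns END_ELEMENT right away, so the
        // loop body never runs.
        while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
            String child = reader.getLocalName();
            // getElementText() leaves the reader on the child's end tag.
            String text = reader.getElementText();
            // Element names assumed from the MediaWiki export format.
            if (child.equals("username") || child.equals("ip")) {
                name = text;
            }
        }
        reader.require(XMLStreamConstants.END_ELEMENT, null, "contributor");
        return name;
    }

    /** Parses a document whose root element is <contributor>. */
    static String parse(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            reader.nextTag(); // move from START_DOCUMENT to the root element
            return readContributor(reader);
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        // Empty (deleted) contributor: prints null
        System.out.println(parse("<contributor deleted=\"deleted\" />"));
        // Regular contributor: prints Alice
        System.out.println(parse(
                "<contributor><username>Alice</username><id>7</id></contributor>"));
    }
}
```

The loop does not care whether the name or the id comes first, which is the same ordering problem the nested if/else blocks above deal with.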

Also - I don't know if this should enter the official code repository, and if 
it should, how to do that. Anyone from the team wants to help here?
Anyway, if you find these solutions have flaws, or have any insights about 
working with DBpedia 3.8, please share :)

Omri

On Tue, Aug 28, 2012 at 2:03 AM, Batica Dzonic <[email protected]> wrote:

Thanks Omri

I tried replacing all places in the code with "export-0.7", but I got the
following error:

.............Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[11300680,16]
Message: expected </contributor>
        at org.dbpedia.util.text.xml.XMLStreamUtils.requireElement(XMLStreamUtils.java:115)
        at org.dbpedia.util.text.xml.XMLStreamUtils.requireEndElement(XMLStreamUtils.java:96)
............

I hope that bugfix will resolve the problem.
Thanks again,
Batica Dzonic
--- On Mon, 8/27/12, Omri Oren <[email protected]> wrote:

From: Omri Oren <[email protected]>
Subject: Re: [Dbpedia-discussion] Dbpedia 3.8 extraction. Problem

To: "Batica Dzonic" <[email protected]>
Cc: [email protected]

Date: Monday, August 27, 2012, 3:08 PM

Hi Batica,
I had the same problem. I fixed it by replacing all places in the code that 
contain the string "export-0.6" with "export-0.7", including data files, e.g. 
*.xml

I don't know what changes were added to the schema in 0.7, but it seemed to 
work with the extractors that I tried.
(Does anyone here who knows the code / schema better, know if there's anything 
else to do to match version 0.7 of the schema?)
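
As an alternative to search-and-replace, the namespace check could accept any export version instead of one hard-coded string. The following is only a sketch: ExportNamespace and its methods are hypothetical names, not part of the DBpedia framework, and accepting every version assumes the schema changes do not affect the elements the parser reads, which nobody in this thread has confirmed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the DBpedia extraction framework.
final class ExportNamespace {

    // Matches e.g. http://www.mediawiki.org/xml/export-0.7/ and captures "0.7".
    private static final Pattern EXPORT_NS =
            Pattern.compile("http://www\\.mediawiki\\.org/xml/export-(\\d+\\.\\d+)/");

    /**
     * Returns the schema version found in the namespace URI,
     * or null if the URI is not a MediaWiki export namespace.
     */
    static String version(String namespaceUri) {
        if (namespaceUri == null) {
            return null;
        }
        Matcher m = EXPORT_NS.matcher(namespaceUri);
        return m.matches() ? m.group(1) : null;
    }

    /** Returns true for a MediaWiki export namespace of any version. */
    static boolean isExportNamespace(String namespaceUri) {
        return version(namespaceUri) != null;
    }

    public static void main(String[] args) {
        // prints 0.6
        System.out.println(version("http://www.mediawiki.org/xml/export-0.6/"));
        // prints 0.7
        System.out.println(version("http://www.mediawiki.org/xml/export-0.7/"));
        // prints false
        System.out.println(isExportNamespace("http://example.org/other/"));
    }
}
```

A parser using such a check could also log the detected version, so that a dump with a newer schema is noticed rather than silently accepted.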
In addition, there's a bugfix to do in the xml parser code (problem when the 
wikipage's <contributor> is "deleted"), I'll send it when I get to the office 
tomorrow.
Omri

On Aug 28, 2012 12:04 AM, "Batica Dzonic" <[email protected]> wrote:

Hello,

I'm working on a project that I need for my master's work. I use DBpedia as a
knowledge base. Because the data stored in the official DBpedia can become
outdated, I need to re-extract Wikipedia articles. I tried to set up the DBpedia
dump extractor, but without success.

I tried the DBpedia 3.8 extraction framework, but it fails when processing the
latest Wikipedia dump (as far as I can see, the problem is the version of the
XML schema):

Caused by:
 javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,249]
Message: expected <mediawiki> with namespace [http://www.mediawiki.org/xml/export-0.6/],
found [http://www.mediawiki.org/xml/export-0.7/]

Is there a solution to this problem? I would appreciate any kind of help.

Sorry for my bad English

Batica Dzonic

_______________________________________________

Dbpedia-discussion mailing list

[email protected]

https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
