Yes, that's the error I was talking about.

I fixed it by adding the following lines (marked with *) to
WikipediaDumpParser.java:

      else if (isStartElement(CONTRIBUTOR_ELEM))
      {
*        // Check if this is an empty (deleted) contributor tag, i.e.*
*        // <contributor deleted="deleted" />, which has no explicit*
*        // </contributor> end element. If it is, skip it.*
*        String deleted = _reader.getAttributeValue(null, "deleted");*
*        if (deleted != null && deleted.equals("deleted")) {*
*          nextTag();*
*        } else {*
            // now at <contributor>, move to next tag
            nextTag();
            // we should now have ip or (name & id); when ip is present
            // we don't have name / id
            // TODO Create a getElementName function to make this cleaner
            if (isStartElement(CONTRIBUTOR_IP)) {
              contributorID = "0";
              contributorName = readString(CONTRIBUTOR_IP, false);
            }
            else
            {
              // usually the contributor name comes first, but we have to check
              if (isStartElement(CONTRIBUTOR_NAME))
              {
                contributorName = readString(CONTRIBUTOR_NAME, false);
                nextTag();
                if (isStartElement(CONTRIBUTOR_ID))
                  contributorID = readString(CONTRIBUTOR_ID, false);
              }
              else
              {
                // when contributor ID is first
                if (isStartElement(CONTRIBUTOR_ID))
                {
                  contributorID = readString(CONTRIBUTOR_ID, false);
                  nextTag();
                  if (isStartElement(CONTRIBUTOR_NAME))
                    contributorName = readString(CONTRIBUTOR_NAME, false);
                }
              }
            }
            nextTag();

            requireEndElement(CONTRIBUTOR_ELEM);
*        }*
      }

I'm sure there's a more general / elegant way to identify a self-closing
XML tag (e.g. <contributor deleted="deleted" />), but this solution worked
for me with the new wiki dumps, and I couldn't find anything better in the
existing interface.
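
For reference, here's the kind of generic check I had in mind, as a rough
sketch only (isEmptyElement is a name I made up; it's not part of
XMLStreamUtils or the parser's existing interface):

    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamReader;

    // Sketch of a generic check: StAX reports <contributor deleted="deleted" />
    // as a START_ELEMENT immediately followed by the matching END_ELEMENT,
    // so an element is empty exactly when the next tag event is its end
    // element. nextTag() skips the whitespace between tags but throws on
    // non-whitespace text, so this only works for elements whose content
    // (if any) is child elements, like <contributor>.
    static boolean isEmptyElement(XMLStreamReader reader) throws XMLStreamException
    {
      // assumes the reader is currently positioned on a START_ELEMENT;
      // note that the call advances the reader by one tag event
      return reader.nextTag() == XMLStreamConstants.END_ELEMENT;
    }

Because the check advances the reader, the surrounding parsing logic would
need a small reshuffle to continue from the right event, which is why I
went with the attribute check above.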

Also, I don't know whether this fix should go into the official code
repository, and if it should, how to do that. Would anyone from the team
like to help here?

Anyway, if you find flaws in these solutions, or have any insights about
working with DBpedia 3.8, please share :)

Omri



On Tue, Aug 28, 2012 at 2:03 AM, Batica Dzonic <[email protected]> wrote:

> Thanks Omri
>
> I tried replacing all the places in the code with "export-0.7", but I got
> the following error:
>
> .............
> Caused by: javax.xml.stream.XMLStreamException: ParseError at
> [row,col]:[11300680,16]
> Message: expected </contributor>
>
> at org.dbpedia.util.text.xml.XMLStreamUtils.requireElement(XMLStreamUtils.java:115)
> at org.dbpedia.util.text.xml.XMLStreamUtils.requireEndElement(XMLStreamUtils.java:96)
> .............
> I hope that bugfix will resolve the problems...
>
> Thanks again
> Batica Dzonic
>
> --- On Mon, 8/27/12, Omri Oren <[email protected]> wrote:
>
>
> From: Omri Oren <[email protected]>
> Subject: Re: [Dbpedia-discussion] Dbpedia 3.8 extraction. Problem
> To: "Batica Dzonic" <[email protected]>
> Cc: [email protected]
> Date: Monday, August 27, 2012, 3:08 PM
>
>
> Hi Batica,
>
> I had the same problem. I fixed it by replacing all the places in the code
> that contain the string "export-0.6" with "export-0.7", including data
> files (e.g. *.xml).
> I don't know what changes were added to the schema in 0.7, but it seemed
> to work with the extractors that I tried.
>
> (Does anyone here who knows the code / schema better know if there's
> anything else to do to match version 0.7 of the schema?)
>
> In addition, there's a bugfix to do in the xml parser code (problem when
> the wikipage's <contributor> is "deleted"), I'll send it when I get to the
> office tomorrow.
>
> Omri
> On Aug 28, 2012 12:04 AM, "Batica Dzonic" <[email protected]> wrote:
>
> Hello,
> I'm working on a project that I need for my master's work. I use DBpedia
> as a knowledge base. Since the data stored in the official DBpedia can
> become outdated, I need to re-extract Wikipedia articles. I tried to set
> up the DBpedia dump extractor, but without success.
> I tried the DBpedia 3.8 extraction framework, but there is a problem
> processing the latest Wikipedia dump (as far as I can see, the problem is
> with the version of the XML schema):
>
> Caused by: javax.xml.stream.XMLStreamException: ParseError at
> [row,col]:[1,249]
> Message: expected <mediawiki> with namespace
> [http://www.mediawiki.org/xml/export-0.6/], found
> [http://www.mediawiki.org/xml/export-0.7/]
>
> Is there a solution to this problem? I would appreciate any kind of help.
> Sorry for my bad English.
>
> Batica Dzonic