Yeah, finally :) This workaround works :) Thanks, Omri
--- On Tue, 8/28/12, Omri Oren <[email protected]> wrote:
From: Omri Oren <[email protected]>
Subject: Re: [Dbpedia-discussion] Dbpedia 3.8 extraction. Problem
To: "Batica Dzonic" <[email protected]>
Cc: [email protected]
Date: Tuesday, August 28, 2012, 12:14 AM
Yes, that's the error I was talking about.
I fixed it with the following lines (marked) added to WikipediaDumpParser.java :
else if (isStartElement(CONTRIBUTOR_ELEM))
{
    // Check if this is an empty (deleted) contributor tag (i.e. <contributor deleted="deleted" /> )
    // which has no explicit </contributor> end element. If it is - skip it.
    String deleted = _reader.getAttributeValue(null, "deleted");
    if (deleted != null && deleted.equals("deleted"))
    {
        nextTag();
    }
    else
    {
        // now at <contributor>, move to next tag
        nextTag();
        // now should have ip / (author & id), when ip is present we don't have author / id
        // TODO Create a getElementName function to make this cleaner
        if (isStartElement(CONTRIBUTOR_IP))
        {
            contributorID = "0";
            contributorName = readString(CONTRIBUTOR_IP, false);
        }
        else
        {
            // usually we have contributor name first but we have to check
            if (isStartElement(CONTRIBUTOR_NAME))
            {
                contributorName = readString(CONTRIBUTOR_NAME, false);
                nextTag();
                if (isStartElement(CONTRIBUTOR_ID))
                    contributorID = readString(CONTRIBUTOR_ID, false);
            }
            else
            {
                // when contributor ID is first
                if (isStartElement(CONTRIBUTOR_ID))
                {
                    contributorID = readString(CONTRIBUTOR_ID, false);
                    nextTag();
                    if (isStartElement(CONTRIBUTOR_NAME))
                        contributorName = readString(CONTRIBUTOR_NAME, false);
                }
            }
        }
        nextTag();
        requireEndElement(CONTRIBUTOR_ELEM);
    }
}
I'm sure there's a more general / elegant way to identify an XML tag with no
matching closing tag (i.e. <contributor deleted="deleted" /> ) but this
solution worked for me with the new Wiki dumps and I couldn't find anything
better in the existing interface.
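On the question of a more general check: a StAX reader reports an empty element such as <contributor deleted="deleted" /> as a START_ELEMENT event directly followed by an END_ELEMENT event, so looking at the next tag is enough to tell whether the element has children, without relying on the "deleted" attribute. The following is only a standalone sketch of that idea using the plain javax.xml.stream API. The class and method names are made up for illustration and are not part of the DBpedia code base, and it assumes <contributor> never contains bare text (nextTag() would throw in that case).

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the DBpedia extraction framework.
class EmptyContributorSketch {

    /**
     * Returns true if the first <contributor> element in the given XML has no
     * child elements, e.g. <contributor deleted="deleted" />.
     */
    static boolean firstContributorIsEmpty(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT
                            && "contributor".equals(reader.getLocalName())) {
                        // An empty element is reported as START_ELEMENT followed
                        // immediately by END_ELEMENT, so a single nextTag() call
                        // shows whether any child element is present.
                        return reader.nextTag() == XMLStreamConstants.END_ELEMENT;
                    }
                }
                throw new IllegalArgumentException("no <contributor> element found");
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped so that callers of this sketch need no checked exception handling.
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String deleted = "<revision><contributor deleted=\"deleted\" /></revision>";
        String anonymous = "<revision><contributor><ip>127.0.0.1</ip></contributor></revision>";
        System.out.println(firstContributorIsEmpty(deleted));
        System.out.println(firstContributorIsEmpty(anonymous));
    }
}
```

In the parser itself the same test would presumably mean: call nextTag() once after <contributor>, and if the reader is already at the end element, skip the ip / name / id handling. I have not tried that variant against the real dumps.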
Also - I don't know if this should enter the official code repository, and if
it should, how to do that. Anyone from the team wants to help here?
Anyway, if you find these solutions have flaws, or have any insights about
working with DBpedia 3.8, please share :)
Omri
On Tue, Aug 28, 2012 at 2:03 AM, Batica Dzonic <[email protected]> wrote:
Thanks Omri
I tried replacing "export-0.6" with "export-0.7" everywhere in the code, but I got the
following error:
".............Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[11300680,16]
Message: expected </contributor>
    at org.dbpedia.util.text.xml.XMLStreamUtils.requireElement(XMLStreamUtils.java:115)
    at org.dbpedia.util.text.xml.XMLStreamUtils.requireEndElement(XMLStreamUtils.java:96)............"
I hope that bugfix will resolve the problems...
Thanks again
Batica Dzonic
--- On Mon, 8/27/12, Omri Oren <[email protected]> wrote:
From: Omri Oren <[email protected]>
Subject: Re: [Dbpedia-discussion] Dbpedia 3.8 extraction. Problem
To: "Batica Dzonic" <[email protected]>
Cc: [email protected]
Date: Monday, August 27, 2012, 3:08 PM
Hi Batica,
I had the same problem. I fixed it by replacing all places in the code that
contain the string "export-0.6" with "export-0.7", including data files, e.g.
*.xml
I don't know what changes were added to the schema in 0.7, but it seemed to
work with the extractors that I tried.
(Does anyone here who knows the code / schema better, know if there's anything
else to do to match version 0.7 of the schema?)
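An alternative to editing every "export-0.6" string would be to accept any versioned export namespace when checking the root element. The following is a minimal standalone sketch of such a check, using only javax.xml.stream. The class and method names are hypothetical and not taken from the DBpedia code, and it assumes that ignoring the schema version is acceptable, which is exactly the open question above.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the DBpedia extraction framework.
class ExportNamespaceSketch {

    // Common prefix of the versioned MediaWiki export namespaces seen in this thread.
    static final String EXPORT_NS_PREFIX = "http://www.mediawiki.org/xml/export-";

    /** Returns the namespace URI of the root element, or null if it has none. */
    static String rootNamespace(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                // Move from the start of the document to the root start element.
                reader.nextTag();
                return reader.getNamespaceURI();
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped so that callers of this sketch need no checked exception handling.
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Accepts export-0.6, export-0.7 and any other version under the same prefix. */
    static boolean isMediaWikiExport(String xml) {
        String ns = rootNamespace(xml);
        return ns != null && ns.startsWith(EXPORT_NS_PREFIX);
    }

    public static void main(String[] args) {
        String dump07 = "<mediawiki xmlns=\"http://www.mediawiki.org/xml/export-0.7/\"></mediawiki>";
        System.out.println(rootNamespace(dump07));
        System.out.println(isMediaWikiExport(dump07));
    }
}
```

The trade-off is that a prefix check no longer fails fast when a future schema version changes something the parser depends on, so it only hides the version mismatch and does not verify compatibility.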
In addition, there's a bugfix to do in the xml parser code (problem when the
wikipage's <contributor> is "deleted"), I'll send it when I get to the office
tomorrow.
Omri
On Aug 28, 2012 12:04 AM, "Batica Dzonic" <[email protected]> wrote:
Hello,
I'm working on a project that I need for my master's work. I use DBpedia as a
knowledge base. Since the data stored in the official DBpedia can become outdated, I
need to re-extract the Wikipedia articles. I tried to set up the DBpedia dump extractor,
but without success.
I tried the DBpedia 3.8 extraction framework, but it fails when processing the
latest Wikipedia dump (as far as I can see, the problem is the version of the
XML schema):
Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,249]
Message: expected <mediawiki> with namespace [http://www.mediawiki.org/xml/export-0.6/],
found [http://www.mediawiki.org/xml/export-0.7/]
Is there a solution to this problem? I appreciate any kind of help..
Sorry for my bad English
Batica Dzonic
------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion