This bugfix has been committed, but note that for now you should use
the "dump" branch for dump-based extraction, as it is more stable.
Best,
Dimitris
On Wed, Aug 29, 2012 at 10:40 AM, Batica Dzonic <[email protected]> wrote:
> yeeeeeaa finally :) this work around works :)
> thanks Omri
>
>
> --- On *Tue, 8/28/12, Omri Oren <[email protected]>* wrote:
>
>
> From: Omri Oren <[email protected]>
> Subject: Re: [Dbpedia-discussion] Dbpedia 3.8 extraction. Problem
> To: "Batica Dzonic" <[email protected]>
> Cc: [email protected]
> Date: Tuesday, August 28, 2012, 12:14 AM
>
>
> Yes, that's the error I was talking about.
>
> I fixed it with the following lines (marked "// ADDED") added to
> WikipediaDumpParser.java:
>
> else if (isStartElement(CONTRIBUTOR_ELEM))
> {
>     // ADDED: Check if this is an empty (deleted) contributor tag,
>     // ADDED: i.e. <contributor deleted="deleted" />, which has no explicit
>     // ADDED: </contributor> end element. If it is, skip it.
>     String deleted = _reader.getAttributeValue(null, "deleted"); // ADDED
>     if (deleted != null && deleted.equals("deleted")) { // ADDED
>         nextTag(); // ADDED
>     } else { // ADDED
>         // now at <contributor>, move to next tag
>         nextTag();
>         // now should have ip / (author & id); when ip is present we don't have author / id
>         // TODO Create a getElementName function to make this cleaner
>         if (isStartElement(CONTRIBUTOR_IP)) {
>             contributorID = "0";
>             contributorName = readString(CONTRIBUTOR_IP, false);
>         }
>         else
>         {
>             // usually we have contributor name first but we have to check
>             if (isStartElement(CONTRIBUTOR_NAME))
>             {
>                 contributorName = readString(CONTRIBUTOR_NAME, false);
>                 nextTag();
>                 if (isStartElement(CONTRIBUTOR_ID))
>                     contributorID = readString(CONTRIBUTOR_ID, false);
>             }
>             else
>             {
>                 // when contributor ID is first
>                 if (isStartElement(CONTRIBUTOR_ID))
>                 {
>                     contributorID = readString(CONTRIBUTOR_ID, false);
>                     nextTag();
>                     if (isStartElement(CONTRIBUTOR_NAME))
>                         contributorName = readString(CONTRIBUTOR_NAME, false);
>                 }
>             }
>         }
>         nextTag();
>
>         requireEndElement(CONTRIBUTOR_ELEM);
>     } // ADDED
> }
>
> I'm sure there's a more general / elegant way to identify an XML tag with
> no matching closing tag (i.e. <contributor deleted="deleted" /> ) but this
> solution worked for me with the new Wiki dumps and I couldn't find anything
> better in the existing interface.
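
On the question of a more general way to handle a tag with no separate closing tag: with the standard javax.xml.stream API, a self-closing element is reported as a START_ELEMENT followed immediately by an END_ELEMENT, so a loop over the child elements falls through for an empty <contributor /> without looking at the "deleted" attribute at all. The sketch below is only an illustration under assumptions: the class and method names are hypothetical, it is not the code in WikipediaDumpParser.java, and it assumes the child elements are named username, ip and id as in the MediaWiki export format.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical stand-alone demo, not part of the DBpedia code base.
class EmptyContributorDemo {

    // Returns the contributor's username or IP, or null when the
    // <contributor> element has no such child (e.g. a deleted contributor).
    static String readContributor(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            reader.nextTag(); // now at <contributor>
            reader.require(XMLStreamConstants.START_ELEMENT, null, "contributor");

            String name = null;
            // For <contributor deleted="deleted" /> the first nextTag() already
            // returns END_ELEMENT, so the loop body never runs.
            while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
                String child = reader.getLocalName();
                // getElementText() leaves the reader at the child's end tag.
                String text = reader.getElementText();
                if (child.equals("username") || child.equals("ip")) {
                    name = text;
                }
            }
            // Both the empty and the populated form end up here.
            reader.require(XMLStreamConstants.END_ELEMENT, null, "contributor");
            reader.close();
            return name;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse contributor", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readContributor("<contributor deleted=\"deleted\" />"));
        System.out.println(readContributor(
                "<contributor><username>Alice</username><id>42</id></contributor>"));
    }
}
```

Whether this fits into the existing parser depends on how nextTag() and readString() are wrapped there, so treat it as a starting point rather than a drop-in replacement.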
>
> Also - I don't know if this should enter the official code repository, and
> if it should, how to do that. *Does anyone from the team want to help here?*
>
> Anyway, if you find these solutions have flaws, or have any insights about
> working with DBpedia 3.8, please share :)
>
> Omri
>
>
>
> On Tue, Aug 28, 2012 at 2:03 AM, Batica Dzonic <[email protected]> wrote:
>
> Thanks Omri
>
> I tried replacing every occurrence of "export-0.6" in the code with
> "export-0.7", but I got the following error:
>
> ".............
> Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[11300680,16]
> Message: expected </contributor>
>
> at org.dbpedia.util.text.xml.XMLStreamUtils.requireElement(XMLStreamUtils.java:115)
> at org.dbpedia.util.text.xml.XMLStreamUtils.requireEndElement(XMLStreamUtils.java:96)
> ............
> "
> I hope that bugfix will resolve the problems...
>
> Thanks again
> Batica Dzonic
>
> --- On *Mon, 8/27/12, Omri Oren <[email protected]>* wrote:
>
>
> From: Omri Oren <[email protected]>
> Subject: Re: [Dbpedia-discussion] Dbpedia 3.8 extraction. Problem
> To: "Batica Dzonic" <[email protected]>
> Cc: [email protected]
> Date: Monday, August 27, 2012, 3:08 PM
>
>
> Hi Batica,
>
> I had the same problem. I fixed it by replacing all places in the code
> that contain the string "export-0.6" with "export-0.7", including data
> files, e.g. *.xml
> I don't know what changes were added to the schema in 0.7, but it seemed
> to work with the extractors that I tried.
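
As an alternative to search-and-replace, the namespace check itself could be relaxed. The sketch below reads the namespace of the root element with the standard javax.xml.stream API and accepts any URI of the form http://www.mediawiki.org/xml/export-0.N/. The class and method names are hypothetical and this is not how the framework performs its check. Accepting a version is also not the same as being compatible with it, so the open question about what changed in 0.7 still applies.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical stand-alone demo, not part of the DBpedia code base.
class ExportNamespaceCheck {

    // Matches export-0.6, export-0.7, ... (assumption: only the minor version varies).
    private static final Pattern EXPORT_NS =
            Pattern.compile("http://www\\.mediawiki\\.org/xml/export-0\\.\\d+/");

    // Returns the namespace URI of the document's root element (may be null).
    static String rootNamespace(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            reader.nextTag(); // move to the root element, e.g. <mediawiki>
            String ns = reader.getNamespaceURI();
            reader.close();
            return ns;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not read root element", e);
        }
    }

    // True when the root element is in some MediaWiki export-0.N namespace.
    static boolean isSupportedExport(String xml) {
        String ns = rootNamespace(xml);
        return ns != null && EXPORT_NS.matcher(ns).matches();
    }

    public static void main(String[] args) {
        String dumpHead =
                "<mediawiki xmlns=\"http://www.mediawiki.org/xml/export-0.7/\"></mediawiki>";
        System.out.println(rootNamespace(dumpHead));
        System.out.println(isSupportedExport(dumpHead));
    }
}
```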
>
> (Does anyone here who knows the code / schema better, know if there's
> anything else to do to match version 0.7 of the schema?)
>
> In addition, there's a bugfix to do in the xml parser code (problem when
> the wikipage's <contributor> is "deleted"), I'll send it when I get to the
> office tomorrow.
>
> Omri
> On Aug 28, 2012 12:04 AM, "Batica Dzonic" <[email protected]> wrote:
>
> Hello,
> I'm working on a project that I need for my master's work. I use DBpedia
> as a knowledge base. Since the data stored in the official DBpedia can become
> outdated, I need to re-extract Wikipedia articles. I tried to set up the
> DBpedia dump extractor, but without success.
> I tried the DBpedia 3.8 extraction framework, but there is a problem with
> processing the latest Wikipedia dump (as far as I can see, the problem is
> the version of the XML schema):
>
> Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,249]
> Message: expected <mediawiki> with namespace [http://www.mediawiki.org/xml/export-0.6/], found [http://www.mediawiki.org/xml/export-0.7/]
>
> Is there a solution to this problem? I appreciate any kind of help..
> Sorry for my bad English
>
> Batica Dzonic
>
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Dbpedia-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>
>
>
>
--
Kontokostas Dimitris