Hi Amit,

sorry for weighing in late on the subject, but I have a suspicion.
The DBpedia parser that parses the MediaWiki markup is not great at parsing tables. The reason is that the focus so far has been on extracting information from infoboxes. The MappingExtractor class is the only one that attempts to use table mappings. If there are mis-parsed table structures, this could lead to infinite loops (see the PS below for the kind of thing I have in mind). All three pages that you mentioned contain tables, so there might be syntactical constructions that the parser can't cope with at the moment. If you are able to track down the bug, it would be tremendously helpful if you could fix it.

Best regards,
Max
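PS: To make the suspicion a bit more concrete, here is a rough, hypothetical sketch of the kind of loop I have in mind. This is not the actual TableMapping code (the names and the data structure are made up); it only illustrates how a cyclic "related class" structure coming out of a mis-parsed table could make a recursive type writer run forever while it keeps accumulating triples, which would fit the TableMapping.writeType / Graph.merge frames in the traces further down.

    // Hypothetical sketch only -- not the actual DBpedia code.
    object WriteTypeSketch {
      // class name -> classes it points to; a mis-parsed mapping could
      // effectively produce a cycle such as A -> B -> A
      val related: Map[String, List[String]] =
        Map("A" -> List("B"), "B" -> List("A"))

      // naive version: no cycle check, so on a cyclic graph the recursion
      // never returns and the accumulated triple list keeps growing
      def writeType(cls: String, acc: List[String]): List[String] = {
        val triples = ("<row> rdf:type " + cls) :: acc
        related.getOrElse(cls, Nil).foldLeft(triples)((g, c) => writeType(c, g))
      }

      // guarded version: remember which classes have already been written
      def writeTypeSafe(cls: String, seen: Set[String]): List[String] =
        if (seen(cls)) Nil
        else ("<row> rdf:type " + cls) ::
          related.getOrElse(cls, Nil).flatMap(writeTypeSafe(_, seen + cls))

      def main(args: Array[String]): Unit =
        println(writeTypeSafe("A", Set.empty)) // terminates with two triples
    }

If you want to confirm where the memory goes before digging into the parser, you could also add -XX:+HeapDumpOnOutOfMemoryError next to the -Xmx value in the Extract launcher's jvmArgs (see Pablo's pom.xml snippet further down) and run just one of the three pages; the resulting heap dump should make the culprit visible.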
On Wed, Dec 14, 2011 at 08:40, Amit Kumar <[email protected]> wrote:

> Hi Pablo,
> I have narrowed down the memory issue that I had been facing. After going unsuccessfully through the whole enwiki dump, I ran the DEF on a smaller dump. I picked
> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles27.xml-p029625017p033928886.bz2.
> After multiple runs and experiments, we found that there are three pages in the dump where the DEF more or less gets stuck and the heap overshoots any limit you give it. The three pages are:
>
> http://en.wikipedia.org/wiki/Mikoyan-Gurevich_MiG-21_variants
> http://en.wikipedia.org/wiki/List_of_fastest_production_motorcycles
> http://en.wikipedia.org/wiki/Chevrolet_small-block_engine_table
>
> If you skip these three pages (by skipping them in dump/.../ExtractionJob.scala) the framework runs successfully. On further research I found that it is only the MappingExtractor that is causing the problem. Once you remove it from the config.properties file, everything works.
>
> So from what we know, among the roughly 1.5 million pages in the smaller dump, the MappingExtractor fails on these three pages, taking the whole JVM with it. I'm attaching three XML files (one wiki page each). Out of these the DEF would only run on India.xml; for the other two it keeps failing unless you remove the MappingExtractor. There is something about these three pages that is not normal (and there are probably more like them in the complete Wikipedia dump). From the source it looks like the MappingExtractor extracts data from infoboxes, and interestingly none of the three pages has an infobox. Could this be a reason?
>
> Can someone please look into this? I'm wondering how you were able to generate the 3.7 DBpedia dump. Did you skip the MappingExtractor, or were the problems in these pages introduced after the 3.7 run? If it is the latter, we would need to fix this, as it would definitely fail during the next release.
>
> Thanks and regards,
> Amit
>
>
> On 12/1/11 4:47 PM, "Amit X Kumar" <[email protected]> wrote:
>
> Hi Pablo,
> I figured this out just after sending my email. I'm experimenting with some values right now. I'll let you know if I get it to work. In the meantime, if someone already has working values, it would be a big help.
>
> Plus, do you know anyone running the DEF on Hadoop?
>
> Thanks
> Amit
>
> On 12/1/11 4:39 PM, "Pablo Mendes" <[email protected]> wrote:
>
> Hi Amit,
>
> > "I tried giving JVM options such as -Xmx to the 'mvn scala:run' command, but it seems that the mvn command spawns another process and fails to pass the flags on to the new one. If someone has been able to run the framework, could you please share the details."
>
> The easiest way to get it working is probably to change the value in dump/pom.xml here:
>
>     <launcher>
>       <id>Extract</id>
>       <mainClass>org.dbpedia.extraction.dump.Extract</mainClass>
>       <jvmArgs>
>         <jvmArg>-Xmx1024m</jvmArg>
>       </jvmArgs>
>     </launcher>
>
> Cheers,
> Pablo
>
> On Thu, Dec 1, 2011 at 8:01 AM, Amit Kumar <[email protected]> wrote:
>
> Hi Pablo,
> Thanks for your valuable input. I got the MediaWiki thing working and am able to run the abstract extractor as well.
>
> The extraction framework works well for a small sample dataset, e.g.
> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2,
> which has around 6300 entries. But when I try to run the framework on the full Wikipedia data (en, around 33 GB uncompressed) I get Java heap space errors.
>
> --------------------------------------
> Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
>         at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
>         at java.lang.StringBuilder.<init>(StringBuilder.java:80)
>         at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:43)
>         at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:48)
>         at org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:48)
>         at org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:34)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:652)
>         at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:333)
>         at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:41)
>         at scala.collection.IterableViewLike$$anon$3.foreach(IterableViewLike.scala:80)
>         at org.dbpedia.extraction.dump.Extract$ExtractionThread.run(Extract.scala:34)
> Exception in thread "Thread-6" java.lang.OutOfMemoryError: Java heap space
>         at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
>         at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
>         at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>         at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>         at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>         at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
>         at scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
>         at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
>         at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:35)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>         at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
>         at scala.Option.foreach(Option.scala:198)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
>         at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
>         at org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
>         at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>         at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
>         at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:194)
>
> There are several instances of GC overhead limit errors also:
>
> Exception in thread "Thread-3" java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
>         at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
>         at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>         at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>         at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>         at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
>         at scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
>         at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>         at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>         at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
>         at scala.Option.foreach(Option.scala:198)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
>         at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
>         at org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
>         at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>         at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
> Nov 18, 2011 7:58:14 AM org.dbpedia.extraction.dump.ExtractionJob run
> SEVERE: Error reading pages. Shutting down...
> java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOfRange(Arrays.java:3209)
>         at java.lang.String.<init>(String.java:215)
>         at java.lang.StringBuffer.toString(StringBuffer.java:585)
>         at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.getElementText(XMLStreamReaderImpl.java:859)
>         at org.dbpedia.extraction.sources.WikipediaDumpParser.readRevision(WikipediaDumpParser.java:241)
>         at org.dbpedia.extraction.sources.WikipediaDumpParser.readPage(WikipediaDumpParser.java:203)
>         at org.dbpedia.extraction.sources.WikipediaDumpParser.readPages(WikipediaDumpParser.java:159)
>         at org.dbpedia.extraction.sources.WikipediaDumpParser.readDump(WikipediaDumpParser.java:107)
>         at org.dbpedia.extraction.sources.WikipediaDumpParser.run(WikipediaDumpParser.java:87)
>         at org.dbpedia.extraction.sources.XMLSource$XMLFileSource.foreach(XMLSource.scala:40)
>         at org.dbpedia.extraction.dump.ExtractionJob.run(ExtractionJob.scala:54)
> Nov 18, 2011 7:58:14 AM org.dbpedia.extraction.dump.ExtractionJob run
>
> I'm trying to run this on both 32-bit and 64-bit machines (dev boxes), but to no avail. I'm guessing the default JVM settings are too low for the DEF. It would be great if someone could tell me the minimum memory requirement for the extraction framework. I tried giving JVM options such as -Xmx to the 'mvn scala:run' command, but it seems that the mvn command spawns another process and fails to pass the flags on to the new one. If someone has been able to run the framework, could you please share the details?
>
> Also, we are looking into running the framework on Hadoop. Has anyone tried that yet? If yes, could you share your experience, and say whether it is really possible to run this on Hadoop without many changes and hacks?
>
> Thanks
> Amit
>
>
> On 11/23/11 2:42 PM, "Pablo Mendes" <[email protected]> wrote:
>
> Hi Amit,
> Thanks for your interest in DBpedia. Most of my effort has gone into DBpedia Spotlight, but I can try to help with the DBpedia Extraction Framework as well. Maybe the core developers can chip in if I misrepresent something.
>
> > 1) [more docs]
>
> I am unaware of any.
>
> > 2) [typo in config]
>
> Seems ok.
>
> > 3) ... Am I right? Does the framework work on any particular dump of Wikipedia? Also, what goes in the commons branch?
>
> Yes. As far as I can tell, you're right. But there is no particular dump. You just need to follow the convention for the directory structure. The commons directory has a similar structure, see:
>
> wikipediaDump/commons/20110729/commonswiki-20110729-pages-articles.xml
>
> I think this file is only used by the image extractor and maybe a couple of others. Maybe it should only be mandatory if the corresponding extractors are included in the config. But it's likely nobody got around to implementing that catch yet.
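A quick side note from me on the directory layout discussed just above: on disk it comes down to a tree roughly like the one below. The dates and the English dump name are only examples following the <wiki>-<YYYYMMDD>-pages-articles.xml naming mentioned in this thread, not prescribed values.

    wikipediaDump/
        commons/
            20110729/
                commonswiki-20110729-pages-articles.xml
        en/
            20111111/
                enwiki-20111111-pages-articles.xml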
> > 4) It seems the AbstractExtractor requires an instance of MediaWiki running to parse MediaWiki syntax. ... Can someone shed some more light on this? What customization is required? Where can I get one?
>
> The abstract extractor is used to render inline templates, as many articles start with automatically generated content from templates. See:
> http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia/file/945c24bdc54c/abstractExtraction
>
> > Also another question: Is there a reason for the delay in subsequent DBpedia releases? I was wondering, if the code is already there, why does it take 6 months between DBpedia releases? Is there a manual editorial step involved, or is it due to development/changes in the framework code which are collated in every release?
>
> One reason might be that a lot of the value in DBpedia comes from the manually generated "homogenization" in mappings.dbpedia.org. That, plus getting a stable version of the framework tested and run, would probably explain the choice of periodicity.
>
> Best,
> Pablo
>
> On Tue, Nov 22, 2011 at 12:03 PM, Amit Kumar <[email protected]> wrote:
>
> Hey everyone,
> I'm trying to set up the DBpedia extraction framework because I'm interested in getting structured data from already downloaded Wikipedia dumps. As per my understanding I need to work in the 'dump' directory of the codebase. I have tried to reverse engineer it (Scala is new for me), but I need some help.
>
> First of all, is there more detailed documentation somewhere about setting up and running the pipeline? The documentation available on dbpedia.org seems insufficient.
>
> I understand that I first need to create a config.properties file where I set up the input/output locations, the list of extractors and the languages. I tried working with the config.properties.default given in the code. There seems to be a typo in the extractor list: 'org.dbpedia.extraction.mappings.InterLanguageLinksExtractorExtractor' gives a 'class not found' error. I changed it to 'org.dbpedia.extraction.mappings.InterLanguageLinksExtractor'. Is that ok?
>
> I can't find documentation on how to set up the input directory. Can someone tell me the details? From what I gather, the input directory should contain a 'commons' directory plus a directory for every language set in config.properties. All these directories must have a subdirectory whose name is in YYYYMMDD format, and within that you save the XML files, such as enwiki-20111111-pages-articles.xml. Am I right? Does the framework work on any particular dump of Wikipedia? Also, what goes in the commons branch?
>
> I ran the framework by copying a sample dump, http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2, into both the en and commons branches, unzipping it and renaming it as required. For now I'm working with the en language only. It works with the default 19 extractors but starts failing if I include the AbstractExtractor. It seems the AbstractExtractor requires an instance of MediaWiki running to parse MediaWiki syntax. From the file itself: "DBpedia-customized MediaWiki instance is required." Can someone shed some more light on this? What customization is required? Where can I get one?
>
> Sorry if the questions are too basic and already mentioned somewhere.
> I have tried looking but couldn't find it myself.
>
> Also another question: Is there a reason for the delay in subsequent DBpedia releases? I was wondering, if the code is already there, why does it take 6 months between DBpedia releases? Is there a manual editorial step involved, or is it due to development/changes in the framework code which are collated in every release?
>
> Thanks and regards,
>
> Amit
> Tech Lead
> Cloud and Platform Group
> Yahoo!
