Hi Amit,
I don't know the minimal heap configuration for the DEF. I snooped around
Max's machine and found 1024M in his pom.xml. If he changed it, the new value
is somewhere I couldn't find.
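
For what it's worth, that value sits in the <launcher> section of dump/pom.xml
(quoted further down in this thread). A sketch of what raising it might look
like; the 4g figure below is only a guess on my part, not a tested minimum:

<launcher>
  <id>Extract</id>
  <mainClass>org.dbpedia.extraction.dump.Extract</mainClass>
  <jvmArgs>
    <!-- guess for the full English dump; Max's pom.xml had -Xmx1024m -->
    <jvmArg>-Xmx4g</jvmArg>
  </jvmArgs>
</launcher>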
Last summer I started concocting a Hadoop run of the framework, but had to
switch my attention elsewhere, and haven't had time to go back since.
I do not know of anybody who has done it.
Best,
Pablo
On Thu, Dec 1, 2011 at 12:17 PM, Amit Kumar <[email protected]> wrote:
> Hi Pablo,
> I figured this out just after sending my email. I'm experimenting with
> some values right now. I'll let you know if I get it to work. In the
> meantime, if someone already has working values, it would be a big
> help.
>
> Also, do you know of anyone running the DEF on Hadoop?
>
> Thanks
> Amit
>
> On 12/1/11 4:39 PM, "Pablo Mendes" <[email protected]> wrote:
>
> Hi Amit,
>
> > "I tried giving JVM options such as -Xmx to the 'mvn scala:run' command,
> but it seems that the mvn command spawns another process and fails to pass
> on the flags to the new one. If someone has been able to run the framework,
> could you please share the details with me."
>
> The easiest way to get it working is probably to change the value in the
> dump/pom.xml here:
>
> <launcher>
>   <id>Extract</id>
>   <mainClass>org.dbpedia.extraction.dump.Extract</mainClass>
>   <jvmArgs>
>     <jvmArg>-Xmx1024m</jvmArg>
>   </jvmArgs>
> </launcher>
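>
> After editing, the extraction would be started from the dump module with
> 'mvn scala:run' as before (if I remember correctly, the Extract launcher can
> also be selected explicitly with -Dlauncher=Extract); the jvmArgs above
> should then reach the process that mvn spawns for the extraction.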
>
>
> Cheers,
> Pablo
>
> On Thu, Dec 1, 2011 at 8:01 AM, Amit Kumar <[email protected]> wrote:
>
>
> Hi Pablo,
> Thanks for your valuable input. I got the MediaWiki thing working and am
> able to run the abstract extractor as well.
>
> The extraction framework works well for a small sample dataset, e.g.
> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
> which has around 6300 entries. But when I try to run the framework on the full
> Wikipedia data (en, around 33 GB uncompressed), I get Java heap space errors.
>
> --------------------------------------
> Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
>         at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
>         at java.lang.StringBuilder.<init>(StringBuilder.java:80)
>         at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:43)
>         at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:48)
>         at org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:48)
>         at org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:34)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:652)
>         at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:333)
>         at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:41)
>         at scala.collection.IterableViewLike$$anon$3.foreach(IterableViewLike.scala:80)
>         at org.dbpedia.extraction.dump.Extract$ExtractionThread.run(Extract.scala:34)
> Exception in thread "Thread-6" java.lang.OutOfMemoryError: Java heap space
>         at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
>         at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
>         at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>         at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>         at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>         at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
>         at scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
>         at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
>         at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:35)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>         at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
>         at scala.Option.foreach(Option.scala:198)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
>         at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
>         at org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
>         at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>         at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
>         at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:194)
>
>
> There are also several instances of GC overhead limit errors:
> Exception in thread "Thread-3" java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
>         at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
>         at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>         at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>         at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>         at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
>         at scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
>         at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>         at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>         at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
>         at scala.Option.foreach(Option.scala:198)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
>         at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
>         at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
>         at org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
>         at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>         at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
> Nov 18, 2011 7:58:14 AM org.dbpedia.extraction.dump.ExtractionJob run
> SEVERE: Error reading pages. Shutting down...
> java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOfRange(Arrays.java:3209)
>         at java.lang.String.<init>(String.java:215)
>         at java.lang.StringBuffer.toString(StringBuffer.java:585)
>         at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.getElementText(XMLStreamReaderImpl.java:859)
>         at org.dbpedia.extraction.sources.WikipediaDumpParser.readRevision(WikipediaDumpParser.java:241)
>         at org.dbpedia.extraction.sources.WikipediaDumpParser.readPage(WikipediaDumpParser.java:203)
>         at org.dbpedia.extraction.sources.WikipediaDumpParser.readPages(WikipediaDumpParser.java:159)
>         at org.dbpedia.extraction.sources.WikipediaDumpParser.readDump(WikipediaDumpParser.java:107)
>         at org.dbpedia.extraction.sources.WikipediaDumpParser.run(WikipediaDumpParser.java:87)
>         at org.dbpedia.extraction.sources.XMLSource$XMLFileSource.foreach(XMLSource.scala:40)
>         at org.dbpedia.extraction.dump.ExtractionJob.run(ExtractionJob.scala:54)
> Nov 18, 2011 7:58:14 AM org.dbpedia.extraction.dump.ExtractionJob run
>
>
> I'm trying to run this on both a 32-bit and a 64-bit machine (dev boxes),
> but to no avail. I'm guessing the default JVM configuration is too low for
> the DEF.
> It would be great if someone could tell me the minimum memory requirement for
> the extraction framework. I tried giving JVM options such as -Xmx to the 'mvn
> scala:run' command, but it seems that the mvn command spawns another
> process and fails to pass on the flags to the new one. If someone has
> been able to run the framework, could you please share the details with me.
>
> Also, we are looking into running the framework over Hadoop. Has anyone
> tried that yet? If yes, could you share your experience, and whether it is
> really possible to run this on Hadoop without many changes and hacks?
>
> Thanks
> Amit
>
> On 11/23/11 2:42 PM, "Pablo Mendes" <[email protected]> wrote:
>
>
> Hi Amit,
> Thanks for your interest in DBpedia. Most of my effort has gone into
> DBpedia Spotlight, but I can try to help with the DBpedia Extraction
> Framework as well. Maybe the core developers can chip in if I misrepresent
> anything.
>
> 1) [more docs]
>
>
> I am not aware of any.
>
>
> 2) [typo in config]
>
>
> Seems ok.
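>
> For reference, the corrected entry would simply drop the duplicated
> "Extractor" suffix. In config.properties it would read something like the
> line below (I'm assuming the list property is called "extractors", as I
> recall from config.properties.default; the other extractors are elided):
>
>   extractors=...,org.dbpedia.extraction.mappings.InterLanguageLinksExtractor,...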
>
>
> 3) ... Am I right? Does the framework work on any particular dump of
> Wikipedia? Also, what goes in the commons branch?
>
>
> Yes. As far as I can tell, you're right. But there is no particular dump.
> You just need to follow the convention for the directory structure. The
> commons directory has a similar structure, see:
>
> wikipediaDump/commons/20110729/commonswiki-20110729-pages-articles.xml
>
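> A fuller sketch of the layout I would expect, with the dates and file names
> taken from the examples in this thread:
>
>   wikipediaDump/
>     commons/20110729/commonswiki-20110729-pages-articles.xml
>     en/20111111/enwiki-20111111-pages-articles.xml
>
> i.e. one directory per configured language plus 'commons', each with a
> YYYYMMDD subdirectory holding the uncompressed pages-articles XML dump.
>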
> I think this file is only used by the image extractor and maybe a couple
> of others. Maybe it should only be mandatory if the corresponding
> extractors are included in the config, but it's likely nobody has got
> around to implementing that check yet.
>
>
> 4) It seems the AbstractExtractor requires an instance of MediaWiki
> running to parse MediaWiki syntax. ... Can someone shed some more light on
> this? What customization is required? Where can I get one?
>
>
> The abstract extractor is used to render inline templates, as many
> articles start with automatically generated content from templates. See:
>
> http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia/file/945c24bdc54c/abstractExtraction
>
>
>
> Also, another question: is there a reason for the delay between subsequent
> DBpedia releases? I was wondering, if the code is already there, why does
> it take six months between DBpedia releases? Is there a manual editorial
> step involved, or is it due to development/changes in the framework code
> which are collated in every release?
>
>
> One reason might be that a lot of the value in DBpedia comes from manually
> generated "homogenization" in mappings.dbpedia.org. That, plus
> getting a stable version of the framework tested and run, would probably
> explain the choice of periodicity.
>
>
> Best,
> Pablo
>
>
> On Tue, Nov 22, 2011 at 12:03 PM, Amit Kumar <[email protected]> wrote:
>
>
> Hey everyone,
> I'm trying to set up the DBpedia extraction framework, as I'm interested in
> getting structured data from already downloaded Wikipedia dumps. As per my
> understanding, I need to work in the 'dump' directory of the codebase. I
> have tried to reverse engineer it (given that Scala is new to me), but I
> need some help.
>
>
> 1. First of all, is there more detailed documentation somewhere about
> setting up and running the pipeline? The one available on dbpedia.org
> seems insufficient.
> 2. I understand that I need to create a config.properties file first,
> where I set up the input/output locations, the list of extractors and the
> languages. I tried working with the config.properties.default given in the
> code. There seems to be a typo in the extractor list:
> 'org.dbpedia.extraction.mappings.InterLanguageLinksExtractorExtractor'
> gives a 'class not found' error. I changed it to
> 'org.dbpedia.extraction.mappings.InterLanguageLinksExtractor'. Is that ok?
> 3. I can't find the documentation on how to set up the input directory.
> Can someone tell me the details? From what I gather, the input directory
> should contain a 'commons' directory plus a directory for each language set
> in config.properties. All these directories must have a subdirectory whose
> name is in YYYYMMDD format. Within that you save the XML files, such
> as enwiki-20111111-pages-articles.xml. Am I right? Does the framework work
> on any particular dump of Wikipedia? Also, what goes in the commons branch?
> 4. I ran the framework by copying a sample dump,
> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2,
> into both the en and commons branches, unzipping it and renaming it as
> required. For now I'm working with the en language only. It works with the
> default 19 extractors but starts failing if I include the
> AbstractExtractor. It seems the AbstractExtractor requires an
> instance of MediaWiki running to parse MediaWiki syntax. From the file
> itself, "DBpedia-customized MediaWiki instance is required." Can
> someone shed some more light on this? What customization is required?
> Where can I get one?
>
>
>
> Sorry if the questions are too basic and already answered somewhere. I
> have tried looking but couldn't find anything myself.
> Also, another question: is there a reason for the delay between subsequent
> DBpedia releases? I was wondering, if the code is already there, why does
> it take six months between DBpedia releases? Is there a manual editorial
> step involved, or is it due to development/changes in the framework code
> which are collated in every release?
>
>
> Thanks and regards,
>
> Amit
> Tech Lead
> Cloud and Platform Group
> Yahoo!
>
>
>
------------------------------------------------------------------------------
All the data continuously generated in your IT infrastructure
contains a definitive record of customers, application performance,
security threats, fraudulent activity, and more. Splunk takes this
data and makes sense of it. IT sense. And common sense.
http://p.sf.net/sfu/splunk-novd2d
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion