Tommy, Amit,
> Turns out it's hardcoded in the pom.xml of the dump directory.
Well, since the pom.xml is a configuration file, I'd call that
"configurable through the pom.xml" rather than "hardcoded". Or maybe we
*should* call it hardcoded so that the maven-scala-plugin guys get alarmed.
:) I too had problems getting Maven and the Scala plugin to pick up my -Xmx
parameter. The potential solutions I found were MAVEN_OPTS (used by the Maven
process itself) and JAVA_OPTS (used by the scala launcher, see
http://www.scala-lang.org/docu/files/tools/scala.html). Neither really worked
for me, and apparently that is a common problem with Maven plugins that fork a
child JVM:
http://stackoverflow.com/questions/8231910/will-child-jvm-inherit-max-heap-size-and-perm-gen-size-when-forked
The only thing that worked for me with the maven-scala-plugin was its own
jvmArgs parameter:
http://scala-tools.org/mvnsites/maven-scala-plugin/run-mojo.html
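In case it's useful, this is roughly the shape it takes in the pom.xml of the
dump module. Treat it as a sketch rather than the exact file: the heap size is
just an example value, and the precise placement of <jvmArgs> (top-level
configuration vs. inside a launcher) may differ in your checkout, so
double-check against the run-mojo page above.

  <plugin>
    <groupId>org.scala-tools</groupId>
    <artifactId>maven-scala-plugin</artifactId>
    <configuration>
      <jvmArgs>
        <!-- raise the heap of the forked JVM that scala:run spawns -->
        <jvmArg>-Xmx4g</jvmArg>
      </jvmArgs>
    </configuration>
  </plugin>
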
> I could run the framework multiple times, once for each partial file, but
> then the outputs would be in different folders for each file.
Would this be a problem? They are all NT files, so you can just use "cat"
to put them back together in a simple bash script, no?
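For example (a minimal sketch; the paths and file names below are made up, use
whatever each run actually wrote):

  # concatenate the N-Triples output of each partial run into one file
  cat run1/output/labels_en.nt run2/output/labels_en.nt > labels_en.nt
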
> Or work on running the framework in a way that it picks up all the files in
> a folder and also collates the outputs in a single place
Maybe you should file a feature request for this.
http://sourceforge.net/tracker/?group_id=190976&atid=935523
Best,
Pablo
On Fri, Dec 2, 2011 at 5:51 AM, Amit Kumar <[email protected]> wrote:
> Hi Tommy,
> I knew about MAVEN_OPTS. I tried that, but as I mentioned, the flags are not
> being passed on to the child process being spawned. Turns out it's hardcoded
> in the pom.xml of the dump directory.
>
> I too was thinking of using partial Wikipedia files as input. The problem is,
> the input and output mechanism is sort of hardcoded. It expects a single file
> per language, e.g. input/en/20111107/enwiki-20111107-pages-articles.xml.
>
> Now I have two options. If I don't want to make any changes in the code, I
> could run the framework multiple times, once for each partial file, but then
> the outputs would be in different folders for each file.
>
> Or work on running the framework in a way that it picks up all the files in a
> folder and also collates the outputs in a single place. But this would entail
> changes in the code. Is there a simple way in the DBpedia Extraction Framework
> itself to pick up multiple files in one directory and collate the results? I
> can't seem to find one. As per my understanding, I would need to change the
> ConfigLoader class.
>
>
> Have you tried either of these?
>
> Thanks and Regards,
> Amit
>
>
> On 12/1/11 11:22 PM, "Tommy Chheng" <[email protected]> wrote:
>
> > When using mvn scala:run, use MAVEN_OPTS=-Xmx rather than JAVA_OPTS
> >
> > The dump also comes in 27 files rather than one big one. You can use
> > these instead.
> >
> > --
> > @tommychheng
> > qwiki.com
> >
> >
> > On Wed, Nov 30, 2011 at 11:01 PM, Amit Kumar <[email protected]> wrote:
> >>
> >> Hi Pablo,
> >> Thanks for your valuable input. I got the MediaWiki thing working and am
> >> able to run the abstract extractor as well.
> >>
> >> The extraction framework works well for a small sample dataset, e.g.
> >> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
> >> which has around 6300 entries. But when I try to run the framework on the
> >> full Wikipedia data (en, around 33 GB uncompressed) I get Java heap space
> >> errors.
> >>
> >> --------------------------------------
> >> Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
> >>     at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
> >>     at java.lang.StringBuilder.<init>(StringBuilder.java:80)
> >>     at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:43)
> >>     at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:48)
> >>     at org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:48)
> >>     at org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:34)
> >>     at scala.collection.Iterator$class.foreach(Iterator.scala:652)
> >>     at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:333)
> >>     at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:41)
> >>     at scala.collection.IterableViewLike$$anon$3.foreach(IterableViewLike.scala:80)
> >>     at org.dbpedia.extraction.dump.Extract$ExtractionThread.run(Extract.scala:34)
> >> Exception in thread "Thread-6" java.lang.OutOfMemoryError: Java heap space
> >>     at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
> >>     at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
> >>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
> >>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
> >>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
> >>     at scala.collection.immutable.List.foreach(List.scala:45)
> >>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> >>     at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
> >>     at scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
> >>     at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
> >>     at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:35)
> >>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
> >>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
> >>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
> >>     at scala.collection.immutable.List.foreach(List.scala:45)
> >>     at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
> >>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
> >>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
> >>     at scala.Option.foreach(Option.scala:198)
> >>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
> >>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
> >>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
> >>     at scala.collection.immutable.List.foreach(List.scala:45)
> >>     at org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
> >>     at org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
> >>     at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
> >>     at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
> >>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
> >>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
> >>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
> >>     at scala.collection.immutable.List.foreach(List.scala:45)
> >>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:194)
> >>
> >>
> >> There are several instances of GC overhead limit errors also:
> >> Exception in thread "Thread-3" java.lang.OutOfMemoryError: GC overhead limit exceeded
> >>     at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
> >>     at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
> >>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
> >>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
> >>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
> >>     at scala.collection.immutable.List.foreach(List.scala:45)
> >>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> >>     at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
> >>     at scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
> >>     at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
> >>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
> >>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
> >>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
> >>     at scala.collection.immutable.List.foreach(List.scala:45)
> >>     at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
> >>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
> >>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
> >>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
> >>     at scala.collection.immutable.List.foreach(List.scala:45)
> >>     at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
> >>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
> >>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
> >>     at scala.Option.foreach(Option.scala:198)
> >>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
> >>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
> >>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
> >>     at scala.collection.immutable.List.foreach(List.scala:45)
> >>     at org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
> >>     at org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
> >>     at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
> >>     at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
> >>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
> >> Nov 18, 2011 7:58:14 AM org.dbpedia.extraction.dump.ExtractionJob run
> >> SEVERE: Error reading pages. Shutting down...
> >> java.lang.OutOfMemoryError: Java heap space
> >>     at java.util.Arrays.copyOfRange(Arrays.java:3209)
> >>     at java.lang.String.<init>(String.java:215)
> >>     at java.lang.StringBuffer.toString(StringBuffer.java:585)
> >>     at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.getElementText(XMLStreamReaderImpl.java:859)
> >>     at org.dbpedia.extraction.sources.WikipediaDumpParser.readRevision(WikipediaDumpParser.java:241)
> >>     at org.dbpedia.extraction.sources.WikipediaDumpParser.readPage(WikipediaDumpParser.java:203)
> >>     at org.dbpedia.extraction.sources.WikipediaDumpParser.readPages(WikipediaDumpParser.java:159)
> >>     at org.dbpedia.extraction.sources.WikipediaDumpParser.readDump(WikipediaDumpParser.java:107)
> >>     at org.dbpedia.extraction.sources.WikipediaDumpParser.run(WikipediaDumpParser.java:87)
> >>     at org.dbpedia.extraction.sources.XMLSource$XMLFileSource.foreach(XMLSource.scala:40)
> >>     at org.dbpedia.extraction.dump.ExtractionJob.run(ExtractionJob.scala:54)
> >> Nov 18, 2011 7:58:14 AM org.dbpedia.extraction.dump.ExtractionJob run
> >>
> >>
> >> I'm trying to run this on both a 32-bit and a 64-bit machine (dev boxes) but
> >> to no avail. I'm guessing the default JVM configuration is too low for the
> >> DEF. It would be great if someone could tell me the minimum memory
> >> requirement for the extraction framework. I tried giving JVM options such as
> >> -Xmx to the 'mvn scala:run' command, but it seems that the mvn command spawns
> >> another process and fails to pass on the flags to the new one. If someone has
> >> been able to run the framework, could you please share the details?
> >>
> >> Also, we are looking into running the framework over Hadoop. Has anyone tried
> >> that yet? If yes, could you share your experience, and also whether it is
> >> really possible to run this on Hadoop without many changes and hacks.
> >>
> >> Thanks
> >> Amit
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On 11/23/11 2:42 PM, "Pablo Mendes" <[email protected]> wrote:
> >>
> >>
> >> Hi Amit,
> >> Thanks for your interest in DBpedia. Most of my effort has gone into
> DBpedia
> >> Spotlight, but I can try to help with the DBpedia Extraction Framework
> as
> >> well. Maybe the core developers can chip in if I misrepresent somewhere.
> >>
> >> 1) [more docs]
> >>
> >>
> >> I am unaware.
> >>
> >>
> >> 2) [typo in config]
> >>
> >>
> >> Seems ok.
> >>
> >>
> >> 3) ... Am I right? Does the framework work on any particular dump of
> >> Wikipedia? Also, what goes in the commons branch?
> >>
> >>
> >> Yes. As far as I can tell, you're right. But there is no particular dump.
> >> You just need to follow the convention for the directory structure. The
> >> commons directory has a similar structure, see:
> >>
> >> wikipediaDump/commons/20110729/commonswiki-20110729-pages-articles.xml
> >>
> >> I think this file is only used by the image extractor and maybe a couple of
> >> others. Maybe it should only be mandatory if the corresponding extractors
> >> are included in the config. But it's likely nobody got around to
> >> implementing that catch yet.
> >>
> >>
> >> 4) It seems the AbstractExtractor requires an instance of MediaWiki running
> >> to parse MediaWiki syntax. ... Can someone shed some more light on this?
> >> What customization is required? Where can I get one?
> >>
> >>
> >> The abstract extractor is used to render inline templates, as many articles
> >> start with automatically generated content from templates. See:
> >> http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia/file/945c24bdc54c/abstractExtraction
> >>
> >>
> >>
> >> Also, another question: is there a reason for the delay between subsequent
> >> DBpedia releases? I was wondering, if the code is already there, why does it
> >> take 6 months between DBpedia releases? Is there a manual editorial step
> >> involved, or is it due to development/changes in the framework code which are
> >> collated in every release?
> >>
> >>
> >> One reason might be that a lot of the value in DBpedia comes from manually
> >> generated "homogenization" in mappings.dbpedia.org. That, plus getting a
> >> stable version of the framework tested and run, would probably explain the
> >> choice of periodicity.
> >>
> >> Best,
> >> Pablo
> >>
> >> On Tue, Nov 22, 2011 at 12:03 PM, Amit Kumar <[email protected]> wrote:
> >>
> >>
> >> Hey everyone,
> >> I'm trying to set up the DBpedia extraction framework as I'm interested in
> >> getting structured data from already downloaded Wikipedia dumps. As per my
> >> understanding, I need to work in the 'dump' directory of the codebase. I have
> >> tried to reverse engineer it (given Scala is new to me) but I need some help.
> >>
> >> First of all, is there more detailed documentation somewhere about setting up
> >> and running the pipeline? The one available on dbpedia.org seems insufficient.
> >> I understand that I need to create a config.properties file first, where I
> >> need to set up the input/output locations, the list of extractors and the
> >> languages. I tried working with the config.properties.default given in the
> >> code. There seems to be a typo in the extractor list:
> >> 'org.dbpedia.extraction.mappings.InterLanguageLinksExtractorExtractor' gives
> >> a 'class not found' error. I changed it to
> >> 'org.dbpedia.extraction.mappings.InterLanguageLinksExtractor'. Is that ok?
> >> I can't find the documentation on how to set up the input directory. Can
> >> someone tell me the details? From what I gather, the input directory should
> >> contain a 'commons' directory plus a directory for each language set in
> >> config.properties. All these directories must have a subdirectory whose name
> >> should be in YYYYMMDD format. Within that you save the XML files, such as
> >> enwiki-20111111-pages-articles.xml. Am I right? Does the framework work on
> >> any particular dump of Wikipedia? Also, what goes in the commons branch?
> >> I ran the framework by copying a sample dump,
> >> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2,
> >> into both the en and commons branches, unzipping it and renaming it as per
> >> the requirement. For now I'm working with the en language only. It works with
> >> the default 19 extractors but starts failing if I include the
> >> AbstractExtractor. It seems the AbstractExtractor requires an instance of
> >> MediaWiki running to parse MediaWiki syntax. From the file itself:
> >> "DBpedia-customized MediaWiki instance is required." Can someone shed some
> >> more light on this? What customization is required? Where can I get one?
> >>
> >>
> >>
> >> Sorry if the questions are too basic and already answered somewhere. I have
> >> tried looking but couldn't find the answers myself.
> >> Also, another question: is there a reason for the delay between subsequent
> >> DBpedia releases? I was wondering, if the code is already there, why does it
> >> take 6 months between DBpedia releases? Is there a manual editorial step
> >> involved, or is it due to development/changes in the framework code which are
> >> collated in every release?
> >>
> >>
> >> Thanks and regards,
> >>
> >> Amit
> >> Tech Lead
> >> Cloud and Platform Group
> >> Yahoo!
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
> >
> > --
> > @tommychheng
> > http://tommy.chheng.com
>
>
------------------------------------------------------------------------------
All the data continuously generated in your IT infrastructure
contains a definitive record of customers, application performance,
security threats, fraudulent activity, and more. Splunk takes this
data and makes sense of it. IT sense. And common sense.
http://p.sf.net/sfu/splunk-novd2d
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion