Hi Wencan,
Answers are below each question.
On Sun, Mar 16, 2014 at 12:43 AM, wencan luo <[email protected]> wrote:
> I have successfully compiled the extraction-framework and run the download
> for the English Wikipedia.
>
> However, when I run the extraction, I have the following error:
> ################################################################
> ....
> Caused by: java.io.IOException: failed to list files in
> [E:\project\gsoc2014\wikipedia\commonswiki]
> at org.dbpedia.extraction.util.RichFile.names(RichFile.scala:44)
> at org.dbpedia.extraction.util.RichFile.names(RichFile.scala:39)
> at org.dbpedia.extraction.util.Finder.dates(Finder.scala:52)
> at org.dbpedia.extraction.dump.extract.ConfigLoader.latestDate(ConfigLoader.scala:196)
> ....
> [INFO]
> ------------------------------------------------------------------------
> [INFO] BUILD FAILURE
> [INFO]
> ------------------------------------------------------------------------
> [INFO] Total time: 03:46 h
> [INFO] Finished at: 2014-03-15T06:31:41-05:00
> [INFO] Final Memory: 10M/231M
> [INFO]
> ------------------------------------------------------------------------
> [ERROR] Failed to execute goal
> net.alchim31.maven:scala-maven-plugin:3.1.6:run (default-cli) on project dump:
> wrap: org.apache.commons.exec.ExecuteException: Process exited with an error:
> -10000 (Exit value: -10000) -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please
> read the following articles:
> [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> ###########################################################
>
> After it, there is only one output file under the dataset folder:
> enwiki-20140304-template-redirects.obj
>
>
> In addition, I used the following config parameters for the extraction:
> base-dir=E:/project/gsoc2014/wikipedia
> source=pages-articles.xml.bz2
> languages=en
>
>
> extractors.en=.MappingExtractor,.DisambiguationExtractor,.HomepageExtractor,.ImageExtractor,\
> .PersondataExtractor,.PndExtractor,.TopicalConceptsExtractor,.FlickrWrapprLinkExtractor
>
> Here are my questions:
>
> 1. Do different languages have different extractors?
>
Different functions (labels, descriptions, geodata) have different
extractors, but all extractors should work on every language specified in the
"languages=xx" parameter of the config file.
You can also list multiple languages, e.g. "languages=en,ar,de", and the
extraction framework will look for the corresponding dumps.
> 2. Is the default source parameter "pages-articles.xml.bz2"? When I didn't
> include this line, I will have an exception saying **pages-articles.xml not
> found.
>
It depends on the names of the files inside the downloaded dump. Downloading
through the extraction framework usually creates everything for you, but here
are some simple rules so you can also use your own test sample dumps (see the
example layout after this list).
If your base dir is "E:/project/gsoc2014/wikipedia":
1- inside the base dir you should have a directory named "xxwiki", where xx
is the wiki code (en for the English dump);
2- inside that you should have a directory called yyyymmdd, where yyyymmdd is
the dump date (the extraction framework downloader also does this
automatically);
3- inside each yyyymmdd directory the files should have the prefix
xxwiki-yyyymmdd- (also done automatically by the extraction framework
downloader);
4- if your dump file is called xxwiki-yyyymmdd-pages-articles.xml.bz2, set
source=pages-articles.xml.bz2, and so on.
In general you set the "source" parameter to the rest of the file name after
the "xxwiki-yyyymmdd-" prefix.
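To make that concrete, with your config above the framework would expect a
layout roughly like this (using the 20140304 date from your
template-redirects output file as an example):

E:/project/gsoc2014/wikipedia/
    enwiki/
        20140304/
            enwiki-20140304-pages-articles.xml.bz2

and then in the config:

source=pages-articles.xml.bz2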
> 3. How many hours does it take to run the extractor for only the English
> and for all the languages?
>
I don't recall specific numbers for English (maybe someone else can help with
this), but for Wikidata dumps it runs overnight, about 5-6 hours on a
quad-core machine with Intel Xeon 2.40GHz processors and 16GB RAM (working
with the compressed .bz2 dumps, which are faster than the uncompressed ones).
> 4. How much disk space do I need to store all the data?
>
It depends on the number of extractors you specify (and hence the number of
output dumps) and on the size of the input dump.
You can find all extracted dumps for DBpedia 3.9 here:
http://downloads.dbpedia.org/3.9/ . A rough estimate would be around 4GB for
English only, more or less.
> 5. How can I debug an extractor? Testing on the whole Wikipedia dump is
> impossible when debugging. It is too slow.
>
Of course you need debugging; running the whole extraction for even one dump
can take multiple hours.
Which IDE are you using? I'm using IntelliJ, where you can simply set
breakpoints and debug.
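One rough sketch, building on the sample-dump rules from question 2 (the
sample file name here is hypothetical, and it still needs the
xxwiki-yyyymmdd- prefix): keep a small, valid pages-articles XML with just a
few <page> elements copied from the full dump, e.g.
enwiki-20140304-pages-articles-sample.xml, and point the config at it while
debugging:

base-dir=E:/project/gsoc2014/wikipedia
source=pages-articles-sample.xml
languages=en

That way breakpoints are hit within seconds instead of hours.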
-------------------------------------------------
Hady El-Sahar
Research Assistant
Center of Informatics Sciences | Nile University <http://nileuniversity.edu.eg/>