Hi Wencan,

The wiki page covers setting up IntelliJ, but to compile, run, or debug
code from IntelliJ you also have to set up a Run/Debug configuration:


1- From the top menu, open Run / Edit Configurations.
2- Set the following:

main class -> org.dbpedia.extraction.dump.extract.Extraction
program arguments -> your config file name
(extraction.default.properties, for example)
working directory -> *path-to-your-extraction-framework*
/extraction-framework/dump
use classpath of module -> dump

3- Save this configuration, then select it the next time you Run or Debug
from the interface (in the drop-down next to the green Run button).
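For reference, when you save and share that configuration, IntelliJ writes it
as a small XML file under .idea/runConfigurations/ in the project. This is only
a sketch showing where the fields above end up (option names can differ
slightly between IntelliJ versions, so let the IDE generate it rather than
writing it by hand):

```xml
<component name="ProjectRunConfigurationManager">
  <configuration name="Extraction" type="Application" factoryName="Application">
    <!-- main class -->
    <option name="MAIN_CLASS_NAME" value="org.dbpedia.extraction.dump.extract.Extraction" />
    <!-- program arguments: your config file name -->
    <option name="PROGRAM_PARAMETERS" value="extraction.default.properties" />
    <!-- working directory: the dump module inside the extraction framework -->
    <option name="WORKING_DIRECTORY" value="$PROJECT_DIR$/dump" />
    <!-- use classpath of module -->
    <module name="dump" />
  </configuration>
</component>
```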

Then you can build, run, and debug through the IntelliJ interface, and you
won't need to use the mvn command from the command line.


The IntelliJ Maven plugin also keeps track of changes, so it recompiles only
the files that changed (faster than running the mvn commands from the command
line, e.g. mvn clean install and the mvn run goal for the extraction).



On Sun, Mar 16, 2014 at 1:52 AM, wencan luo <[email protected]> wrote:

>
>
>
> On Sat, Mar 15, 2014 at 7:35 PM, Hady elsahar <[email protected]>wrote:
>
>> Hi Wencan,
>>
>> Answers are below each question.
>>
>> On Sun, Mar 16, 2014 at 12:43 AM, wencan luo <[email protected]>wrote:
>>
>>> I have successfully compiled the extraction-framework and run the
>>> download for the English Wikipedia.
>>>
>>> However, when I run the extraction, I have the following error:
>>> ################################################################
>>> ....
>>> Caused by: java.io.IOException: failed to list files in
>>> [E:\project\gsoc2014\wikipedia\commonswiki]
>>>         at org.dbpedia.extraction.util.RichFile.names(RichFile.scala:44)
>>>         at org.dbpedia.extraction.util.RichFile.names(RichFile.scala:39)
>>>         at org.dbpedia.extraction.util.Finder.dates(Finder.scala:52)
>>>         at
>>> org.dbpedia.extraction.dump.extract.ConfigLoader.latestDate(ConfigLoader.scala:196)
>>> ....
>>> [INFO]
>>> ------------------------------------------------------------------------
>>> [INFO] BUILD FAILURE
>>> [INFO]
>>> ------------------------------------------------------------------------
>>> [INFO] Total time: 03:46 h
>>> [INFO] Finished at: 2014-03-15T06:31:41-05:00
>>> [INFO] Final Memory: 10M/231M
>>> [INFO]
>>> ------------------------------------------------------------------------
>>> [ERROR] Failed to execute goal
>>> net.alchim31.maven:scala-maven-plugin:3.1.6:run (default-cli) on project
>>> dump: wrap: org.apache.commons.exec.ExecuteException: Process exited with
>>> an error: -10000 (Exit value: -10000) -> [Help 1]
>>> [ERROR]
>>> [ERROR] To see the full stack trace of the errors, re-run Maven with the
>>> -e switch.
>>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>>> [ERROR]
>>> [ERROR] For more information about the errors and possible solutions,
>>> please read the following articles:
>>> [ERROR] [Help 1]
>>> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
>>> ###########################################################
>>>
>>> After it, there is only one output file under the dataset folder:
>>> enwiki-20140304-template-redirects.obj
>>>
>>>
>>> In addition, I used the following config parameters for the extraction:
>>> base-dir=E:/project/gsoc2014/wikipedia
>>> source=pages-articles.xml.bz2
>>> languages=en
>>>
>>>
>>
>>> extractors.en=.MappingExtractor,.DisambiguationExtractor,.HomepageExtractor,.ImageExtractor,\
>>> .PersondataExtractor,.PndExtractor,.TopicalConceptsExtractor,.FlickrWrapprLinkExtractor
>>>
>>> Here are my questions:
>>>
>>
>>
>>> 1. Do different languages have different extractors?
>>>
>>
>> Different functions (labels, descriptions, geodata) have different
>> extractors, but all extractors should work on all the languages specified
>> in the "languages=xx" parameter in the config file.
>> You can also list multiple languages, e.g. "languages=en,ar,de", and the
>> extraction framework will search for the corresponding dumps.
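Putting that together, a minimal multi-language config could look like this
(a sketch based on the parameters quoted above, not a complete config file):

```properties
base-dir=E:/project/gsoc2014/wikipedia
source=pages-articles.xml.bz2
# the framework will look for enwiki, arwiki and dewiki dumps under base-dir
languages=en,ar,de
```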
>>
>>
>>
>>> 2. Is the default source parameter "pages-articles.xml.bz2"? When I
>>> don't include this line, I get an exception saying
>>> **pages-articles.xml not found.
>>>
>>
>> It depends on the names of the files inside the downloaded dump
>> (downloading through the extraction framework usually creates everything,
>> but here are simple rules so you get the idea and can use your own test
>> sample dumps).
>>
>> If your base dir is "E:/project/gsoc2014/wikipedia":
>>
>> 1- In the base directory you should have a directory named "xxwiki", where
>> xx is the wiki code (en for the English dump).
>> 2- Inside it you should have a directory called yyyymmdd, where yyyymmdd is
>> the dump date (this is also done automatically by the extraction framework
>> downloader).
>> 3- Inside each yyyymmdd directory, files should have the prefix
>> xxwiki-yyyymmdd- (this is also done automatically by the extraction
>> framework downloader).
>> 4- If your dump file is called xxwiki-yyyymmdd-pages-articles.xml.bz2, set
>> source to "pages-articles.xml.bz2", and so on: you usually set the "source"
>> parameter to the rest of the file name after the "xxwiki-yyyymmdd-" prefix.
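The naming rules above can be sketched in a few lines of Python
(dump_file_path is just an illustrative helper, not part of the framework):

```python
import os

def dump_file_path(base_dir, language, date, source):
    """Compose the path the framework expects for one dump file.

    base_dir: the config's base-dir, e.g. "E:/project/gsoc2014/wikipedia"
    language: wiki code, e.g. "en" (rule 1: directory "enwiki")
    date:     dump date as yyyymmdd, e.g. "20140304" (rule 2)
    source:   the config's source value, e.g. "pages-articles.xml.bz2"
    """
    wiki = language + "wiki"            # rule 1: the xxwiki directory
    prefix = wiki + "-" + date + "-"    # rule 3: the xxwiki-yyyymmdd- file prefix
    return os.path.join(base_dir, wiki, date, prefix + source)

print(dump_file_path("E:/project/gsoc2014/wikipedia", "en", "20140304",
                     "pages-articles.xml.bz2"))
```

So with base-dir=E:/project/gsoc2014/wikipedia, languages=en, a dump dated
20140304, and source=pages-articles.xml.bz2, the framework looks for
enwiki/20140304/enwiki-20140304-pages-articles.xml.bz2 under the base dir.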
>>
>>> 3. How many hours does it take to run the extractor for English only,
>>> and for all languages?
>>>
>>
>> I don't recall specific numbers for English (maybe someone else can help
>> with this), but for Wikidata dumps it runs overnight: 5-6 hours on a
>> quad-core machine with Intel Xeon 2.40GHz processors and 16GB RAM (while
>> working with compressed (.bz2) dumps, which are faster than uncompressed
>> ones).
>>
>>
>>> 4. How much disk space do I need to store all the data?
>>>
>>
>> It depends on the number of extractors you specify (hence the number of
>> output dumps) and on the size of the input dump.
>> You can find all extracted dumps for DBpedia 3.9 here:
>> http://downloads.dbpedia.org/3.9/ ; a rough estimate would be 4GB for
>> English only, more or less.
>>
>>> 5. How can I debug an extractor? Testing on the whole Wikipedia dump is
>>> impossible when debugging. It is too slow.
>>>
>>
>> Of course you need debugging; running the whole extraction for even one
>> dump can take multiple hours.
>> Which IDE are you using? I'm using IntelliJ, where you can simply set
>> breakpoints and debug.
>>
> I am just starting to learn IntelliJ.
> However, I believe another step is needed in the instructions in the wiki:
>
> https://github.com/dbpedia/extraction-framework/wiki/Setting-up-IntelliJ-IDEA
>
> Because using the command line (mvn run ...) to run the extractor has
> nothing to do with the IDEA.
> Therefore, I think we have to set some goals or a Maven module for the
> project? However, there is currently no such instruction in the wiki.
>
>>
>>
>> -------------------------------------------------
>> Hady El-Sahar
>> Research Assistant
>> Center of Informatics Sciences | Nile University<http://nileuniversity.edu.eg/>
>>
>>
>>
>
>
> --
> Wencan Luo
> CS Department- Univ. of Pittsburgh
> 210 S. Bouquet Street
> 6501 Sennott Square
> Pittsburgh, PA 15260
> E-mail: [email protected] or [email protected]
>



-- 
-------------------------------------------------
Hady El-Sahar
Research Assistant
Center of Informatics Sciences | Nile University<http://nileuniversity.edu.eg/>
_______________________________________________
Dbpedia-gsoc mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
