[Dbpedia-developers] Extracting from List pages

Vladimir Alexiev Fri, 10 Jul 2015 08:20:08 -0700

Hi folks, we also need to extract from List_of_X pages, so we’d like to 
collaborate. Some notes:


- We plan to “convert” List_of_X to a category X, i.e. each item in the list 
will get a category assignment X (e.g. List_of_Christmas_foods -> 
category:Christmas_foods).

- Rather than creating category X blindly, we should search for an existing 
one using some simple heuristics (e.g. the list page itself is in a 
similarly-named category, or a similarly-named category exists elsewhere).

- IMHO it’s only safe to treat the first link in each list item this way (e.g. 
in “* [[Pudding]], traditionally made in the [[Australian Bush]]” only 
[[Pudding]] should get the category).
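
To make the notes above concrete, here is a rough sketch in Python (chosen 
only for brevity; it does not use the extraction framework). All function 
names are made up for illustration, and the matching rule (case-insensitive, 
spaces treated like underscores) is just one assumed "simple heuristic", not 
something we have settled on.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#section|label]];
# group 1 is the link target without section or label.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def candidate_category(list_title):
    """List_of_Christmas_foods -> Christmas_foods (None if not a list page)."""
    prefix = "List_of_"
    if not list_title.startswith(prefix):
        return None
    return list_title[len(prefix):]


def find_category(list_title, existing_categories):
    """Return an existing category matching the list name, else None.

    Assumed heuristic: compare case-insensitively, spaces == underscores.
    """
    candidate = candidate_category(list_title)
    if candidate is None:
        return None

    def norm(name):
        return name.replace(" ", "_").lower()

    by_norm = {norm(c): c for c in existing_categories}
    return by_norm.get(norm(candidate))


def first_links(wikitext):
    """Yield the first link target on each line that starts with '*'."""
    for line in wikitext.splitlines():
        if not line.startswith("*"):
            continue
        match = LINK_RE.search(line)
        if match:
            yield match.group(1).strip().replace(" ", "_")


def assign_categories(list_title, wikitext, existing_categories):
    """Pair each list item's first link with the matched category.

    Returns an empty list when no existing category matches, so no
    category is created blindly.
    """
    category = find_category(list_title, existing_categories)
    if category is None:
        return []
    return [(item, category) for item in first_links(wikitext)]
```

For the example item above, only Pudding would be paired with 
Christmas_foods; Australian_Bush is ignored because it is not the first link.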

 

What’s your progress on this task? 

Will you be addressing multiple languages?

Please email the 3 of us at Ontotext, thanks!

 

From: Nico Ring [mailto:nico.r...@student.hpi.de] 
Sent: Wednesday, May 20, 2015 3:50 PM
To: dbpedia-developers@lists.sourceforge.net
Cc: Mischkewitz, Sven; Fabian Windheuser; Patrick Kuhn
Subject: [Dbpedia-developers] Questions about DBpedia Extraction Framework

 

Hi all,

 

We are students at the Hasso Plattner Institute taking part in a seminar of 
the Semantic Web chair. Our task is to extract information from Wikipedia 
“List of” pages into DBpedia, so we are considering using the 
DBpediaExtractionFramework.

 

We have some questions regarding the framework:

*       We use IntelliJ for development. But if we place a breakpoint in the 
extraction method of our extractor and debug our Maven goal, the debugger 
doesn’t stop. We already tried using mvnDebug and attaching to it with the 
remote debugger from IntelliJ. Is there anything else we need to do to debug 
the framework?
*       A dataset needs to be set for each extractor. What is its purpose?
*       Is there a way to add state or some static information to the 
extraction process? It seems to us that the context object does something like 
that, but we don’t really understand where its content comes from or how to 
add new objects to it.
*       We also want to extract List_of pages which are in a table format. We 
found the classes `TableNode, TableRowNode, TableCellNode` which we would like 
to use. But if we extend `PageNodeExtractor` the tables don’t get wrapped in 
these classes, but are just TextNodes and InternalLinkNodes. There is a class 
called TableMapping, which looks handy, but we don’t know if and how we could 
use it.
*       Is there a way to post-process the results?

 

Thanks in advance for answering all the questions.

 

Kind Regards,

Patrick Kuhn, Fabian Windheuser, Sven Mischkewitz and Nico Ring


