Re: [Dbpedia-developers] Refactoring the extraction Framework core to accept new formats

Dimitris Kontokostas Mon, 11 Nov 2013 23:33:35 -0800

On Sun, Nov 10, 2013 at 7:23 PM, Hady elsahar <[email protected]> wrote:


> Hello All,
>
> In order to merge the code
> <https://github.com/hadyelsahar/extraction-framework/commits/parseJson>
> written for the GSoC project on the Wikidata extraction process, we first
> need to work on issue #38, "Refactoring the core to accept new formats"
> <https://github.com/dbpedia/extraction-framework/issues/38>,
> following JC's suggestion and diagram here:
> <https://github.com/dbpedia/extraction-framework/pull/35#issuecomment-16187074>
>
>
> Below are some points that we may face:
>
>    - In the design JC suggested, CompositeExtractor sometimes accepts a
>    JValue, a WikiPage, or a PageNode. We have two alternatives to
>    implement this:
>       - handle this automatically by checking which type each of the
>       inner extractors accepts, calling the matching parser, and passing
>       suitable data to the inner extractor
>       - handle this by hardcoding, i.e. making JValueCompositeExtractor,
>       PageNodeCompositeExtractor, etc., either by templating or by
>       creating subclasses
>
I think that the first is the goal, but I wouldn't mind if you started with
the second approach if it makes it easier for you. Once we have it running,
we can refactor later.
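
As a rough illustration of the second approach, here is a minimal Scala
sketch. Everything in it is an assumption made for the example: Quad,
PageNode, JValue and the Extractor trait are simplified stand-ins, not the
framework's actual classes. Only the two composite class names come from the
proposal quoted above.

```scala
object HardcodedCompositeSketch {
  // Hypothetical placeholder types; the real framework classes are richer.
  final case class Quad(subject: String, predicate: String, value: String)
  final case class PageNode(title: String, text: String)
  final case class JValue(title: String, raw: String) // stand-in, not a real JSON AST

  trait Extractor[N] {
    def extract(input: N): Seq[Quad]
  }

  // "Templating": one generic composite, parameterised by its input type.
  class CompositeExtractor[N](inner: Seq[Extractor[N]]) extends Extractor[N] {
    def extract(input: N): Seq[Quad] = inner.flatMap(_.extract(input))
  }

  // "Hardcoding": one named subclass per input type.
  final class PageNodeCompositeExtractor(inner: Seq[Extractor[PageNode]])
    extends CompositeExtractor[PageNode](inner)

  final class JValueCompositeExtractor(inner: Seq[Extractor[JValue]])
    extends CompositeExtractor[JValue](inner)

  // Toy extractor used only to exercise the composite.
  val lengthExtractor: Extractor[PageNode] = new Extractor[PageNode] {
    def extract(node: PageNode): Seq[Quad] =
      Seq(Quad(node.title, "textLength", node.text.length.toString))
  }

  val pageNodeComposite =
    new PageNodeCompositeExtractor(Seq(lengthExtractor, lengthExtractor))
}
```

The named subclasses add nothing beyond the generic class here, which is why
they should be easy to fold away in a later refactoring.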

>
>
>
>    - Also, in the old design it worked like this:
>       - once we create a new extractor, to run it we add it to the config
>       file
>       - ConfigLoader loads it inside the CompositeExtractor
>       - WikiParserWrapper
>       <https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/WikiParserWrapper.scala>
>       decides which parser is activated
>
> We have to tweak this a little in our new design to allow the new
> level of CompositeExtractors to choose which extractors to load and which
> not.
>
>
> So why wouldn't our design be:
>
>
> - allow CompositeExtractors to accept and pass only WikiPage objects to
> their inner extractors
>
>
This is what JC's design does; there is just an extra level of
CompositeExtractors (more on that later).


>  - Devise a way to map Extractors and ParseExtractors to a PageType (an
> enum in the Extractor class, defined in the subclasses)
>
> - ConfigLoader:
>
> - loads all extractors from the config file
> - creates two ParseExtractors (JsonParseExtractor, WikiParseExtractor)
> - checks the type of each needed extractor: if it's JSON, loads it into
> the JsonParseExtractor; if it's wikitext, loads it into the
> WikiParseExtractor
>
>
This can be done by the first CompositeExtractor:

- gather all Extractor[WikiPage], encapsulate them in a CompositeExtractor,
  and extract with them
- gather all Extractor[PageNode], encapsulate them in a CompositeExtractor,
  and pass that to the WikitextParseExtractor
    - WikitextParseExtractor: if the page type is wikitext, parse it and
      pass a PageNode to all enabled Extractor[PageNode]; otherwise return
      an empty Quad list
- similarly for the JsonValueParseExtractor

This way you don't have to change anything in the configuration loading;
you just move the parsing step further down.
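
The steps above can be sketched roughly as follows. This is a minimal Scala
sketch under stated assumptions, not the framework's code: WikiPage,
PageNode, Quad, PageFormat and parseWikiText are simplified, hypothetical
stand-ins, and only the names CompositeExtractor and WikitextParseExtractor
are taken from this thread.

```scala
object ParseExtractorSketch {
  // Hypothetical, simplified types; the real framework classes differ.
  sealed trait PageFormat
  case object WikiText extends PageFormat
  case object Json extends PageFormat

  final case class WikiPage(title: String, format: PageFormat, source: String)
  final case class PageNode(title: String, lines: List[String])
  final case class Quad(subject: String, predicate: String, value: String)

  trait Extractor[N] {
    def extract(input: N): Seq[Quad]
  }

  // Runs every inner extractor on the same input and concatenates the quads.
  final class CompositeExtractor[N](inner: Seq[Extractor[N]]) extends Extractor[N] {
    def extract(input: N): Seq[Quad] = inner.flatMap(_.extract(input))
  }

  // Stand-in for the real wikitext parser: just splits the source into lines.
  def parseWikiText(page: WikiPage): PageNode =
    PageNode(page.title, page.source.split("\n").toList)

  // Parses only wikitext pages; any other format yields an empty quad list.
  final class WikitextParseExtractor(inner: Extractor[PageNode])
    extends Extractor[WikiPage] {
    def extract(page: WikiPage): Seq[Quad] = page.format match {
      case WikiText => inner.extract(parseWikiText(page))
      case _        => Seq.empty
    }
  }

  // Toy extractors, one per input type, used only for the demonstration.
  val labelExtractor: Extractor[WikiPage] = new Extractor[WikiPage] {
    def extract(page: WikiPage): Seq[Quad] =
      Seq(Quad(page.title, "label", page.title))
  }
  val lineCountExtractor: Extractor[PageNode] = new Extractor[PageNode] {
    def extract(node: PageNode): Seq[Quad] =
      Seq(Quad(node.title, "lineCount", node.lines.size.toString))
  }

  // The first CompositeExtractor: WikiPage extractors run directly, PageNode
  // extractors sit behind the parsing step.
  val root: Extractor[WikiPage] = new CompositeExtractor[WikiPage](Seq(
    labelExtractor,
    new WikitextParseExtractor(
      new CompositeExtractor[PageNode](Seq(lineCountExtractor)))
  ))
}
```

With this layout a JSON page passed to `root` still reaches `labelExtractor`
but produces nothing from the wikitext branch; a JSON-side parse extractor
would be added to the same top-level Seq in the same way.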


>  - load JsonParseExtractor, WikiParseExtractor, and the other extractors
> into a CompositeExtractor
>
> - CompositeExtractor:
>
> - sends the WikiPage to all inner Extractor objects (JsonParseExtractor,
> WikiParseExtractor, other normal Extractors)
>
> - JsonParseExtractor :
>
> - If page format is JSON, run WikiPage object through JSON parser and pass
> JValue to all inner Extractors
>
> - Otherwise, do nothing
>
> - WikitextParsingExtractor:
>
> - If page format is wikitext, run WikiPage object through WikiParser and
> pass PageNode to all inner Extractors
>
> - Otherwise, do nothing
>
>
> - WikiParserWrapper functionality becomes obsolete because, as JC
> suggested, each parser checks the page format: if it matches the parser's
> type, it parses the page; if not, it does nothing. So we can remove it.
>
>
> Pros would be:
>
>    - simpler design, fewer classes, and fewer changes as well
>    - skips the extra level of composite extractors, which doesn't add any
>    functionality
>    - avoids the problem of different inputs and outputs for
>    CompositeExtractor
>    - the same config files would work
>
> Cons would be:
>
>    - it may be confusing that a ParseExtractor also contains inner
>    extractors
>    - more functionality in the ConfigLoader
>    - we have to specify for each extractor what kind of pages it
>    needs to receive
>
>
>
Maybe I misunderstood, but the only difference I can see between your
diagram and JC's is a level of CompositeExtractor.
IMO this is just a design pattern that helps encapsulate multiple
extractors for a parser, and if this is your only concern, we can skip it
for now.

Cheers,
Dimitris


> thanks
> Regards
>
> -------------------------------------------------
> Hady El-Sahar
> Research Assistant
> Center of Informatics Sciences | Nile University
> <http://nileuniversity.edu.eg/>
>


-- 
Dimitris Kontokostas
Department of Computer Science, University of Leipzig
Research Group: http://aksw.org
Homepage:http://aksw.org/DimitrisKontokostas

_______________________________________________
Dbpedia-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
