Re: [Dbpedia-developers] Refactoring the extraction Framework core to accept new formats

Dimitris Kontokostas Tue, 19 Nov 2013 08:18:25 -0800

Yes, JC removed this completely in the 'dump' branch, so use whatever is
more convenient for now.
We don't plan to use Sweble for now, and Wiktionary will be fixed when we
merge dump with master.



On Tue, Nov 19, 2013 at 4:51 PM, Hady elsahar <[email protected]> wrote:

> some questions again
>
> in ConfigLoader.scala
>
>  private val parser = WikiParser.getInstance(config.parser)
>
>
> getInstance is handled by WikiParserWrapper.scala
> <https://github.com/ninniuz/extraction-framework/blob/6cb6a7b5ebe65a6804ef9bb43d05fdf72b55c577/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/WikiParserWrapper.scala>,
> which is responsible for choosing which parser to return, depending on
> the given name as well as the WikiPage.format:
>
> case WikiPageFormat.WikiText =>
>   if (wikiTextParserName == null || wikiTextParserName.equals("simple")) {
>     simpleWikiParser(page)
>   } else {
>     swebleWikiParser(page)
>   }
>
>
>
> 1- I guess that by WikiParser in his design, JC meant the
> SimpleWikiParser? I ask because the 'wikiTextParserName' that is always
> sent from Config.scala is "simple":
>
>   val parser = config.getProperty("parser", "simple")
>
>
> So should I hardcode the simpleWikiParser, or should I first check the
> name stored in Config.scala?
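
A minimal sketch of the second option in the question above (check the configured name first), assuming the config behaves like a java.util.Properties. The names ParserSelectionSketch, ParserChoice, SimpleParser, SwebleParser, fromName and fromProperties are hypothetical stand-ins, not the framework's API; only the "simple" default and the null check mirror the quoted code.

```scala
object ParserSelectionSketch {
  // Hypothetical stand-ins for the two wikitext parsers discussed above.
  sealed trait ParserChoice
  case object SimpleParser extends ParserChoice
  case object SwebleParser extends ParserChoice

  // Mirrors the quoted branch: null or "simple" selects the simple parser,
  // any other name selects the Sweble-based one.
  def fromName(name: String): ParserChoice =
    if (name == null || name == "simple") SimpleParser else SwebleParser

  // Mirrors getProperty("parser", "simple"): a missing key falls back to "simple".
  def fromProperties(props: java.util.Properties): ParserChoice =
    fromName(props.getProperty("parser", "simple"))
}
```

With the default in place, hardcoding the simple parser and reading the name give the same result unless a config file sets "parser" explicitly.
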
>
>
>
>
> On Tue, Nov 19, 2013 at 1:13 PM, Hady elsahar <[email protected]> wrote:
>
>> Hi Dimitris,
>>
>> To keep you updated and to catch issues early, here are the changes so
>> far:
>>
>>    - change the Extractor trait to accept a [T] type argument [see commit
>>    <https://github.com/hadyelsahar/extraction-framework/commit/e26ef813dad098d573be34191dfaef13c78b5986>
>>    ]
>>       - change all existing Extractors to accept type PageNode
>>       - change functions in Config.scala to load Extractors of type 'any'
>>       - check CompositeExtractor.scala to check for the Extractor type
>>       - run and check that the update works fine
>>    - change the CompositeExtractor class to load any type of classes, not
>>    only PageNode [see commit
>>    <https://github.com/hadyelsahar/extraction-framework/commit/17dcaa8b2988e7fc8676532fa849fff1eabec9d0>
>>    ]
>>
>> thanks
>> Regards
>>
>>
>>
>> On Mon, Nov 18, 2013 at 8:09 AM, Dimitris Kontokostas <
>> [email protected]> wrote:
>>
>>>
>>>
>>>
>>> On Mon, Nov 18, 2013 at 7:04 AM, Hady elsahar <[email protected]> wrote:
>>>
>>>> *Some questions:*
>>>>
>>>> 1-
>>>> In the Extractor trait, which all our extractors implement, if we
>>>> change it from:
>>>>
>>>> trait Extractor extends Mapping[PageNode]
>>>>
>>>> to
>>>>
>>>> trait Extractor [T] extends Mapping[T]
>>>>
>>>>
>>>> we will need to refactor all Extractor classes to declare which type of
>>>> data they accept.
>>>> Do you think this is OK?
>>>>
>>>
>>> I think this is a good choice. It will also help catch errors at compile
>>> time, and 'T' will never change for an Extractor.
>>>
>>>
>>>> I tried for some time to tweak it using Scala upper and lower type
>>>> bounds to make PageNode the default type when the type is not set, but I
>>>> didn't manage to do it. (Of course, we wouldn't need that if we added
>>>> the type to all existing extractors.)
>>>>
>>>
>>> I think lower/upper bounds work for sub-/supertypes only. Maybe there is
>>> a Scala tweak here that I am not aware of, but if no one objects, let's
>>> keep it simple.
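
A minimal sketch of the generic trait discussed above. Quad, PageNode, the one-argument extract signature, LabelExtractor and the PageNodeExtractor alias are simplified, hypothetical stand-ins rather than the framework's real definitions. Type bounds only constrain T and cannot supply a default; Scala has no default type arguments, so a type alias is one possible way to keep the PageNode case short.

```scala
object GenericExtractorSketch {
  // Simplified, hypothetical stand-ins for the framework types.
  case class Quad(subject: String, predicate: String, value: String)
  case class PageNode(title: String)

  trait Mapping[T] {
    def extract(input: T): Seq[Quad]
  }

  // T is fixed per extractor, so passing the wrong input type
  // is rejected at compile time.
  trait Extractor[T] extends Mapping[T]

  // No default type arguments in Scala; an alias keeps the common case short.
  type PageNodeExtractor = Extractor[PageNode]

  // Hypothetical extractor that declares its input type via the alias.
  class LabelExtractor extends PageNodeExtractor {
    def extract(page: PageNode): Seq[Quad] =
      Seq(Quad(page.title, "label", page.title))
  }
}
```
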
>>>
>>>  Best,
>>> Dimitris
>>>
>>>
>>>
>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Nov 12, 2013 at 9:32 AM, Dimitris Kontokostas <
>>>> [email protected]> wrote:
>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Nov 10, 2013 at 7:23 PM, Hady elsahar
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> In order to merge the code
>>>>>> <https://github.com/hadyelsahar/extraction-framework/commits/parseJson>
>>>>>> written for the GSoC project for the Wikidata extraction process, we
>>>>>> first need to work on issue #38 - Refactoring the core to accept new
>>>>>> formats <https://github.com/dbpedia/extraction-framework/issues/38>,
>>>>>> following JC's suggestion here
>>>>>> <https://github.com/dbpedia/extraction-framework/pull/35#issuecomment-16187074>
>>>>>> and the diagram here
>>>>>> <https://github.com/dbpedia/extraction-framework/pull/35#issuecomment-16187074>.
>>>>>>
>>>>>>
>>>>>> Below are some points that we may face:
>>>>>>
>>>>>>    - In the design JC suggested, CompositeExtractor sometimes
>>>>>>    accepts JValue, WikiPage or PageNode. We have two alternatives to
>>>>>>    implement this:
>>>>>>       - handling this automatically by checking which type each of
>>>>>>       the inner Extractors accepts, calling the parser for it and
>>>>>>       passing suitable data to the inner extractor
>>>>>>       - handling this by hardcoding, i.e. making
>>>>>>       JValueCompositeExtractor, PageNodeCompositeExtractor, etc.,
>>>>>>       either by templating or by creating subclasses
>>>>>>
>>>>> I think that the first is the goal, but I wouldn't mind if you started
>>>>> with the second approach if it makes it easier for you. Once we have it
>>>>> running we can refactor later.
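
As a rough illustration of the first alternative (automatic handling): the type argument of an extractor is erased on the JVM, so one possible way to find out at runtime which input type each inner extractor accepts is to let it carry a ClassTag. Everything in this sketch (DispatchSketch, the simplified Quad, WikiPage and PageNode, the inputType field and the select helper) is hypothetical and not taken from the framework.

```scala
import scala.reflect.ClassTag

object DispatchSketch {
  // Simplified, hypothetical stand-ins for the framework types.
  case class Quad(subject: String, predicate: String, value: String)
  case class WikiPage(title: String)
  case class PageNode(title: String)

  // Each extractor records a ClassTag for its input type, because the
  // type argument T itself is not available at runtime.
  abstract class Extractor[T](implicit val inputType: ClassTag[T]) {
    def extract(input: T): Seq[Quad]
  }

  // Keeps only the extractors whose declared input type is exactly T.
  def select[T](all: Seq[Extractor[_]])(implicit tag: ClassTag[T]): Seq[Extractor[T]] =
    all.collect {
      case e if e.inputType.runtimeClass == tag.runtimeClass =>
        e.asInstanceOf[Extractor[T]]
    }
}
```

The hardcoded alternative avoids the cast entirely, at the cost of one composite class (or type instantiation) per input type.
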
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>    - Also, in the old design it worked like this:
>>>>>>       - once we create a new Extractor, to run it we add it to the
>>>>>>       config file
>>>>>>       - ConfigLoader loads it inside the CompositeExtractor
>>>>>>       - WikiParserWrapper
>>>>>>       <https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/WikiParserWrapper.scala>
>>>>>>       decides which parser is activated
>>>>>>
>>>>>> We have to tweak this a little in our new design to allow the new
>>>>>> level of CompositeExtractors to choose which Extractors to load and
>>>>>> which not.
>>>>>>
>>>>>>
>>>>>> So why wouldn't our design be:
>>>>>>
>>>>>>
>>>>>> - allow CompositeExtractors to accept and pass only WikiPage objects
>>>>>> to their inner extractors
>>>>>>
>>>>>>
>>>>> This is what it does in JC's design; there is just an extra level of
>>>>> CompositeExtractors (more later).
>>>>>
>>>>>
>>>>>> - Devise a way to map Extractors and ParserExtractors to a PageType (an
>>>>>> enum in the Extractor class, defined in the subclasses)
>>>>>>
>>>>>> - ConfigLoader:
>>>>>>
>>>>>>    - loads all Extractors from the config file
>>>>>>    - creates two ParseExtractors (JSONParseExtractor,
>>>>>>    WikiParseExtractor)
>>>>>>    - checks the type of each needed extractor: if it's JSON, loads it
>>>>>>    into the JsonParseExtractor; if it's WikiText, loads it into the
>>>>>>    WikiParseExtractor
>>>>>>
>>>>>>
>>>>> This can be done by the first CompositeExtractor:
>>>>>    - gather all Extractor[WikiPage], encapsulate them in a
>>>>>    CompositeExtractor and extract them
>>>>>    - gather all Extractor[PageNode], encapsulate them in a
>>>>>    CompositeExtractor and pass them to WikitextParseExtractor
>>>>>       - WikitextParseExtractor: if the page type is WikiText, parse it
>>>>>       and pass a PageNode to all enabled Extractor[PageNode]s;
>>>>>       otherwise return an empty Quad list
>>>>>    - similarly for the JsonValueParseExtractor
>>>>>
>>>>> This way you don't have to change anything in the configuration
>>>>> loading, just move the parsing step further down.
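
The flow described above could be sketched roughly as follows. All types here (the WikiPageFormat cases, WikiPage, PageNode, Quad, the one-argument extract and the parse function passed to the constructor) are simplified, hypothetical stand-ins, not the framework's actual classes; the point is only where the parsing step sits.

```scala
object ParseExtractorSketch {
  // Simplified, hypothetical stand-ins for the framework types.
  sealed trait WikiPageFormat
  case object WikiText extends WikiPageFormat
  case object Json extends WikiPageFormat

  case class WikiPage(title: String, format: WikiPageFormat, source: String)
  case class PageNode(title: String, text: String)
  case class Quad(subject: String, predicate: String, value: String)

  trait Extractor[T] { def extract(input: T): Seq[Quad] }

  // Encapsulates several extractors that share one input type.
  class CompositeExtractor[T](inner: Seq[Extractor[T]]) extends Extractor[T] {
    def extract(input: T): Seq[Quad] = inner.flatMap(_.extract(input))
  }

  // Accepts a WikiPage, parses only wikitext pages and hands the PageNode
  // to the inner extractor. Any other format yields an empty Quad list.
  class WikitextParseExtractor(
      parse: WikiPage => PageNode, // hypothetical parser function
      inner: Extractor[PageNode]
  ) extends Extractor[WikiPage] {
    def extract(page: WikiPage): Seq[Quad] = page.format match {
      case WikiText => inner.extract(parse(page))
      case _        => Seq.empty
    }
  }
}
```

A JSON counterpart would have the same shape, matching on the JSON format and passing the parsed value to its own inner extractors.
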
>>>>>
>>>>>
>>>>>>    - loads JsonParserExtractor, WikiParseExtractor and the other
>>>>>>    extractors into a CompositeExtractor
>>>>>>
>>>>>> - CompositeExtractor:
>>>>>>
>>>>>>    - sends the WikiPage to all inner Extractor objects
>>>>>>    (JsonParseExtractor, WikiParseExtractor, other normal Extractors)
>>>>>>
>>>>>> - JsonParseExtractor:
>>>>>>
>>>>>>    - if the page format is JSON, runs the WikiPage object through the
>>>>>>    JSON parser and passes a JValue to all inner Extractors
>>>>>>    - otherwise, does nothing
>>>>>>
>>>>>> - WikitextParsingExtractor:
>>>>>>
>>>>>>    - if the page format is wikitext, runs the WikiPage object through
>>>>>>    WikiParser and passes a PageNode to all inner Extractors
>>>>>>    - otherwise, does nothing
>>>>>>
>>>>>>
>>>>>> - WikiParserWrapper functionality will become obsolete because, as JC
>>>>>> suggested, each parser will check the page format: if it is of the
>>>>>> matching type it parses the page, otherwise it does nothing. So we can
>>>>>> remove it.
>>>>>>
>>>>>>
>>>>>> Pros would be:
>>>>>>
>>>>>>    - simpler design, fewer classes and fewer changes
>>>>>>    - skips the extra level of composite extractors, which doesn't add
>>>>>>    any functionality
>>>>>>    - avoids the problem of different inputs and outputs for
>>>>>>    CompositeExtractor
>>>>>>    - the same config files would work
>>>>>>
>>>>>> Cons would be:
>>>>>>
>>>>>>    - it may be confusing that a ParseExtractor also contains inner
>>>>>>    Extractors
>>>>>>    - more functionality in the ConfigLoader
>>>>>>    - we would have to specify, for each Extractor, what kind of pages
>>>>>>    it needs to receive
>>>>>>
>>>>>>
>>>>>>
>>>>> Maybe I misunderstood, but the only difference I can see between your
>>>>> diagram and JC's is a level of CompositeExtractor.
>>>>> IMO this is just a design pattern that helps encapsulate multiple
>>>>> extractors for a parser, and if this is your only concern we can skip it
>>>>> for now.
>>>>>
>>>>> Cheers,
>>>>> Dimitris
>>>>>
>>>>>
>>>>>>  thanks
>>>>>> Regards
>>>>>>
>>>>>> -------------------------------------------------
>>>>>> Hady El-Sahar
>>>>>> Research Assistant
>>>>>> Center of Informatics Sciences | Nile 
>>>>>> University<http://nileuniversity.edu.eg/>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Dbpedia-developers mailing list
>>>>>> [email protected]
>>>>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Dimitris Kontokostas
>>>>> Department of Computer Science, University of Leipzig
>>>>> Research Group: http://aksw.org
>>>>> Homepage:http://aksw.org/DimitrisKontokostas
>>>>>


-- 
Dimitris Kontokostas
Department of Computer Science, University of Leipzig
Research Group: http://aksw.org
Homepage:http://aksw.org/DimitrisKontokostas
