Some questions again.

In ConfigLoader.scala:

private val parser = WikiParser.getInstance(config.parser)

getInstance is handled in WikiParserWrapper.scala
<https://github.com/ninniuz/extraction-framework/blob/6cb6a7b5ebe65a6804ef9bb43d05fdf72b55c577/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/WikiParserWrapper.scala>
which is responsible for choosing which parser to return, depending on the
given name as well as the WikiPage.format:
case WikiPageFormat.WikiText =>
  if (wikiTextParserName == null || wikiTextParserName.equals("simple")) {
    simpleWikiParser(page)
  } else {
    swebleWikiParser(page)
  }
1- I guess what JC meant by WikiParser in his design is the SimpleWikiParser?
Because the 'wikiTextParserName' that is always sent from Config.scala is
"simple":

val parser = config.getProperty("parser", "simple")

So should I use only the simpleWikiParser, hardcoded, or should I first check
the name stored in Config.scala?
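For reference, here is a minimal sketch of the second option (checking the configured name first, with "simple" as the fallback, mirroring Config.scala's default). The object and the returned names are purely illustrative stand-ins, not the framework's actual API:

```scala
import java.util.Properties

// Illustrative sketch only: checks the configured parser name first
// instead of hardcoding the simple parser.
object ParserChoice {
  def chooseParser(config: Properties): String =
    // getProperty falls back to the default when the key is absent,
    // matching `config.getProperty("parser", "simple")` in Config.scala
    config.getProperty("parser", "simple") match {
      case "simple" => "SimpleWikiParser"
      case "sweble" => "SwebleWikiParser"
      case other    => sys.error(s"unknown parser: $other")
    }
}
```

This keeps the config file authoritative, so switching to the Sweble parser later needs no code change.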
On Tue, Nov 19, 2013 at 1:13 PM, Hady elsahar <[email protected]> wrote:
> Hi Dimitris ,
>
> For the sake of keeping you updated and catching issues early, here are the
> updates so far:
>
> - change the Extractor trait to accept a [T] type argument [see
>   commit<https://github.com/hadyelsahar/extraction-framework/commit/e26ef813dad098d573be34191dfaef13c78b5986>]
>   - change all existing Extractors to accept type PageNode
>   - change functions in Config.scala to load Extractors of any type
>   - check CompositeExtractor.scala to check for the Extractor type
>   - run and check that the update works fine
> - change the CompositeExtractor class to load any type of class, not only
>   PageNode [see
>   commit<https://github.com/hadyelsahar/extraction-framework/commit/17dcaa8b2988e7fc8676532fa849fff1eabec9d0>]
>
> thanks
> Regards
>
>
>
> On Mon, Nov 18, 2013 at 8:09 AM, Dimitris Kontokostas <
> [email protected]> wrote:
>
>>
>>
>>
>> On Mon, Nov 18, 2013 at 7:04 AM, Hady elsahar <[email protected]> wrote:
>>
>>> *Some Questions : *
>>>
>>> 1-
>>> In the trait Extractor, which all our extractors implement, if we changed
>>> it from:
>>>
>>> trait Extractor extends Mapping[PageNode]
>>>
>>> to
>>>
>>> trait Extractor[T] extends Mapping[T]
>>>
>>> we would need to refactor all Extractor classes to add which type of data
>>> they accept.
>>> Do you think this is ok?
>>>
>>
>> I think this is a good choice, it will also help catch errors at compile
>> time, and 'T' will never change for an Extractor.
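As a minimal self-contained sketch of the proposed change (Mapping, PageNode, and Quad below are simplified stand-ins for illustration, not the actual framework classes):

```scala
// Simplified stand-ins for the framework types, for illustration only.
case class Quad(subject: String, predicate: String, value: String)
trait Mapping[T] { def extract(input: T, subjectUri: String): Seq[Quad] }
class PageNode(val title: String)

// The proposed generic trait: T is fixed per extractor,
// so a type mismatch is caught at compile time.
trait Extractor[T] extends Mapping[T]

// An existing extractor, refactored to declare the input type it accepts.
class LabelExtractor extends Extractor[PageNode] {
  def extract(page: PageNode, subjectUri: String): Seq[Quad] =
    Seq(Quad(subjectUri, "rdfs:label", page.title))
}
```

Passing anything other than a PageNode to LabelExtractor would then fail to compile, which is the error-catching benefit mentioned above.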
>>
>>
>>> I tried for some time to tweak it using Scala upper and lower type bounds
>>> to make PageNode the default type when the type is not set, but I didn't
>>> manage to do it. (Of course, we wouldn't need that if we added the type to
>>> all existing constructors.)
>>>
>>
>> I think lower/upper bounds work for sub-/super-types only. Maybe there is a
>> Scala trick here that I am not aware of, but if no one objects, let's keep
>> it simple.
>>
>> Best,
>> Dimitris
>>
>>
>>
>>
>>>
>>>
>>>
>>>
>>> On Tue, Nov 12, 2013 at 9:32 AM, Dimitris Kontokostas <
>>> [email protected]> wrote:
>>>
>>>>
>>>>
>>>>
>>>> On Sun, Nov 10, 2013 at 7:23 PM, Hady elsahar <[email protected]> wrote:
>>>>
>>>>> Hello All,
>>>>>
>>>>> In order to merge the code
>>>>> <https://github.com/hadyelsahar/extraction-framework/commits/parseJson>
>>>>> written for the GSoC project on the Wikidata extraction process, we
>>>>> first need to work on issue #38 - Refactoring the core to accept new
>>>>> formats <https://github.com/dbpedia/extraction-framework/issues/38>,
>>>>> referring to JC's suggestion here
>>>>> <https://github.com/dbpedia/extraction-framework/pull/35#issuecomment-16187074>
>>>>> and the diagram here
>>>>> <https://github.com/dbpedia/extraction-framework/pull/35#issuecomment-16187074>
>>>>>
>>>>>
>>>>> Below are some points we may face:
>>>>>
>>>>> - In the design JC suggested, CompositeExtractor sometimes accepts
>>>>> JValue, WikiPage, or PageNode. We have two alternatives to implement
>>>>> this:
>>>>>    - handle it automatically, by checking what type each of the
>>>>>    inner extractors accepts, calling the parser for it, and passing
>>>>>    suitable data to the inner extractor
>>>>>    - handle it by hardcoding, i.e. making JValueCompositeExtractor,
>>>>>    PageNodeCompositeExtractor, etc., either by templating or by
>>>>>    creating subclasses
>>>>>
>>>> I think the first is the goal, but I wouldn't mind if you started
>>>> with the second approach if it makes it easier for you. Once we have it
>>>> running we can refactor later.
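The first (automatic) alternative could look roughly like this sketch, which groups inner extractors by the input type they accept and parses only when needed. All types and the parse step are simplified stand-ins, not the framework's real classes:

```scala
// Illustrative sketch of dispatching by the input type an inner
// extractor accepts. Types are stand-ins for WikiPage/PageNode etc.
case class WikiPage(source: String)
case class PageNode(children: List[String])

trait Extractor[T] { def extract(input: T): List[String] }

// A composite that parses the WikiPage into a PageNode automatically,
// but only if at least one inner extractor actually needs a PageNode.
class AutoComposite(pageExtractors: List[Extractor[WikiPage]],
                    nodeExtractors: List[Extractor[PageNode]]) {
  private def parse(page: WikiPage): PageNode =
    PageNode(page.source.split(" ").toList)  // stand-in for the wiki parser

  def extract(page: WikiPage): List[String] = {
    val node = if (nodeExtractors.nonEmpty) Some(parse(page)) else None
    pageExtractors.flatMap(_.extract(page)) ++
      node.toList.flatMap(n => nodeExtractors.flatMap(_.extract(n)))
  }
}
```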
>>>>
>>>>>
>>>>>
>>>>>
>>>>> - Also, in the old design it was like this:
>>>>>    - once we create a new extractor, to run it we add it to the
>>>>>    config file
>>>>>    - ConfigLoader loads it inside the CompositeExtractor
>>>>>    - WikiParserWrapper
>>>>> <https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/WikiParserWrapper.scala>
>>>>>    decides which parser would be activated
>>>>>
>>>>> We have to tweak this a little in our new design, to allow the new level
>>>>> of CompositeExtractors to choose which extractors to load and which not.
>>>>>
>>>>>
>>>>> So why wouldn't our design be:
>>>>>
>>>>> - allow CompositeExtractor to accept and pass only WikiPage objects
>>>>> to its inner extractors
>>>>>
>>>>>
>>>> This is what it does in JC's design, there is just an extra level on
>>>> CompositeExtractors (more later)
>>>>
>>>>
>>>>> - Devise a way to map Extractors and ParseExtractors to a PageType (an
>>>>> enum in the Extractor class, defined in the subclasses)
>>>>>
>>>>> - ConfigLoader:
>>>>>
>>>>> - loads all extractors from the config file
>>>>> - creates two ParseExtractors (JsonParseExtractor, WikiParseExtractor)
>>>>> - checks the type of each needed extractor: if it's JSON, loads it into
>>>>> the JsonParseExtractor; if it's WikiText, loads it into the
>>>>> WikiParseExtractor
>>>>>
>>>>>
>>>> This can be done by the first CompositeExtractor:
>>>> - gather all Extractor[WikiPage], encapsulate them in a
>>>> CompositeExtractor, and extract them
>>>> - gather all Extractor[PageNode], encapsulate them in a
>>>> CompositeExtractor, and pass them to WikitextParseExtractor
>>>> - WikitextParseExtractor: if the page type is WikiText, parse it
>>>> and pass a PageNode to all enabled Extractor[PageNode]s; otherwise return
>>>> an empty Quad list
>>>> - similarly for the JsonValueParseExtractor
>>>>
>>>> This way you don't have to change anything in the configuration
>>>> loading, you just move the parsing step further down.
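A minimal sketch of the wrapping extractor described above, which moves the parse step inside and returns an empty result on a format mismatch (simplified stand-in types; the real framework types and parser differ):

```scala
// Sketch: the parse step moves into a wrapping extractor that returns
// an empty list when the page format doesn't match. Stand-in types only.
object Format extends Enumeration { val WikiText, Json = Value }
case class WikiPage(format: Format.Value, source: String)
case class PageNode(text: String)

trait Extractor[T] { def extract(input: T): List[String] }

class WikitextParseExtractor(inner: List[Extractor[PageNode]])
    extends Extractor[WikiPage] {
  private def parse(page: WikiPage): PageNode =
    PageNode(page.source)  // stand-in for the real wiki parser

  def extract(page: WikiPage): List[String] =
    if (page.format == Format.WikiText) {
      val node = parse(page)
      inner.flatMap(_.extract(node))
    } else Nil  // wrong format: empty result, no error
}
```

A JsonValueParseExtractor would follow the same shape, checking for Format.Json and passing a JValue to its inner extractors.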
>>>>
>>>>
>>>>> - load the JsonParseExtractor, WikiParseExtractor, and other extractors
>>>>> into a CompositeExtractor
>>>>>
>>>>> - CompositeExtractor:
>>>>>
>>>>> - sends the WikiPage to all inner Extractor objects (JsonParseExtractor,
>>>>> WikiParseExtractor, other normal extractors)
>>>>>
>>>>> - JsonParseExtractor:
>>>>>
>>>>> - if the page format is JSON, runs the WikiPage object through the JSON
>>>>> parser and passes a JValue to all inner extractors
>>>>>
>>>>> - otherwise, does nothing
>>>>>
>>>>> - WikitextParsingExtractor:
>>>>>
>>>>> - if the page format is wikitext, runs the WikiPage object through the
>>>>> WikiParser and passes a PageNode to all inner extractors
>>>>>
>>>>> - otherwise, does nothing
>>>>>
>>>>> - The WikiParserWrapper functionality will be obsolete because, as JC
>>>>> suggested, each parser will check the page format: if it's of the same
>>>>> type, it parses it; if not, it does nothing. So we remove it.
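The flow above could be sketched as follows: a top-level composite fans the WikiPage out to both parse extractors, and each one checks the format itself, which is what makes WikiParserWrapper unnecessary. All types here are simplified stand-ins, not the framework's classes:

```scala
// Sketch of the proposed pipeline with stand-in types.
object Format extends Enumeration { val WikiText, Json = Value }
case class WikiPage(format: Format.Value, source: String)

trait Extractor { def extract(page: WikiPage): List[String] }

// Each parse extractor checks the page format itself and otherwise
// does nothing, replacing the WikiParserWrapper dispatch.
class JsonParseExtractor extends Extractor {
  def extract(page: WikiPage): List[String] =
    if (page.format == Format.Json) List("json:" + page.source) else Nil
}

class WikitextParsingExtractor extends Extractor {
  def extract(page: WikiPage): List[String] =
    if (page.format == Format.WikiText) List("wiki:" + page.source) else Nil
}

// The top-level composite just fans the WikiPage out to all inner extractors.
class CompositeExtractor(inner: List[Extractor]) extends Extractor {
  def extract(page: WikiPage): List[String] = inner.flatMap(_.extract(page))
}
```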
>>>>>
>>>>>
>>>>> Pros would be:
>>>>>
>>>>> - simpler design, fewer classes, and fewer changes as well
>>>>> - skips the extra level of composite extractors that doesn't add any
>>>>> functionality
>>>>> - overcomes the problem of different inputs and outputs for
>>>>> CompositeExtractor
>>>>> - the same config files would work
>>>>>
>>>>> Cons would be:
>>>>>
>>>>> - maybe it's confusing that a ParseExtractor also contains inner
>>>>> extractors
>>>>> - more functionality in the ConfigLoader
>>>>> - we would have to specify, for each of the extractors, what kind of
>>>>> pages it needs to receive
>>>>>
>>>>>
>>>>>
>>>> Maybe I misunderstood, but the only change I can see between your diagram
>>>> and JC's is a level of CompositeExtractor.
>>>> IMO this is just a design pattern that helps encapsulate multiple
>>>> extractors for a parser, and if this is your only concern we can skip it
>>>> for now.
>>>>
>>>> Cheers,
>>>> Dimitris
>>>>
>>>>
>>>>> thanks
>>>>> Regards
>>>>>
>>>>> -------------------------------------------------
>>>>> Hady El-Sahar
>>>>> Research Assistant
>>>>> Center of Informatics Sciences | Nile
>>>>> University<http://nileuniversity.edu.eg/>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Dbpedia-developers mailing list
>>>>> [email protected]
>>>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Dimitris Kontokostas
>>>> Department of Computer Science, University of Leipzig
>>>> Research Group: http://aksw.org
>>>> Homepage:http://aksw.org/DimitrisKontokostas
>>>>
>>>
>>>
>>>
>>> --
>>> -------------------------------------------------
>>> Hady El-Sahar
>>> Research Assistant
>>> Center of Informatics Sciences | Nile
>>> University<http://nileuniversity.edu.eg/>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Dimitris Kontokostas
>> Department of Computer Science, University of Leipzig
>> Research Group: http://aksw.org
>> Homepage:http://aksw.org/DimitrisKontokostas
>>
>
>
>
> --
> -------------------------------------------------
> Hady El-Sahar
> Research Assistant
> Center of Informatics Sciences | Nile
> University<http://nileuniversity.edu.eg/>
>
>
>
--
-------------------------------------------------
Hady El-Sahar
Research Assistant
Center of Informatics Sciences | Nile University<http://nileuniversity.edu.eg/>