Hi Dimitris,

To keep you updated and catch issues early, here are the updates so far:

   - changed the Extractor trait to accept a [T] type argument (sketched
     below; see commit:
     https://github.com/hadyelsahar/extraction-framework/commit/e26ef813dad098d573be34191dfaef13c78b5986)
      - changed all existing Extractors to declare the type PageNode
         - changed the functions in config.scala to load Extractors of any type
         - updated compositeExtractor.scala to check the Extractor type
         - ran the extraction and checked that the update works fine
   - changed the CompositeExtractor class to load classes of any type, not
     only PageNode (see commit:
     https://github.com/hadyelsahar/extraction-framework/commit/17dcaa8b2988e7fc8676532fa849fff1eabec9d0)
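
For reference, here is a rough, self-contained sketch of what the trait
change amounts to (the Mapping signature and the example extractor are
simplified stand-ins, not the exact code in the commits):

// simplified stand-ins so the sketch compiles on its own
class PageNode
class Quad
trait Mapping[T] { def extract(input: T, subjectUri: String): Seq[Quad] }

// before: trait Extractor extends Mapping[PageNode]
// after:  the trait carries a type parameter for the input it accepts
trait Extractor[T] extends Mapping[T]

// each existing extractor now states its input type explicitly
class SomeWikitextExtractor extends Extractor[PageNode] {
  override def extract(page: PageNode, subjectUri: String): Seq[Quad] = Seq.empty
}

With the type parameter in place, the CompositeExtractor change in the
second commit is mainly about no longer assuming Extractor[PageNode]
everywhere.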

Thanks,
Regards



On Mon, Nov 18, 2013 at 8:09 AM, Dimitris Kontokostas <
[email protected]> wrote:

>
>
>
> On Mon, Nov 18, 2013 at 7:04 AM, Hady elsahar <[email protected]> wrote:
>
>> Some questions:
>>
>> 1- In the Extractor trait, which all our extractors implement, if we
>> change it from:
>>
>> trait Extractor extends Mapping[PageNode]
>>
>> to
>>
>> trait Extractor [T] extends Mapping[T]
>>
>>
>> we will need to refactor all Extractor classes to declare which type of
>> data they accept.
>> Do you think this is OK?
>>
>
> I think this is a good choice; it will also help catch errors at compile
> time, and 'T' will never change for an Extractor.
>
>
>> I tried for some time to tweak it using Scala upper and lower type bounds
>> to make PageNode the default type when the type is not set, but I didn't
>> manage to do it. (Of course, we wouldn't need that if we added the type to
>> all existing extractors.)
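>>
>> Roughly what I was trying, just as a sketch (Node here stands for the
>> common supertype I tried to bound on):
>>
>> // a bound only restricts which T is allowed; it doesn't pick one by default
>> trait Extractor[T <: Node] extends Mapping[T]
>> // so every subclass still has to name its T, e.g. Extractor[PageNode]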
>>
>
> I think lower/upper bounds work for sub-/super-type constraints only. Maybe
> there is a Scala tweak here that I am not aware of, but if no one objects,
> let's keep it simple.
>
> Best,
> Dimitris
>
>
>
>
>>
>>
>>
>>
>> On Tue, Nov 12, 2013 at 9:32 AM, Dimitris Kontokostas <
>> [email protected]> wrote:
>>
>>>
>>>
>>>
>>> On Sun, Nov 10, 2013 at 7:23 PM, Hady elsahar <[email protected]> wrote:
>>>
>>>> Hello All,
>>>>
>>>> In order to merge the code written for the GSoC project on the Wikidata
>>>> extraction process
>>>> (https://github.com/hadyelsahar/extraction-framework/commits/parseJson),
>>>> we first need to work on issue #38 - Refactoring the core to accept new
>>>> formats (https://github.com/dbpedia/extraction-framework/issues/38),
>>>> referring to JC's suggestion and diagram here:
>>>> https://github.com/dbpedia/extraction-framework/pull/35#issuecomment-16187074
>>>>
>>>>
>>>> Below are some points we may face:
>>>>
>>>>    - in the design JC suggested, CompositeExtractor sometimes accepts
>>>>      JValue, WikiPage, or PageNode. We have two alternatives to implement
>>>>      this:
>>>>       - handle it automatically by checking which type each of the inner
>>>>         Extractors accepts, calling the right parser for it, and passing
>>>>         the suitable data to the inner extractor
>>>>       - handle it by hardcoding, i.e. making a JValueCompositeExtractor,
>>>>         a PageNodeCompositeExtractor, etc., either by templating or by
>>>>         creating subclasses
>>>>
>>> I think the first is the goal, but I wouldn't mind if you started with the
>>> second approach if it makes it easier for you. Once we have it running we
>>> can refactor later.
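>>>
>>> To make the hardcoded variant concrete, a minimal sketch (the extract
>>> signature is a simplified assumption, not the framework's exact API):
>>>
>>> // one composite class per input type, each just fanning out to its children
>>> class PageNodeCompositeExtractor(children: Seq[Extractor[PageNode]])
>>>   extends Extractor[PageNode] {
>>>   override def extract(page: PageNode, subjectUri: String): Seq[Quad] =
>>>     children.flatMap(_.extract(page, subjectUri))
>>> }
>>>
>>> class JValueCompositeExtractor(children: Seq[Extractor[JValue]])
>>>   extends Extractor[JValue] {
>>>   override def extract(json: JValue, subjectUri: String): Seq[Quad] =
>>>     children.flatMap(_.extract(json, subjectUri))
>>> }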
>>>
>>>>
>>>>
>>>>
>>>>    - also, in the old design it worked like this:
>>>>       - once we create a new Extractor, to run it we add it to the config
>>>>         file
>>>>       - ConfigLoader loads it inside the CompositeExtractor
>>>>       - WikiParserWrapper
>>>>         (https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/WikiParserWrapper.scala)
>>>>         decides which parser is activated
>>>>
>>>> We have to tweak this a little in our new design to allow the new level of
>>>> CompositeExtractors to choose which Extractors to load and which not to.
>>>>
>>>>
>>>> So why shouldn't our design be the following:
>>>>
>>>>
>>>> - allow CompositeExtractors to accept and pass only WikiPage objects to
>>>>   their inner extractors
>>>>
>>>>
>>> This is what it does in JC's design; there is just an extra level of
>>> CompositeExtractors (more on that later).
>>>
>>>
>>>> - devise a way to map Extractors and ParserExtractors to a PageType (an
>>>>   enum declared in the Extractor class and defined in the subclasses)
>>>>
>>>> - ConfigLoader:
>>>>
>>>>    - loads all Extractors from the config file
>>>>    - creates two ParseExtractors (JsonParseExtractor, WikiParseExtractor)
>>>>    - checks the type of each needed extractor: if it's JSON, load it into
>>>>      the JsonParseExtractor; if it's wikitext, load it into the
>>>>      WikiParseExtractor
>>>>
>>>>
>>> This can be done by the first CompositeExtractor:
>>>    - gather all Extractor[WikiPage], encapsulate them in a
>>>      CompositeExtractor, and extract them directly
>>>    - gather all Extractor[PageNode], encapsulate them in a
>>>      CompositeExtractor, and pass them to a WikitextParseExtractor
>>>      (sketched below). The WikitextParseExtractor works as follows: if the
>>>      page type is wikitext, parse it and pass the resulting PageNode to all
>>>      enabled Extractor[PageNode]s; otherwise return an empty Quad list.
>>>    - similar for the JsonValueParseExtractor
>>>
>>> This way you don't have to change anything in the configuration loading;
>>> you just move the parsing step further down.
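>>>
>>> A rough sketch of that WikitextParseExtractor (the page-format check and
>>> the parser argument are assumptions about the API, just to show the shape):
>>>
>>> class WikitextParseExtractor(children: Seq[Extractor[PageNode]],
>>>                              parser: WikiPage => PageNode)
>>>   extends Extractor[WikiPage] {
>>>
>>>   override def extract(page: WikiPage, subjectUri: String): Seq[Quad] =
>>>     if (page.format == "text/x-wiki") {      // wikitext: parse once, fan out
>>>       val pageNode = parser(page)
>>>       children.flatMap(_.extract(pageNode, subjectUri))
>>>     } else {
>>>       Seq.empty                              // anything else: empty Quad list
>>>     }
>>> }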
>>>
>>>
>>>> - load the JsonParseExtractor, the WikiParseExtractor, and the other
>>>>   extractors into a CompositeExtractor
>>>>
>>>> - CompositeExtractor:
>>>>
>>>>    - sends the WikiPage to all inner Extractor objects (JsonParseExtractor,
>>>>      WikiParseExtractor, and the other normal extractors)
>>>>
>>>> - JsonParseExtractor (see the sketch after this list):
>>>>
>>>>    - if the page format is JSON, run the WikiPage object through the JSON
>>>>      parser and pass the resulting JValue to all inner Extractors
>>>>    - otherwise, do nothing
>>>>
>>>> - WikitextParsingExtractor:
>>>>
>>>>    - if the page format is wikitext, run the WikiPage object through the
>>>>      WikiParser and pass the resulting PageNode to all inner Extractors
>>>>    - otherwise, do nothing
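>>>>
>>>> As a sketch, the JsonParseExtractor mentioned above could look roughly
>>>> like this (json4s for the JValue parsing and the format/source fields on
>>>> WikiPage are assumptions, not the exact framework API):
>>>>
>>>> import org.json4s._
>>>> import org.json4s.jackson.JsonMethods.parse
>>>>
>>>> class JsonParseExtractor(children: Seq[Extractor[JValue]])
>>>>   extends Extractor[WikiPage] {
>>>>
>>>>   override def extract(page: WikiPage, subjectUri: String): Seq[Quad] =
>>>>     if (page.format == "application/json") {   // JSON page: parse and fan out
>>>>       val json: JValue = parse(page.source)
>>>>       children.flatMap(_.extract(json, subjectUri))
>>>>     } else {
>>>>       Seq.empty                                // otherwise, do nothing
>>>>     }
>>>> }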
>>>>
>>>>
>>>> - the WikiParserWrapper functionality will become obsolete: as JC
>>>>   suggested, each parser will check the page format itself and parse the
>>>>   page only if it is of the matching type, doing nothing otherwise, so we
>>>>   can remove the wrapper
>>>>
>>>>
>>>> Pros would be:
>>>>
>>>>    - simpler design: fewer classes and fewer changes
>>>>    - skips the extra level of composite extractors that doesn't add any
>>>>      functionality
>>>>    - avoids the problem of different inputs and outputs for
>>>>      CompositeExtractor
>>>>    - the same config files would still work
>>>>
>>>> Cons would be:
>>>>
>>>>    - it may be confusing that a ParseExtractor also contains inner
>>>>      Extractors
>>>>    - more functionality moves into the ConfigLoader
>>>>    - we have to specify, for each of the Extractors, what kind of pages it
>>>>      needs to receive
>>>>
>>>>
>>>>
>>> Maybe I misunderstood, but the only difference I can see between your
>>> diagram and JC's is one level of CompositeExtractor.
>>> IMO this is just a design pattern that helps encapsulate multiple
>>> extractors for a parser, and if this is your only concern we can skip it
>>> for now.
>>>
>>> Cheers,
>>> Dimitris
>>>
>>>
>>>>  thanks
>>>> Regards
>>>>
>>>> -------------------------------------------------
>>>> Hady El-Sahar
>>>> Research Assistant
>>>> Center of Informatics Sciences | Nile University (http://nileuniversity.edu.eg/)
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Dimitris Kontokostas
>>> Department of Computer Science, University of Leipzig
>>> Research Group: http://aksw.org
>>> Homepage:http://aksw.org/DimitrisKontokostas
>>>
>>
>>
>>
>> --
>> -------------------------------------------------
>> Hady El-Sahar
>> Research Assistant
>> Center of Informatics Sciences | Nile University (http://nileuniversity.edu.eg/)
>>
>>
>>
>>
>>
>>
>
>
> --
> Dimitris Kontokostas
> Department of Computer Science, University of Leipzig
> Research Group: http://aksw.org
> Homepage:http://aksw.org/DimitrisKontokostas
>



-- 
-------------------------------------------------
Hady El-Sahar
Research Assistant
Center of Informatics Sciences | Nile University (http://nileuniversity.edu.eg/)
