Hi all,
Latest changes:
1- pulled changes from the master branch after the merge with the Dump branch
2- solved the merge conflicts (the remote master branch vs. the local
core-refactoring changes)
3- the core now builds correctly and was tested on sample enwiki and Wikidata dumps
Related commits: http://bit.ly/1hK77qH , http://bit.ly/1hK7bXx ,
http://bit.ly/1hK7d1s , http://bit.ly/1hK7gdI
thanks
Regards
On Tue, Nov 26, 2013 at 5:19 PM, Hady elsahar <[email protected]> wrote:
> Hi all,
>
> i think we now have a working draft of the refactored core with the
> Wikidata extraction process loaded into it
> [See
> Commit<https://github.com/hadyelsahar/extraction-framework/commit/9c2bd62a9a71e0a2a29fd268fccc4b9187758e7e>
> ]
>
> The changes are:
>
> 1- added a JsonNode class to hold the JSON values when a wiki page in JSON
> format is parsed
> 2- added Extractor[JsonNode] extractors for the extraction of Wikidata
> triples (WikidataLLExtractor, WikidataLabelsExtractor, etc.)
> 3- added new datasets for the new extractors in DBpediaDatasets.scala
> 4- updated JsonWikiParser to return a JsonNode object containing the parsed JSON
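To make the changes above concrete, here is a minimal, self-contained sketch of the idea: a generic Extractor[T] trait and a JsonNode value holding the parsed JSON of a wiki page. The names below (LabelsExtractor, the simplified quad strings) are illustrative stand-ins, not the actual framework classes.

```scala
// Minimal sketch: a generic Extractor[T] trait and a JsonNode holding the
// parsed JSON of a wiki page. Illustrative stand-ins, not the real classes.
trait Extractor[T] {
  // quads are simplified to plain strings here
  def extract(input: T, subjectUri: String): Seq[String]
}

// Holds the parsed JSON of a wiki page whose content is JSON.
case class JsonNode(title: String, json: Map[String, Any])

// A Wikidata-style labels extractor typed on JsonNode, as in change 2 above.
class LabelsExtractor extends Extractor[JsonNode] {
  def extract(page: JsonNode, subjectUri: String): Seq[String] =
    page.json.get("labels") match {
      case Some(labels: Map[String @unchecked, String @unchecked]) =>
        labels.map { case (lang, value) =>
          s"""<$subjectUri> rdfs:label "$value"@$lang ."""
        }.toSeq
      case _ => Seq.empty
    }
}

object Demo extends App {
  val page = JsonNode("Q42", Map("labels" -> Map("en" -> "Douglas Adams")))
  new LabelsExtractor().extract(page, "http://wikidata.dbpedia.org/resource/Q42")
    .foreach(println)
}
```

Run as a script, this prints one rdfs:label triple for the sample page.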
>
> ps: the design of the Wikidata extraction process was developed to suit
> the old design of the core; we don't need that any more now that the core
> has changed. One of the next steps would be improving the design of the
> Wikidata extraction (for example, the parser returns a generic JValue
> instead of the JsonNode class)
>
> ps-2: i've tested the WikiData extractors on a sample of the dump
> extracted at
> 20130818<https://dl.dropboxusercontent.com/u/45056835/wikidatawiki-20130818-pages-meta-hist-incr.xml>
> - the internal JSON format of Wikidata has changed a little since then,
> hence recent dumps will raise exceptions in the JSON parser
>
>
> thanks,
> Regards
>
>
> On Tue, Nov 26, 2013 at 10:55 AM, Dimitris Kontokostas <
> [email protected]> wrote:
>
>> Hi Hady,
>>
>>
>> On Sun, Nov 24, 2013 at 9:40 PM, Hady elsahar <[email protected]>wrote:
>>
>>> Hello all,
>>>
>>> considering issue
>>> #38<https://github.com/dbpedia/extraction-framework/issues/38> (refactoring
>>> the core to accept new formats), i think the new core functionality is
>>> working now. What's needed is some modifications, your suggestions for
>>> updates, and of course merging into the main branch
>>>
>>> What was done so far:
>>>
>>> 1- changed the Extractor trait to accept a [T] type argument [see
>>> commit<https://github.com/hadyelsahar/extraction-framework/commit/e26ef813dad098d573be34191dfaef13c78b5986>
>>> ]
>>> 2- changed the CompositeExtractor class to load any type of class, not
>>> only PageNode [see
>>> commit<https://github.com/hadyelsahar/extraction-framework/commit/17dcaa8b2988e7fc8676532fa849fff1eabec9d0>
>>> ]
>>>
>>> 3- refactored the core [see commit
>>> <https://github.com/hadyelsahar/extraction-framework/commit/9ad75cd864d12025d2872b4e3c6cbe4d4fae3681>
>>> ]
>>>
>>> - added a loadToParsers method to CompositeExtractor; this method
>>> will:
>>>
>>> - take a list of extractors and split them by the type they accept
>>> - create a JsonParseExtractor object and load it with the
>>> Extractor[JsonNode] extractors
>>> - create a WikiParseExtractor object and load it with the
>>> Extractor[PageNode] extractors
>>> - create a CompositeExtractor object and load it with the
>>> Extractor[WikiPage] extractors
>>>
>>> - created a ParseExtractor class which:
>>>
>>> - takes a WikiPageFormat as an argument and decides the suitable
>>> parser for it
>>> - gets loaded with extractors
>>> - at runtime checks whether the page has the proper WikiPageFormat;
>>> if so, parses it with that parser and passes the result to all inner
>>> extractors
>>> - WikiParseExtractor and CompositeExtractor are instances of the
>>> same ParseExtractor class, just with different WikiPageFormat arguments
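The grouping described above can be sketched roughly like this; all names are simplified stand-ins for the real framework classes, and the parsing step itself is elided:

```scala
// Illustrative sketch of the loadToParsers idea: split a mixed list of
// extractors by the page format they accept, then wrap each group in a
// ParseExtractor that only fires for pages of its own format.
sealed trait WikiPageFormat
case object WikiTextFormat extends WikiPageFormat
case object JsonFormat extends WikiPageFormat

case class WikiPage(title: String, format: WikiPageFormat, source: String)

trait Extractor {
  def format: WikiPageFormat               // the enum workaround for the erased type parameter
  def extract(page: WikiPage): Seq[String]
}

// At runtime a ParseExtractor checks the page's format; if it matches, it
// would parse the page and pass the result to all inner extractors
// (parsing itself is elided here).
class ParseExtractor(val format: WikiPageFormat, inner: Seq[Extractor]) {
  def extract(page: WikiPage): Seq[String] =
    if (page.format == format) inner.flatMap(_.extract(page)) else Seq.empty
}

object CompositeExtractor {
  // Split the extractors by accepted format and build one ParseExtractor each.
  def loadToParsers(extractors: Seq[Extractor]): Seq[ParseExtractor] =
    extractors.groupBy(_.format).map { case (fmt, group) =>
      new ParseExtractor(fmt, group)
    }.toSeq
}

object Demo extends App {
  val jsonExtractor = new Extractor {
    val format = JsonFormat
    def extract(p: WikiPage) = Seq(s"json triple from ${p.title}")
  }
  val wikiTextExtractor = new Extractor {
    val format = WikiTextFormat
    def extract(p: WikiPage) = Seq(s"wikitext triple from ${p.title}")
  }
  val parsers = CompositeExtractor.loadToParsers(Seq(jsonExtractor, wikiTextExtractor))
  val page = WikiPage("Q42", JsonFormat, "{}")
  // Only the JSON ParseExtractor fires for a JSON-format page:
  parsers.flatMap(_.extract(page)).foreach(println)
}
```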
>>>
>>> good!
>>
>> *Next Steps:*
>>>
>>> 1- loading the WikiData extractors created in the GSoC project into this branch
>>>
>>
>> go ahead
>>
>> 2- in CompositeExtractor, in order to check for Extractor[T]: T is
>>> erased at runtime, so the cleanest way is to use a Scala TypeTag, which
>>> needs Scala 2.10, so:
>>>
>>> - as a workaround i added a type enumerator at the Extractor class
>>> - future work would be installing Scala 2.10, then replacing the
>>> enum with a check for TypeTags
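A rough sketch of the TypeTag alternative mentioned above (requires Scala 2.10+ and scala-reflect on the classpath); the extractor name and the String input type are illustrative only:

```scala
// Sketch of the TypeTag alternative (Scala 2.10+): demand an implicit
// TypeTag when an extractor is constructed, so the otherwise-erased type
// parameter T stays recoverable at runtime and the enum field becomes
// unnecessary.
import scala.reflect.runtime.universe._

trait Extractor[T] { def extract(input: T): Seq[String] }

// The context bound [T: TypeTag] captures the tag at construction time.
abstract class TaggedExtractor[T: TypeTag] extends Extractor[T] {
  val inputType: Type = typeOf[T]
}

// Hypothetical extractor over plain String input, just for illustration.
class StringExtractor extends TaggedExtractor[String] {
  def extract(input: String): Seq[String] = Seq(input)
}

object Demo extends App {
  val e = new StringExtractor
  // T survives erasure via the captured tag, so extractors can be
  // grouped by their input type without an explicit enum:
  println(e.inputType =:= typeOf[String])   // prints: true
}
```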
>>>
>> We talked about this and we both don't like it :)
>> Creating super classes WikiPageExtractor, PageNodeExtractor, and
>> JsonExtractor would result in less code, but since we'll change it anyway
>> in 2.10, leave it like this and we will fix it after the merge
>>
>>
>>> 3- Get rid of the RootExtractor
>>>
>>> *Questions:*
>>> 1- Any suggestions or modifications needed ?
>>>
>>
>> I think there are some things that could be improved, but we need to see
>> the whole picture first. Let's not waste further time discussing design;
>> go ahead and create a working draft first, and we can always improve later
>>
>> 2- the only difference now from JC's
>> Design<https://f.cloud.github.com/assets/607468/363286/1f8da62c-a1ff-11e2-99c3-bb5136accc07.png>
>> is
>>> that ParseExtractor passes the WikiPage to all inner extractors instead
>>> of collecting them in one CompositeExtractor.
>>> It doesn't really add any new functionality, just follows the pattern,
>>> so do you think we should add it?
>>>
>>
>> I think my comment above covers your question :)
>>
>> Good work Hady!
>>
>> Best,
>> Dimitris
>>
>>>
>>>
>>> thanks
>>> Regards
>>>
>>> -------------------------------------------------
>>> Hady El-Sahar
>>> Research Assistant
>>> Center of Informatics Sciences | Nile
>>> University<http://nileuniversity.edu.eg/>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Dbpedia-developers mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>>>
>>>
>>
>>
>> --
>> Dimitris Kontokostas
>> Department of Computer Science, University of Leipzig
>> Research Group: http://aksw.org
>> Homepage:http://aksw.org/DimitrisKontokostas
>>
>
>
>
> --
> -------------------------------------------------
> Hady El-Sahar
> Research Assistant
> Center of Informatics Sciences | Nile
> University<http://nileuniversity.edu.eg/>
>
>
>
--
-------------------------------------------------
Hady El-Sahar
Research Assistant
Center of Informatics Sciences | Nile University<http://nileuniversity.edu.eg/>