Hi all,
Latest changes:
1- pulled changes from the master branch after the merge with the Dump branch
2- solved the merge conflicts (the remote master branch vs. the local
core-refactoring changes)
3- the core now builds correctly and was tested on sample enwiki and Wikidata dumps
Related commits: http://bit.ly/1hK77qH , http://bit.ly/1hK7bXx ,
http://bit.ly/1hK7d1s , http://bit.ly/1hK7gdI
thanks
Regards
On Tue, Nov 26, 2013 at 5:19 PM, Hady elsahar <[email protected]> wrote:
> Hi all,
>
> i think we now have a working draft of the refactored core with the
> Wikidata extraction process loaded into it
> [See
> Commit<https://github.com/hadyelsahar/extraction-framework/commit/9c2bd62a9a71e0a2a29fd268fccc4b9187758e7e>
> ]
>
> The changes are:
>
> 1- added a JsonNode class to hold the JSON values when a wiki page in JSON
> format is parsed
> 2- added Extractor[JsonNode] extractors for the extraction of Wikidata
> triples (WikidataLLExtractor, WikidataLabelsExtractor, etc.)
> 3- added new datasets for the new extractors in DBpediaDatasets.scala
> 4- updated JsonWikiParser to return a JsonNode object containing the parsed JSON
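To make the changes above concrete, here is a minimal, self-contained sketch of the idea: a generic Extractor[T] trait and a JsonNode value holding the parsed JSON of a wiki page. The names below (LabelsExtractor, the simplified quad strings) are illustrative stand-ins, not the actual framework classes.

```scala
// Minimal sketch: a generic Extractor[T] trait and a JsonNode holding the
// parsed JSON of a wiki page. Illustrative stand-ins, not the real classes.
trait Extractor[T] {
  // quads are simplified to plain strings here
  def extract(input: T, subjectUri: String): Seq[String]
}

// Holds the parsed JSON of a wiki page whose content is JSON.
case class JsonNode(title: String, json: Map[String, Any])

// A Wikidata-style labels extractor typed on JsonNode, as in change 2 above.
class LabelsExtractor extends Extractor[JsonNode] {
  def extract(page: JsonNode, subjectUri: String): Seq[String] =
    page.json.get("labels") match {
      case Some(labels: Map[String @unchecked, String @unchecked]) =>
        labels.map { case (lang, value) =>
          s"""<$subjectUri> rdfs:label "$value"@$lang ."""
        }.toSeq
      case _ => Seq.empty
    }
}

object Demo extends App {
  val page = JsonNode("Q42", Map("labels" -> Map("en" -> "Douglas Adams")))
  new LabelsExtractor().extract(page, "http://wikidata.dbpedia.org/resource/Q42")
    .foreach(println)
}
```

Run as a script, this prints one rdfs:label triple for the sample page.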
>
> ps: the design of the Wikidata extraction process was developed to suit
> the old design of the core; we don't need that any more now that the core
> has changed. One of the next steps would be improving the design of the
> Wikidata extraction (for example, the parser returns a generic JValue
> instead of the JsonNode class)
>
> ps-2: i've tested the WikiData extractors on a sample of the dump
> extracted at
> 20130818<https://dl.dropboxusercontent.com/u/45056835/wikidatawiki-20130818-pages-meta-hist-incr.xml>
> - the internal JSON format of Wikidata has changed a little since then,
> hence recent dumps will raise exceptions in the JSON parser
>
>
> thanks,
> Regards
>
>
> On Tue, Nov 26, 2013 at 10:55 AM, Dimitris Kontokostas <
> [email protected]> wrote:
>
>> Hi Hady,
>>
>>
>> On Sun, Nov 24, 2013 at 9:40 PM, Hady elsahar <[email protected]>wrote:
>>
>>> Hello all,
>>>
>>> considering issue
>>> #38<https://github.com/dbpedia/extraction-framework/issues/38> (refactoring
>>> the core to accept new formats), i think the new core functionality is
>>> working now. What's needed is some modifications, your suggestions for
>>> updates, and of course merging into the main branch
>>>
>>> What was done so far:
>>>
>>> 1- changed the Extractor trait to accept a [T] type argument [see
>>> commit<https://github.com/hadyelsahar/extraction-framework/commit/e26ef813dad098d573be34191dfaef13c78b5986>
>>> ]
>>> 2- changed the CompositeExtractor class to load any type of class, not
>>> only PageNode [see
>>> commit<https://github.com/hadyelsahar/extraction-framework/commit/17dcaa8b2988e7fc8676532fa849fff1eabec9d0>
>>> ]
>>>
>>> 3- refactored the core [see commit
>>> <https://github.com/hadyelsahar/extraction-framework/commit/9ad75cd864d12025d2872b4e3c6cbe4d4fae3681>
>>> ]
>>>
>>> - added a loadToParsers method to CompositeExtractor; this method
>>> will:
>>>
>>> - take a list of extractors and split them by the type they accept
>>> - create a JsonParseExtractor object and load it with the
>>> Extractor[JsonNode] extractors
>>> - create a WikiParseExtractor object and load it with the
>>> Extractor[PageNode] extractors
>>> - create a CompositeExtractor object and load it with the
>>> Extractor[WikiPage] extractors
>>>
>>> - created a ParseExtractor class which:
>>>
>>> - takes a WikiPageFormat as an argument and decides the suitable
>>> parser for it
>>> - gets loaded with extractors
>>> - at runtime checks whether the page has the proper WikiPageFormat;
>>> if so, parses it with that parser and passes the result to all inner
>>> extractors
>>> - WikiParseExtractor and CompositeExtractor are instances of the
>>> same ParseExtractor class, just with different WikiPageFormat arguments
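The grouping described above can be sketched roughly like this; all names are simplified stand-ins for the real framework classes, and the parsing step itself is elided:

```scala
// Illustrative sketch of the loadToParsers idea: split a mixed list of
// extractors by the page format they accept, then wrap each group in a
// ParseExtractor that only fires for pages of its own format.
sealed trait WikiPageFormat
case object WikiTextFormat extends WikiPageFormat
case object JsonFormat extends WikiPageFormat

case class WikiPage(title: String, format: WikiPageFormat, source: String)

trait Extractor {
  def format: WikiPageFormat               // the enum workaround for the erased type parameter
  def extract(page: WikiPage): Seq[String]
}

// At runtime a ParseExtractor checks the page's format; if it matches, it
// would parse the page and pass the result to all inner extractors
// (parsing itself is elided here).
class ParseExtractor(val format: WikiPageFormat, inner: Seq[Extractor]) {
  def extract(page: WikiPage): Seq[String] =
    if (page.format == format) inner.flatMap(_.extract(page)) else Seq.empty
}

object CompositeExtractor {
  // Split the extractors by accepted format and build one ParseExtractor each.
  def loadToParsers(extractors: Seq[Extractor]): Seq[ParseExtractor] =
    extractors.groupBy(_.format).map { case (fmt, group) =>
      new ParseExtractor(fmt, group)
    }.toSeq
}

object Demo extends App {
  val jsonExtractor = new Extractor {
    val format = JsonFormat
    def extract(p: WikiPage) = Seq(s"json triple from ${p.title}")
  }
  val wikiTextExtractor = new Extractor {
    val format = WikiTextFormat
    def extract(p: WikiPage) = Seq(s"wikitext triple from ${p.title}")
  }
  val parsers = CompositeExtractor.loadToParsers(Seq(jsonExtractor, wikiTextExtractor))
  val page = WikiPage("Q42", JsonFormat, "{}")
  // Only the JSON ParseExtractor fires for a JSON-format page:
  parsers.flatMap(_.extract(page)).foreach(println)
}
```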
>>>
>>> good!
>>
>> *Next Steps:*
>>>
>>> 1- loading the WikiData extractors created in the GSoC project into this branch
>>>
>>
>> go ahead
>>
>> 2- in CompositeExtractor, in order to check for Extractor[T]: T is
>>> erased at runtime, so the cleanest way is to use a Scala TypeTag, which
>>> needs Scala 2.10, so:
>>>
>>> - as a workaround i added a type enumerator at the Extractor class
>>> - future work would be installing Scala 2.10, then replacing the
>>> enum with a check for TypeTags
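A rough sketch of the TypeTag alternative mentioned above (requires Scala 2.10+ and scala-reflect on the classpath); the extractor name and the String input type are illustrative only:

```scala
// Sketch of the TypeTag alternative (Scala 2.10+): demand an implicit
// TypeTag when an extractor is constructed, so the otherwise-erased type
// parameter T stays recoverable at runtime and the enum field becomes
// unnecessary.
import scala.reflect.runtime.universe._

trait Extractor[T] { def extract(input: T): Seq[String] }

// The context bound [T: TypeTag] captures the tag at construction time.
abstract class TaggedExtractor[T: TypeTag] extends Extractor[T] {
  val inputType: Type = typeOf[T]
}

// Hypothetical extractor over plain String input, just for illustration.
class StringExtractor extends TaggedExtractor[String] {
  def extract(input: String): Seq[String] = Seq(input)
}

object Demo extends App {
  val e = new StringExtractor
  // T survives erasure via the captured tag, so extractors can be
  // grouped by their input type without an explicit enum:
  println(e.inputType =:= typeOf[String])   // prints: true
}
```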
>>>
>> We talked about this and we both don't like it :)
>> Creating super classes WikiPageExtractor, PageNodeExtractor, and
>> JsonExtractor would result in less code, but since we'll change it anyway
>> in 2.10, leave it like this and we will fix it after the merge
>>
>>
>>> 3- Get rid of the RootExtractor
>>>
>>> *Questions:*
>>> 1- Any suggestions or modifications needed ?
>>>
>>
>> I think there are some things that could be improved, but we need to see
>> the whole picture first. Let's not waste further time discussing design;
>> go ahead and create a working draft first, and we can always improve later
>>
>> 2- the only difference now from JC's
>> Design<https://f.cloud.github.com/assets/607468/363286/1f8da62c-a1ff-11e2-99c3-bb5136accc07.png>
>> is
>>> that ParseExtractor passes the WikiPage to all inner extractors instead
>>> of collecting them in one CompositeExtractor.
>>> It doesn't really add any new functionality, just follows the pattern,
>>> so do you think we should add it?
>>>
>>
>> I think my comment above covers your question :)
>>
>> Good work Hady!
>>
>> Best,
>> Dimitris
>>
>>>
>>>
>>> thanks
>>> Regards
>>>
>>> -------------------------------------------------
>>> Hady El-Sahar
>>> Research Assistant
>>> Center of Informatics Sciences | Nile
>>> University<http://nileuniversity.edu.eg/>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Dbpedia-developers mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>>>
>>>
>>
>>
>> --
>> Dimitris Kontokostas
>> Department of Computer Science, University of Leipzig
>> Research Group: http://aksw.org
>> Homepage:http://aksw.org/DimitrisKontokostas
>>
>
>
>
> --
> -------------------------------------------------
> Hady El-Sahar
> Research Assistant
> Center of Informatics Sciences | Nile
> University<http://nileuniversity.edu.eg/>
>
>
>
--
-------------------------------------------------
Hady El-Sahar
Research Assistant
Center of Informatics Sciences | Nile University<http://nileuniversity.edu.eg/>