Hi all,

I think we now have a working draft of the refactored core, with the
Wikidata extraction process loaded into it
[see
commit <https://github.com/hadyelsahar/extraction-framework/commit/9c2bd62a9a71e0a2a29fd268fccc4b9187758e7e>
].

The changes are:

1- Added a JsonNode class to hold JSON values when a wiki page in JSON
format is parsed
2- Added Extractor[JsonNode] implementations for extracting Wikidata triples
 (WikidataLLExtractor, WikidataLabelsExtractor, etc.)
3- Added new datasets for the new extractors in DBpediaDatasets.scala
4- Updated JsonWikiParser to return a JsonNode object containing the parsed JSON
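
To make the shape of these changes concrete, here is a minimal sketch of an extractor typed on JsonNode. It is an illustration under stated assumptions, not the code from the commit: Quad, the fields of JsonNode, the dataset name "wikidata-labels" and the object WikidataLabelsSketch are all simplified, hypothetical stand-ins.

```scala
// Simplified stand-ins for illustration only, NOT the framework's real classes.
final case class Quad(dataset: String, subject: String, predicate: String, value: String)

// Hypothetical container for a parsed JSON wiki page: its title plus a
// language-code -> label map.
final case class JsonNode(title: String, labels: Map[String, String])

// An extractor is parameterised on the kind of input it consumes.
trait Extractor[T] {
  def extract(input: T): Seq[Quad]
}

// Sketch of a labels extractor: emits one quad per language label,
// sorted by language code so the output order is deterministic.
object WikidataLabelsSketch extends Extractor[JsonNode] {
  private val RdfsLabel = "http://www.w3.org/2000/01/rdf-schema#label"

  def extract(node: JsonNode): Seq[Quad] =
    node.labels.toSeq.sortBy(_._1).map { case (lang, label) =>
      Quad("wikidata-labels", node.title, RdfsLabel, s"$label@$lang")
    }
}
```

Because the extractor only sees a JsonNode, it stays independent of how the page was parsed.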

PS: the design of the Wikidata extraction process was developed to suit the
old design of the core. We no longer need that now that the core has
changed, so one of the next steps would be improving the design of the
Wikidata extraction (for example, having the parser return a generic JValue
instead of the JsonNode class).

PS 2: I've tested the Wikidata extractors on a sample of the extracted dump
from
20130818 <https://dl.dropboxusercontent.com/u/45056835/wikidatawiki-20130818-pages-meta-hist-incr.xml>.
The internal JSON format of Wikidata has changed a little since then,
hence recent dumps will raise exceptions in the JSON parser.


thanks,
Regards


On Tue, Nov 26, 2013 at 10:55 AM, Dimitris Kontokostas <
[email protected]> wrote:

> Hi Hady,
>
>
> On Sun, Nov 24, 2013 at 9:40 PM, Hady elsahar <[email protected]>wrote:
>
>> Hello all,
>>
>> Regarding issue
>> #38 <https://github.com/dbpedia/extraction-framework/issues/38> (refactoring
>> the core to accept new formats), I think the new core functionality is
>> working now. What's still needed is some modifications, your suggestions
>> for updates, and of course merging into the main branch.
>>
>> What was done so far:
>>
>> 1- Changed the Extractor trait to accept a type parameter [T] [see
>> commit <https://github.com/hadyelsahar/extraction-framework/commit/e26ef813dad098d573be34191dfaef13c78b5986>
>> ]
>> 2- Changed the CompositeExtractor class to load extractors of any input
>> type, not only PageNode [see
>> commit <https://github.com/hadyelsahar/extraction-framework/commit/17dcaa8b2988e7fc8676532fa849fff1eabec9d0>
>> ]
>>
>> 3- Refactored the core [see commit
>> <https://github.com/hadyelsahar/extraction-framework/commit/9ad75cd864d12025d2872b4e3c6cbe4d4fae3681>
>> ]
>>
>>    - Added a loadToParsers method to CompositeExtractor. This method
>>    will:
>>       - take a list of extractors and split them by the type they accept
>>       - create a JsonParseExtractor object and load it with the
>>       Extractor[Json format] instances
>>       - create a WikiParseExtractor object and load it with the
>>       Extractor[PageNode] instances
>>       - create a CompositeExtractor object and load it with the
>>       Extractor[WikiPage] instances
>>
>>    - Created a ParseExtractor class which:
>>       - takes a WikiPageFormat as an argument and decides on a suitable
>>       parser for it
>>       - gets loaded with extractors
>>       - at runtime, checks whether the page has the proper WikiPageFormat;
>>       if so, parses it with the parser and passes the result to all inner
>>       extractors
>>       - WikiParseExtractor and CompositeExtractor are instances of the
>>       same class ParseExtractor, but with a different WikiPageFormat argument
> good!
>
>> *Next Steps:*
>>
>> 1- Load the Wikidata extractors created in the GSoC project into this branch
>>
>
> go ahead
>
>> 2- In CompositeExtractor we need to check the T of each Extractor[T], but
>> T is erased at runtime. The cleanest way is to use Scala TypeTags, which
>> need Scala 2.10, so:
>>
>>    - as a workaround, I added a type enumeration to the Extractor class
>>    - future work would be moving to Scala 2.10, then replacing the
>>    enum with a check on TypeTags
>>
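
As a rough illustration of the erasure problem and the kind of fix Scala 2.10 allows, the sketch below captures the input type when each extractor is constructed. Note the swap: it uses ClassTag rather than TypeTag, since ClassTag (also new in 2.10) lives in the standard library and is enough to tell non-generic input classes apart. The class and object names are hypothetical and do not come from the framework.

```scala
import scala.reflect.ClassTag

// Hypothetical input types standing in for the framework's page representations.
final case class WikiPage(title: String)
final case class PageNode(title: String)
final case class JsonNode(title: String)

// T itself is erased at runtime, so each extractor keeps a ClassTag[T]
// that the compiler fills in when the subclass is defined.
abstract class Extractor[T](implicit val inputTag: ClassTag[T]) {
  def name: String
}

object ErasureSketch {
  // Group extractor names by the runtime class of their input type,
  // which is what a type enumeration would otherwise have to encode by hand.
  def splitByInput(extractors: Seq[Extractor[_]]): Map[Class[_], Seq[String]] =
    extractors
      .groupBy(e => e.inputTag.runtimeClass: Class[_])
      .map { case (cls, es) => cls -> es.map(_.name) }
}
```

A ClassTag cannot distinguish Extractor[List[A]] from Extractor[List[B]]; that finer distinction is what TypeTag is for.
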
> We talked about this and neither of us likes it :)
> Creating superclasses for WikiPageExtractor, PageNodeExtractor and
> JsonExtractor would result in less code, but since we'll change it anyway
> with 2.10, leave it like this and we will fix it after the merge.
>
>
>> 3- Get rid of the RootExtractor
>>
>> *Questions:*
>> 1- Any suggestions or modifications needed ?
>>
>
> I think there are some things that could be improved but we need to see
> the whole picture first. Let's not waste further time discussing design, go
> ahead and create a working draft first and we can always improve later
>
>> 2- The only difference now from JC's
>> design <https://f.cloud.github.com/assets/607468/363286/1f8da62c-a1ff-11e2-99c3-bb5136accc07.png>
>> is that ParseExtractor passes the WikiPage to all inner extractors instead
>> of collecting them in one CompositeExtractor.
>> Doing so wouldn't really add any new functionality; it would just follow
>> the pattern. So do you think we should add it?
>>
>
> I think my comment above covers your question :)
>
> Good work Hady!
>
> Best,
> Dimitris
>
>>
>>
>> thanks
>> Regards
>>
>> -------------------------------------------------
>> Hady El-Sahar
>> Research Assistant
>> Center of Informatics Sciences | Nile 
>> University<http://nileuniversity.edu.eg/>
>>
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Shape the Mobile Experience: Free Subscription
>> Software experts and developers: Be at the forefront of tech innovation.
>> Intel(R) Software Adrenaline delivers strategic insight and game-changing
>> conversations that shape the rapidly evolving mobile landscape. Sign up
>> now.
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=63431311&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Dbpedia-developers mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>>
>>
>
>
> --
> Dimitris Kontokostas
> Department of Computer Science, University of Leipzig
> Research Group: http://aksw.org
> Homepage:http://aksw.org/DimitrisKontokostas
>



-- 
-------------------------------------------------
Hady El-Sahar
Research Assistant
Center of Informatics Sciences | Nile University<http://nileuniversity.edu.eg/>
