Hi all,
I think we now have a working draft of the refactored core, with the
Wikidata extraction process loaded into it
[see
commit<https://github.com/hadyelsahar/extraction-framework/commit/9c2bd62a9a71e0a2a29fd268fccc4b9187758e7e>
].
The changes are:
1- Added a JsonNode class to hold the JSON values produced when a wiki page
in JSON format is parsed
2- Added Extractor[JsonNode] implementations for extracting Wikidata triples
(WikidataLLExtractor, WikidataLabelsExtractor, etc.)
3- Added new datasets for the new extractors in DBpediaDatasets.scala
4- Updated JsonWikiParser to return a JsonNode object containing the parsed JSON
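
To make changes 1 and 2 more concrete, here is a minimal sketch of how an Extractor[JsonNode] could look. Everything in it is a simplified stand-in for illustration and is not copied from the commit: the fields of JsonNode, the Quad class, the LabelsExtractorSketch name, the dataset name and the method signatures are all assumptions.

```scala
// Hypothetical sketch: all names and signatures here are illustrative only.

// Stand-in for a parsed JSON page; the real class wraps the parser's JSON value.
final case class JsonNode(pageTitle: String, labels: Map[String, String])

// Stand-in for a triple (plus dataset and language) emitted by an extractor.
final case class Quad(dataset: String, subject: String, predicate: String,
                      value: String, language: String)

// An extractor is parameterised by the type of input it consumes.
trait Extractor[T] {
  def datasets: Set[String]
  def extract(input: T, subjectUri: String): Seq[Quad]
}

// Emits one rdfs:label quad per language found in the node.
class LabelsExtractorSketch extends Extractor[JsonNode] {
  private val dataset = "wikidata_labels" // assumed dataset name
  private val rdfsLabel = "http://www.w3.org/2000/01/rdf-schema#label"

  override val datasets: Set[String] = Set(dataset)

  override def extract(node: JsonNode, subjectUri: String): Seq[Quad] =
    // Sort by language code so the output order is deterministic.
    node.labels.toSeq.sortBy(_._1).map { case (lang, text) =>
      Quad(dataset, subjectUri, rdfsLabel, text, lang)
    }
}

object LabelsExtractorDemo {
  def main(args: Array[String]): Unit = {
    val node = JsonNode("Q64", Map("en" -> "Berlin", "de" -> "Berlin"))
    val quads = new LabelsExtractorSketch()
      .extract(node, "http://wikidata.org/entity/Q64")
    quads.foreach(println)
  }
}
```

The point of the sketch is only that an extractor for JSON pages has the same shape as one for PageNode; the input type parameter is the only thing that differs.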
PS: the design of the Wikidata extraction process was developed to suit the
old design of the core. We no longer need that now that the core has
changed, so one of the next steps would be improving the design of the
Wikidata extraction (for example, having the parser return a generic JValue
instead of the JsonNode class).
PS 2: I've tested the Wikidata extractors on a sample extracted from the dump
of
20130818<https://dl.dropboxusercontent.com/u/45056835/wikidatawiki-20130818-pages-meta-hist-incr.xml>.
The internal JSON format of Wikidata has changed a little since then, so
recent dumps will raise exceptions in the JSON parser.
Thanks,
Regards
On Tue, Nov 26, 2013 at 10:55 AM, Dimitris Kontokostas <
[email protected]> wrote:
> Hi Hady,
>
>
> On Sun, Nov 24, 2013 at 9:40 PM, Hady elsahar <[email protected]>wrote:
>
>> Hello all,
>>
>> Regarding issue
>> #38<https://github.com/dbpedia/extraction-framework/issues/38> (refactoring
>> the core to accept new formats), I think the new core functionality is
>> working now. What is still needed are some modifications, your suggestions
>> for updates and, of course, merging into the main branch.
>>
>> What was done so far:
>>
>> 1- Changed the Extractor trait to accept a type argument [T] [see
>> commit<https://github.com/hadyelsahar/extraction-framework/commit/e26ef813dad098d573be34191dfaef13c78b5986>
>> ]
>> 2- Changed the CompositeExtractor class to load extractors for any input
>> type, not only PageNode [see
>> commit<https://github.com/hadyelsahar/extraction-framework/commit/17dcaa8b2988e7fc8676532fa849fff1eabec9d0>
>> ]
>>
>> 3- Refactored the core [see commit
>> <https://github.com/hadyelsahar/extraction-framework/commit/9ad75cd864d12025d2872b4e3c6cbe4d4fae3681>
>> ]
>>
>> - Added a loadToParsers method to CompositeExtractor. This method
>> will:
>>
>>    - take a list of extractors and split them by the type they accept
>>    - create a JsonParseExtractor object and load it with the Extractor[Json
>>      format] instances
>>    - create a WikiParseExtractor object and load it with the
>>      Extractor[PageNode] instances
>>    - create a CompositeExtractor object and load it with the
>>      Extractor[WikiPage] instances
>>
>> - Created a ParseExtractor class which:
>>
>>    - takes a WikiPageFormat as an argument and chooses a suitable parser
>>      for it
>>    - gets loaded with extractors
>>    - at runtime, checks whether the page has the proper WikiPageFormat; if
>>      so, parses it with that parser and passes the result to all inner
>>      extractors
>>    - WikiParseExtractor and CompositeExtractor are instances of the
>>      same class ParseExtractor but with different WikiPageFormat arguments
>>
>> good!
>
> *Next steps:*
>>
>> 1- Loading the Wikidata extractors created in the GSoC project into this branch
>>
>
> go ahead
>
> 2- In CompositeExtractor we need to check the T of each Extractor[T], but T
>> is erased at runtime. The cleanest way is to use a Scala TypeTag, which
>> needs Scala 2.10, so:
>>
>> - as a workaround I added a type enumeration to the Extractor class
>> - future work would be moving to Scala 2.10, then replacing the
>> enum with a check on TypeTags
>>
>> We talked about this and we both don't like it :)
> Creating superclasses for WikiPageExtractor, PageNodeExtractor and
> JsonExtractor would result in less code, but since we'll change it anyway
> in 2.10, leave it like this and we will fix it after the merge.
>
>
>> 3- Get rid of the RootExtractor
>>
>> *Questions:*
>> 1- Any suggestions or modifications needed ?
>>
>
> I think there are some things that could be improved but we need to see
> the whole picture first. Let's not waste further time discussing design, go
> ahead and create a working draft first and we can always improve later
>
> 2- The only difference now from JC's
> design<https://f.cloud.github.com/assets/607468/363286/1f8da62c-a1ff-11e2-99c3-bb5136accc07.png>
> is
>> that ParseExtractor passes the WikiPage to all inner extractors instead of
>> collecting them in one CompositeExtractor.
>> Collecting them doesn't really add any new functionality; it just follows
>> the pattern. So do you think we should add it?
>>
>
> I think my comment above covers your question :)
>
> Good work Hady!
>
> Best,
> Dimitris
>
>>
>>
>> thanks
>> Regards
>>
>> -------------------------------------------------
>> Hady El-Sahar
>> Research Assistant
>> Center of Informatics Sciences | Nile
>> University<http://nileuniversity.edu.eg/>
>>
>>
>>
>>
>> _______________________________________________
>> Dbpedia-developers mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>>
>>
>
>
> --
> Dimitris Kontokostas
> Department of Computer Science, University of Leipzig
> Research Group: http://aksw.org
> Homepage:http://aksw.org/DimitrisKontokostas
>
--
-------------------------------------------------
Hady El-Sahar
Research Assistant
Center of Informatics Sciences | Nile University<http://nileuniversity.edu.eg/>