I haven't dug into the logic behind the SKOS categories extractor. Maybe @Chris has the answer?

On 6/27/13 5:32 PM, Andrea Di Menna wrote:
> Kasun,
>
> the SkosCategoriesExtractor will produce the file I mentioned some mails
> ago [1].
> This is why I don't think we need another extractor to process category
> data from Wikipedia articles, unless the current code is wrong of course :)
>
> I personally think you could start working directly on the SKOS
> hierarchy and focus on the interesting part of your investigation, i.e.
> leaves and parents-with-leaves-only-as-children.
>
> WDYT? Mentors?
>
> Regards
> Andrea
>
> [1] http://wiki.dbpedia.org/Downloads38#categories-skos
>
>
> 2013/6/27 kasun perera <[email protected]>
>
>     Hi Andrea
>
>     On Thu, Jun 27, 2013 at 8:43 PM, Andrea Di Menna <[email protected]> wrote:
>
>         [1] org.dbpedia.extraction.mappings.SkosCategoriesExtractor
>
>     Actually I didn't know it was there, that's why I didn't try/mention
>     it. If it is the correct option I will use it :)
>
>     thanks
>
>         2013/6/27 kasun perera <[email protected]>
>
>             On Thu, Jun 27, 2013 at 1:42 PM, Alessio Palmero Aprosio
>             <[email protected]> wrote:
>
>                 Dear Kasun,
>                 I had to deal with the same problem some months ago, and
>                 I managed to use the XML article file: you can intercept
>                 categories using the "Category:" prefix, and you can
>                 infer the parent-child relation using the <title> tag (if
>                 the <title> starts with "Category:", all the categories
>                 for this page are possible ancestors).
>                 The Wikipedia category taxonomy is quite a mess, so good
>                 luck!
>
>                 Alessio
>
>             Hi Alessio
>
>             Yes, I will try this. It seems like a good option. I hope
>             this is the file you are referring to:
>             "enwiki-20130604-pages-articles.xml.bz2" (9.2 GB)
>             <http://dumps.wikimedia.org/enwiki/20130604/enwiki-20130604-pages-articles.xml.bz2>
>
>             @marco
>             Is it a good idea to try several options (1. what I
>             described previously, 2. Alessio's suggestion, 3. any other
>             option) and do some evaluation to find out which method is
>             best for getting leaf nodes? Maybe they would give the same
>             output?
>
>             Thanks
>
>                 On 27/06/13 05:24, kasun perera wrote:
>> As discussed with Marco, these are the next tasks that
>> I will be working on.
>>
>> 1. Identification of leaf categories
>> 2. Prominent leaves discovery
>> 3. Pages clustering based on prominent leaves
>>
>> For task 1 above, I'm planning to use the Wikipedia
>> category and categorylinks SQL tables available here:
>> http://dumps.wikimedia.org/enwiki/20130604/
>>
>> The above dump files are somewhat large: 20 MB and 1.2 GB
>> respectively.
>> I'm thinking of putting this data into a MySQL
>> database and doing the processing there rather than
>> processing these files in memory. Also, the number of leaf
>> categories and prominent nodes would be large, and they
>> would need to be pushed to MySQL tables.
>>
>> I want to know whether this code should be written under
>> the extraction-framework code; if so, where should I plug
>> in this code?
>> Or is it a good idea to write it separately and
>> push it to a new repo? If I write it separately, can I use
>> a language other than Scala?
>>
>>
>> --
>> Regards
>>
>> Kasun Perera
>>
>> ------------------------------------------------------------------------------
>> This SF.net email is sponsored by Windows:
>>
>> Build for Windows Store.
>>
>> http://p.sf.net/sfu/windows-dev2dev
>>
>> _______________________________________________
>> Dbpedia-developers mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>
>             --
>             Regards
>
>             Kasun Perera
>
>     --
>     Regards
>
>     Kasun Perera
>

--
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j
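
The "leaves and parents-with-leaves-only-as-children" idea from Andrea's message can be sketched roughly as follows. This is only a minimal Python illustration, assuming the category hierarchy has already been loaded as (child, parent) pairs, for example from the skos:broader statements in the SKOS categories file or from the "Category:" links Alessio describes. The function name leaf_analysis and the sample category names are hypothetical and are not part of the extraction framework.

```python
from collections import defaultdict


def leaf_analysis(broader_pairs):
    """Return (leaves, leaf_only_parents) for an iterable of (child, parent) pairs.

    A leaf is a category with no child categories. A leaf-only parent is a
    category that has children, all of which are leaves.
    """
    children = defaultdict(set)
    categories = set()
    for child, parent in broader_pairs:
        if child == parent:
            continue  # assumption: self-loops carry no information, so skip them
        children[parent].add(child)
        categories.update((child, parent))

    # Categories that never appear as a parent have no children.
    leaves = {c for c in categories if not children.get(c)}
    # Parents whose child set is fully contained in the leaf set.
    leaf_only_parents = {p for p, kids in children.items() if kids <= leaves}
    return leaves, leaf_only_parents


# Hypothetical sample hierarchy, for illustration only.
pairs = [
    ("Category:Italian_painters", "Category:Painters"),
    ("Category:Dutch_painters", "Category:Painters"),
    ("Category:Painters", "Category:Artists"),
    ("Category:Sculptors", "Category:Artists"),
]
leaves, leaf_only_parents = leaf_analysis(pairs)
print(sorted(leaves))
print(sorted(leaf_only_parents))
```

Holding the whole graph in memory like this is only reasonable if the pair list fits in RAM; for the full English category graph, the MySQL approach Kasun mentions may be the more practical choice.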
