On 7/5/13 3:26 PM, Andrea Di Menna wrote:
> Hi Marco,
>
> $ bzgrep "1612_establishments_in_Mexico" skos_categories_en.ttl.bz2
> <http://dbpedia.org/resource/Category:1612_establishments_in_Mexico> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
> <http://dbpedia.org/resource/Category:1612_establishments_in_Mexico> <http://www.w3.org/2004/02/skos/core#prefLabel> "1612 establishments in Mexico"@en .
>
> So the category referred to in the issue seems to be there in the file.

@Kasun, can you check this please?

> Would you mind:
> 1. Amending the issue with the correct datafile? (It refers to the N-Triples
>    version and not the Turtle one, i.e.
>    http://downloads.dbpedia.org/3.8/en/skos_categories_en.nt.bz2)

Sorry, bad copy/paste in the issue. Fixed, thanks for pointing it out :-)

> 2. Checking again the statement "categories that don't have a broader
>    category are not included in dump 2", which seems to be wrong?
>
> Is it possible that what you meant is that not all the categories
> referred to in the articles are in the skos file?
>
> Cheers
> Andrea
>
> 2013/7/5 Marco Fossati <[email protected]>
>
> Hi Andrea,
>
> You are checking the wrong datasets.
> We are using [1] and [2].
> Cheers,
>
> [1] http://downloads.dbpedia.org/preview.php?file=3.8_sl_en_sl_article_categories_en.ttl.bz2
> [2] http://downloads.dbpedia.org/preview.php?file=3.8_sl_en_sl_skos_categories_en.ttl.bz2
>
> On 7/5/13 3:17 PM, Andrea Di Menna wrote:
>
> On top of this:
>
> $ bzgrep "1612_establishments_in_Mexico" skos_categories_en.nt.bz2
> <http://dbpedia.org/resource/Category:1612_establishments_in_Mexico> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
>
> So the category you are referring to in [1] is present in skos_categories.
> It does not have a parent category in this case, but I think this is
> because parent categories are added using a template which might be
> unsupported in the extraction framework:
> {{estcatCountry|161|2|Mexico}}
>
> Am I wrong?
>
> [1] https://github.com/dbpedia/dbpedia-links/issues/16
>
> 2013/7/5 Andrea Di Menna <[email protected]>
>
> Hi Marco, Kasun,
>
> I am not sure I understand now.
>
> I have not yet directly verified your statement about missing
> categories in the skos data, but after a rough check
>
> $ zcat skos_categories_en.nt.gz | grep -v "^#" | cut -d">" -f1 | sort | uniq | wc -l
> 862826
> $ zcat category_labels_en.nt.gz | grep -v "^#" | sort | uniq | wc -l
> 862826
>
> That suggests all the categories appear in skos_categories (I
> presume as skos:Concept).
>
> What confuses me about your statement is this: you say
> "categories that don't have a broader category are not included in
> dump 2" (i.e. the skos_categories file), but then you say that
> http://en.wikipedia.org/wiki/Category:1612_establishments_in_Mexico
> which has a parent
> http://en.wikipedia.org/wiki/Category:1610s_establishments_in_Mexico
> does not appear in skos_categories.
> Does that mean that categories which have broader categories are also
> not included in skos_categories?
> Could you please elaborate?
>
> Kasun, as regards data freshness (Reason 2), you can recreate a
> new version of the DBpedia dataset for your personal use whenever you
> prefer (using the extraction framework).
>
> Cheers
> Andrea
>
> 2013/7/5 Marco Fossati <[email protected]>
>
> Hi Kasun,
>
> On 7/5/13 2:17 PM, kasun perera wrote:
> > I was working with skos_categories but these are some reasons that I
> > would avoid using that dataset for parent-child relationship detection.
> >
> > Reason 1
> > I need to get all leaf categories AND their child-parent relationships.
> > Categories that don't have a broader category are not included in the
> > skos_category dump. This claim is discussed here
> > https://github.com/dbpedia/dbpedia-links/issues/16
> This is a finding that should be documented if not already done.
> Could you please check and, if necessary, update the documentation?
> Also, if this is not the intended behavior of the category extractor, it
> should be reported as an issue in the extraction-framework repo.
>
> Thanks Kasun for these useful analytics.
> Cheers!
> >
> > Reason 2
> > We need to be concerned about data freshness when dealing with tasks
> > related to knowledge representation. The latest DBpedia dumps (3.8) are
> > nearly one year old. This work also needs to deal with other datasets
> > such as the Wikipedia page_edit_history, interlanguage links etc. So all
> > the datasets need to be in sync with each other, i.e. they have the same
> > dates. If I use DBpedia dumps there is a problem of finding synchronized
> > datasets.
> >
> > On Fri, Jun 28, 2013 at 3:33 PM, Alessio Palmero Aprosio <[email protected]> wrote:
> >
> > Dear Kasun,
> > I'm investigating graph DBs in this period, but I haven't tried any yet.
> > In my implementation, I'm using a Lucene index to store categories.
> > I have two fields: category name and parent. The parent is null if
> > there is no parent at all.
> > Whenever I need a path, I start from the category and go up through its
> > parents. If I encounter a category I have already encountered before, I
> > stop the loop (otherwise it would go on forever).
> >
> > You can also use a simple MySQL database with two fields, but I
> > think Lucene is faster.
> >
> > Alessio
> >
> > On 28/06/13 10:25, kasun perera wrote:
> >> Hi Alessio
> >>
> >> On Thu, Jun 27, 2013 at 1:42 PM, Alessio Palmero Aprosio
> >> <[email protected]> wrote:
> >>
> >> Dear Kasun,
> >> I had to deal with the same problem some months ago,
> >>
> >> Just curious about how you stored the edge and vertex
> >> relationships when processing the categories.
> >> In-memory processing would be difficult since it has a huge number
> >> of edges and vertices, so I think it's good to store them in a
> >> database.
> >> I have heard about graph databases [1], but haven't worked with
> >> them. Did you use something like that or a simple MySQL database?
> >>
> >> [1] http://en.wikipedia.org/wiki/Graph_database
> >>
> >> and I managed to use the XML article file: you can intercept
> >> categories using the "Category:" prefix, and you can infer the
> >> father-son relation using the <title> tag (if the <title>
> >> starts with "Category:", all the categories for this page are
> >> possible ancestors).
> >> The Wikipedia category taxonomy is quite a mess, so good luck!
> >>
> >> Alessio
> >>
> >> On 27/06/13 05:24, kasun perera wrote:
> >>> As discussed with Marco, these are the next tasks that I will
> >>> be working on.
> >>>
> >>> 1. Identification of leaf categories
> >>> 2. Prominent leaves discovery
> >>> 3. Pages clustering based on prominent leaves
> >>>
> >>> For task 1 above, I'm planning to use the Wikipedia category and
> >>> category_links SQL tables available here.
> >>> http://dumps.wikimedia.org/enwiki/20130604/
> >>>
> >>> The above dump files are somewhat large, 20 MB and 1.2 GB in size
> >>> respectively.
> >>> I'm thinking of putting this data into a MySQL database and
> >>> doing the processing there rather than processing these files in memory.
> >>> Also, the number of leaf categories and prominent nodes would
> >>> be large, and they would need to be pushed to MySQL tables.
> >>>
> >>> I want to know whether this code should be written under the
> >>> extraction-framework code; if so, where should I plug this code in?
> >>> Or is it a good idea to write it separately and push it
> >>> to a new repo? If I write it separately, can I use a language
> >>> other than Scala?
> >>>
> >>> --
> >>> Regards
> >>>
> >>> Kasun Perera
--
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j

------------------------------------------------------------------------------
This SF.net email is sponsored by Windows:

Build for Windows Store.

http://p.sf.net/sfu/windows-dev2dev
_______________________________________________
Dbpedia-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
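
For reference, the parent lookup Alessio describes in the thread (a category-to-parent mapping, walking upward and stopping when a category repeats) could look roughly like the sketch below. It is not his Lucene implementation: it uses a plain in-memory dictionary, allows several parents per category, and treats a "leaf" as a category that never appears as a parent, all of which are assumptions. The function names and the `Loop_A`/`Loop_B` sample categories are hypothetical; only the first edge comes from the thread.

```python
from collections import defaultdict

# Sample (child, parent) edges. Only the first pair comes from the thread;
# Loop_A and Loop_B are made-up names that form a cycle on purpose.
EDGES = [
    ("1612_establishments_in_Mexico", "1610s_establishments_in_Mexico"),
    ("1610s_establishments_in_Mexico", "Loop_A"),
    ("Loop_A", "Loop_B"),
    ("Loop_B", "Loop_A"),
]


def build_parent_index(edges):
    """Map each category to the set of its direct parents."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)
    return parents


def ancestors(category, parents):
    """Collect every category reachable by following parent links upward.

    The `seen` set plays the role of "stop if I already encountered this
    category", so the walk terminates even when the taxonomy has cycles.
    """
    seen = {category}
    stack = [category]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    # The starting category is not reported as its own ancestor.
    return seen - {category}


def leaf_categories(edges):
    """Categories that occur as a child but never as a parent (assumed leaf definition)."""
    children = {child for child, _ in edges}
    parent_set = {parent for _, parent in edges}
    return children - parent_set


if __name__ == "__main__":
    index = build_parent_index(EDGES)
    print(sorted(ancestors("1612_establishments_in_Mexico", index)))
    print(sorted(leaf_categories(EDGES)))
```

With the real data, the edges would come from wherever the parent-child pairs are stored (the SQL tables or the SKOS dump mentioned above); at that scale a database-backed lookup, as discussed in the thread, may be more practical than a dictionary.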
