On 5 March 2012 15:59, Pablo Mendes <[email protected]> wrote:
>
>> I've been using
>> awk -F'\t' '($1>=3){print $0}' < lexic.tsv
>>
>> where lexic.tsv is the input to
>> org.dbpedia.spotlight.util.CreateLexicalizations - I guess now is a
>> good time to find out if I'm doing it wrong :)
>
> Right. If lexic.tsv contains <count,uri,surfaceForm>, and these counts came
> from the Wikipedia paragraphs (occs.uriSorted.tsv), then I'd say you're doing
> it right. Do make sure you merge the (uri->sf) entries coming from
> occurrences with the ones coming from titles, redirects and disambiguations
> (TRDs), though.

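For reference, the count filter and the merge with TRDs described above can be sketched in shell roughly as follows. This is only a minimal sketch under stated assumptions, not what bin/index.sh does: it assumes lexic.tsv is tab-separated count, uri, surfaceForm as described, and that the TRD pairs sit in a separate two-column uri, surfaceForm file. The file name trd.tsv and both function names are made up for illustration.

```shell
#!/bin/sh
# Assumed layouts (not confirmed by the thread):
#   lexic.tsv: count<TAB>uri<TAB>surfaceForm
#   trd.tsv:   uri<TAB>surfaceForm   (hypothetical file name)

# Keep rows whose count (column 1) is at least MIN; reads stdin.
# Usage: filter_min_count MIN < lexic.tsv
filter_min_count() {
  awk -F'\t' -v min="$1" '($1 + 0 >= min + 0) { print $0 }'
}

# Threshold the occurrence counts first, then add every TRD pair
# regardless of frequency, and drop duplicate (uri, surfaceForm) pairs.
# Usage: merge_after_counting MIN LEXIC_FILE TRD_FILE
merge_after_counting() {
  {
    filter_min_count "$1" < "$2" | cut -f2,3
    cat "$3"
  } | LC_ALL=C sort -u
}
```

Something like `merge_after_counting 3 lexic.tsv trd.tsv > merged.tsv` would then give one uri, surfaceForm pair per line. Because the TRD pairs bypass the threshold, rare titles and redirects survive the filter.
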
Aha. I had been missing that step.

Also, while we're on this topic, I notice that things like
'[[las]]ach' are being extracted with the surface form 'las', and not
'lasach', as I'd expected. I guess it's not necessary for the DBpedia
extraction framework, and ISTR that the relevant piece of MediaWiki was
particularly horrible, but it's something that may be worth adding to
a FAQ.

> You can choose if you want to do it before or after
> counting. Merging before counting means that you do not give any special
> weight to TRDs. Merging after counting means that you consider TRDs to be a
> special class of mappings that deserve to be included even if they are not
> frequently occurring (e.g. helps with sparsity but may include spurious
> mappings).
>
> See (latest revision):
> https://spotlight.svn.sourceforge.net/svnroot/dbp-spotlight/trunk/bin/index.sh
>
> I do a basic concatenation there. This means that occurrences in Wikipedia
> pointing at redirects and disambiguations would be missed. Best would be to
> extend ExtractCandidateMap to already read in the occs, and do the same job
> we currently do with cut/sort/grep/sed, plus the transitive closure of URIs.
> We would love it if anybody volunteered to send us that patch.
> ( https://sourceforge.net/tracker/?func=detail&aid=3497056&group_id=399595&atid=1657035
> ) Otherwise, whenever I have some time I'll work on it and include it in the
> next release.

Might be worth making a list of project ideas, big and small. "I
wanted to contribute, but I didn't know where to start" is a common
enough reason given for not contributing to open source.

-- 
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you
