On 05.03.2012 16:59, Pablo Mendes wrote:
> Right. If lexic.tsv contains <count,uri,surfaceForm>, and these counts
> came from the Wikipedia paragraphs (occs.uriSorted.tsv), then I'd say
> you're doing it right. Do make sure you merge the (uri->sf) entries
> coming from occurrences with the ones coming from titles, redirects
> and disambiguations (TRDs), though. You can choose whether to do
> it before or after counting. Merging before counting means that you do
> not give any special weight to TRDs. Merging after counting means that
> you consider TRDs to be a special class of mappings that deserve to be
> included even if they are not frequently occurring (e.g. helps with
> sparsity but may include spurious mappings).
>
> See (latest revision):
> https://spotlight.svn.sourceforge.net/svnroot/dbp-spotlight/trunk/bin/index.sh
>
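
The two merge orders described in the quoted message can be sketched roughly as follows. This is only an illustration under assumptions: both input files are taken to hold one "uri<TAB>surfaceForm" pair per line, the function names are hypothetical, and the minimum-count threshold `min` is an assumed filter that the thread does not specify. It is not the logic of index.sh.

```shell
# Hypothetical helpers; inputs are assumed to be "uri<TAB>surfaceForm" lines.
# $1 = pairs from occurrences, $2 = pairs from TRDs, $3 = assumed minimum count.

# Merge BEFORE counting: TRD pairs are counted like any other occurrence,
# so a TRD-only pair can still fall below the threshold and be dropped.
merge_before_counting() {
  cat "$1" "$2" | sort | uniq -c |
    awk -v min="$3" '$1 >= min { sub(/^ *[0-9]+ /, ""); print }' | sort -u
}

# Merge AFTER counting: only occurrence pairs are thresholded,
# then every TRD pair is added regardless of how often it occurs.
merge_after_counting() {
  {
    sort "$1" | uniq -c |
      awk -v min="$3" '$1 >= min { sub(/^ *[0-9]+ /, ""); print }'
    cat "$2"
  } | sort -u
}
```

With a threshold of 2, a pair that appears only once in the TRD file is dropped by the first function and kept by the second, which is the trade-off between sparsity and spurious mappings mentioned above.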
Hi Pablo,

I have just discovered a minor problem with this script:
https://spotlight.svn.sourceforge.net/svnroot/dbp-spotlight/trunk/bin/getSurfaceFormMapFromOccs.sh

    cat output/occs.uriSorted.tsv | cut -d$'\t' -f 2,3 > output/surfaceForms-fromOccs.tsv

IndexLingPipeSpotter expects the surface form at index 0, but this command writes the title to index 0 and the surface form to index 1. As a result, when I combine the surface forms from TitRedDis and Occs, I end up with dictionary entries that contain underscores (_).

A very simple fix would be to change the line to

    cat output/occs.uriSorted.tsv | cut -d$'\t' -f 3,2 > output/surfaceForms-fromOccs.tsv

right?

Best regards,
Reinhard
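
One caveat on the proposed `-f 3,2`: `cut` prints the selected fields in the order they occur in the input, not in the order they are listed, so `-f 3,2` produces the same output as `-f 2,3`. A minimal sketch of an actual column swap with awk follows, assuming (as described in the message) that column 2 of occs.uriSorted.tsv is the title and column 3 is the surface form. The helper name `swap_uri_sf` and the sample line are made up for illustration.

```shell
# Hypothetical helper: print the surface form (column 3) first,
# then the title (column 2), tab-separated.
# Reads the files given as arguments, or stdin if none are given.
swap_uri_sf() {
  awk -F'\t' 'BEGIN { OFS = "\t" } { print $3, $2 }' "$@"
}

# Made-up sample line in the assumed column layout.
printf 'id1\tBerlin_Wall\tBerlin Wall\tsome context\n' | swap_uri_sf
```

Applied to the script, this would amount to something like `swap_uri_sf output/occs.uriSorted.tsv > output/surfaceForms-fromOccs.tsv`, though whether other steps of the pipeline rely on the current column order would need to be checked.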
