Am 05.03.2012 16:59, schrieb Pablo Mendes:
> Right. If lexic.tsv contains <count,uri,surfaceForm>, and these counts
> came from the Wikipedia paragraphs (occs.uriSorted.tsv), then I'd say
> you're doing it right. Do make sure you merge the (uri->sf) entries
> coming from occurrences with the ones coming from titles, redirects
> and disambiguations (TRDs), though. You can choose if you want to do
> it before or after counting. Merging before counting means that you do
> not give any special weight to TRDs. Merging after counting means that
> you consider TRDs to be a special class of mappings that deserve to be
> included even if they are not frequently occurring (e.g. helps with
> sparsity but may include spurious mappings).
>
> See (latest revision):
> https://spotlight.svn.sourceforge.net/svnroot/dbp-spotlight/trunk/bin/index.sh
>

Hi Pablo,

I have just discovered a minor problem with this script:

https://spotlight.svn.sourceforge.net/svnroot/dbp-spotlight/trunk/bin/getSurfaceFormMapFromOccs.sh

cat output/occs.uriSorted.tsv | cut -d$'\t' -f 2,3 > output/surfaceForms-fromOccs.tsv


IndexLingPipeSpotter expects the surface form at index 0, but this
script writes the URI (title) to index 0 and the surface form to
index 1. As a result, when I combine the surface forms from TitRedDis
and Occs, I end up with dictionary entries containing underscores (_).
A very simple fix would be to change the line to

cat output/occs.uriSorted.tsv | cut -d$'\t' -f 3,2 > output/surfaceForms-fromOccs.tsv

or?
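
One caveat on the line above: cut writes the selected fields in the
order they appear in the input, so -f 3,2 produces the same output as
-f 2,3. A minimal sketch of a swap using awk instead, assuming that
field 2 of occs.uriSorted.tsv is the URI and field 3 is the surface
form (the helper name sf_first is only illustrative and not part of
the Spotlight scripts):

```shell
# Hypothetical helper: print the surface form (field 3) before the URI (field 2).
# Assumes a tab-separated input with the URI in column 2 and the surface form in column 3.
sf_first() {
    awk -F'\t' 'BEGIN { OFS = "\t" } { print $3, $2 }' "$1"
}

# Usage, with the paths from the script quoted above:
#   sf_first output/occs.uriSorted.tsv > output/surfaceForms-fromOccs.tsv
```

This would put the surface form at index 0, which is what
IndexLingPipeSpotter is described as expecting.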

best regards
reinhard



_______________________________________________
Dbp-spotlight-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users