The language is determined by scoring the text against each installed language. Take the following example:
"Ich have a brown dog."

Most of these words exist in the English dictionary but only one exists in German. The dictionary will therefore use English to determine which of the text tokens are valid words. The one word in German (Ich) is likely due to a quality problem with the OCR. It's not actually German text, and "Ich" should not be treated as a valid word in this case. Had the rest of the words been German, "Ich" should be treated as valid.

Keep in mind that the language packs must be copied to $FELIX/conf/dictionaries/. Once the contents of those csv files are loaded into the database, the files will be deleted. This loading process needs to be done only once on a single worker node.

Josh

On Jun 29, 2011, at 8:58 AM, matpro_fhkoeln wrote:

> Hello Ladies and Gentlemen,
>
> According to
> http://opencast.jira.com/wiki/display/MHDOC/Configure+Text+Analysis+v1.1
> "Matterhorn can support any number of language packs concurrently,
> and will attempt to determine the most appropriate language for each
> video segment it analyzes."
>
> Besides, these three language packs
> http://downloads.opencastproject.org/artifacts/
> are enclosed in a fresh matterhorn installation.
>
> Path to csv-files seems slightly different:
>
> matpro@pips03:~$ ls -l /opt/matterhorn/felix/conf/dictionaries/
> insgesamt 0
>
> root@pips03:/home/matpro# find / -name de.csv
> /opt/matterhorn/1.1.0/docs/felix/conf/dictionaries/de.csv
>
> matpro@pips03:~$ ls -l /opt/matterhorn/1.1.0/docs/felix/conf/dictionaries/
> insgesamt 324
> -rw-r--r-- 1 matpro matpro 107141 17. Jun 15:05 de.csv
> -rw-r--r-- 1 matpro matpro 99998 17. Jun 15:05 en.csv
> -rw-r--r-- 1 matpro matpro 104690 17. Jun 15:05 es.csv
>
> So which criteria is used to determine the exact language pack?
> Is this detected through the media title?
>
> Thank you in advance,
> regards,
>
> [email protected]
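
For anyone who wants to see the scoring idea from the top of this reply in concrete form, here is a minimal sketch in Python. It is not Matterhorn's actual implementation; the function names and the tiny in-memory word sets are made up for illustration and stand in for the real language packs loaded from the csv files. It counts how many tokens each dictionary recognizes, picks the best-scoring language, and then treats only the tokens found in that language's dictionary as valid words.

```python
import re

# Hypothetical, tiny stand-ins for the installed language packs.
# The real dictionaries are far larger and live in the database.
DICTIONARIES = {
    "en": {"i", "have", "a", "brown", "dog"},
    "de": {"ich", "habe", "einen", "braunen", "hund"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def score_languages(text, dictionaries=DICTIONARIES):
    """Count, per language, how many tokens appear in that dictionary."""
    tokens = tokenize(text)
    return {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in dictionaries.items()
    }


def valid_words(text, dictionaries=DICTIONARIES):
    """Pick the best-scoring language and keep only tokens found in it."""
    scores = score_languages(text, dictionaries)
    best = max(scores, key=scores.get)  # ties go to the first language listed
    tokens = tokenize(text)
    return best, [token for token in tokens if token in dictionaries[best]]


if __name__ == "__main__":
    print(score_languages("Ich have a brown dog."))
    print(valid_words("Ich have a brown dog."))
    print(valid_words("Ich habe einen braunen Hund."))
```

With these toy dictionaries, the mixed sentence scores higher for English, so "ich" is dropped as OCR noise, while in the all-German sentence "ich" is kept as a valid word.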
