On 30 July 2010 09:42, Francis Tyers <[email protected]> wrote:
> On Fri, 30 Jul 2010 at 08:40 +0200, Mikel L. Forcada wrote:
>> Hi Fran,
>>
>> > You can also add 'vbmod' and 'vbser'.
>> >
>> I will see what I get by doing that. I need to study the Welsh verb
>> system a bit, before I make a decision (I add 600 more forms).
>> >
>> >> It is not clear to me whether we should have apostrophes in the list and
>> >> how. It depends on the way the tokenization goes before the chunkers are
>> >> applied, and how chunkers treat apostrophes. In this case, I should
>> >> filter the file later on.
>> >>
>> >
>> > There are lots of conjoined forms, which I'm not sure how you deal with
>> > too... e.g.
>> >
>> > taswn:pe<cnjsub>+bod<vbser><plu><p1><sg>
>> > tasai:pe<cnjsub>+bod<vbser><plu><p3><sg>
>> > tasen:pe<cnjsub>+bod<vbser><plu><p1><pl>
>> > tasen:pe<cnjsub>+bod<vbser><plu><p3><pl>
>> > tasech:pe<cnjsub>+bod<vbser><plu><p2><pl>
>> >
>> They are candidates to start a chunk in my approximation, because they
>> start with a subordinating conjunction.
>>
>> taswn:C:Sub_C
>> tasai:C:Sub_C
>> tasen:C:Sub_C
>> tasen:C:Sub_C
>> tasech:C:Sub_C
>>
>> > It might be an idea just to pre-tokenise, but then if you're working on
>> > surface forms I'm not sure how you'd do that.
>> >
>> No tokenisation (beyond spaces, etc.) is performed in a typical
>> OpenMaTrEx run. As you said, it is "surface form" stuff.
>> > Also, how to deal with spaces, e.g. "dwn i ddim"
>> >
>> I don't extract this one, but with
>>
>> lt-expand apertium-cy-en.cy.dix |
>>   egrep ":[[:alpha:] ]+(<pr>|<cnjcoo>|<det>|<cnjsub>|<num>)" |
>>   awk 'BEGIN{FS="(:([<>]:)?|[<>])"} {print$1":""z"$3"z"}' |
>>   sed 's/zprz/P/g;s/zcnjcooz/C:Cor_CONJ/;s/zdetz/D/g;s/zcnjsubz/C:Sub_C/g;s/znumz/:Q:Card_NUM/g'
>>
>> I do have multiwords such as "i mewn i'r:P", etc.
>
> But this one ends with a determiner I think, e.g. "into the", should it
> still be tagged as a preposition?
>
>> BTW, is there a Breton-French corpus that could be used too?
>
> Yes, it isn't so big, but there is one available.
>
> http://elx.dlsi.ua.es/~fran/brfr_OAB_corpus.tgz
>
> If you know anyone who has any experience converting documents to
> parallel corpora (e.g. microsoft 'word' documents and 'excel'
> spreadsheets), then I can send over a lot of raw text to be processed.

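As an aside on the marker extraction quoted above: the egrep/awk/sed pipeline could probably also be written as a few lines of Python, which might be easier to adapt when more tags (such as 'vbmod' or 'vbser') are added. What follows is only a rough sketch under some assumptions. It reads lines in the `surface:lemma<tag>...` shape that lt-expand prints (with an optional `:>:` or `:<:` direction mark, as the awk field separator suggests); it does not run lt-expand itself. The names `TAG_TO_MARKER`, `LINE_RE` and `marker_entry` are made up for illustration. One deliberate difference: the quoted sed expression maps `num` to `:Q:Card_NUM`, which would give a doubled colon in the output; the sketch assumes that is unintended and uses `Q:Card_NUM`.

```python
import re

# First Apertium tag -> marker category, following the quoted sed expression.
TAG_TO_MARKER = {
    "pr": "P",
    "cnjcoo": "C:Cor_CONJ",
    "det": "D",
    "cnjsub": "C:Sub_C",
    # Assumption: the leading colon in the quoted ':Q:Card_NUM' is unintended.
    "num": "Q:Card_NUM",
}

# surface ':' [optional '>:' or '<:'] lemma (letters and spaces only) '<first tag>'
LINE_RE = re.compile(r"^([^:]+):(?:[<>]:)?((?:[^\W\d_]| )+)<(\w+)>")


def marker_entry(line):
    """Return 'surface:MARKER' for one lt-expand style line, or None to skip it."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    surface, _lemma, first_tag = match.groups()
    marker = TAG_TO_MARKER.get(first_tag)
    if marker is None:
        return None  # first tag is not one of the marker categories
    return f"{surface}:{marker}"


if __name__ == "__main__":
    # Sample lines taken from earlier in this thread.
    samples = [
        "taswn:pe<cnjsub>+bod<vbser><plu><p1><sg>",
        "tasech:pe<cnjsub>+bod<vbser><plu><p2><pl>",
    ]
    for sample in samples:
        entry = marker_entry(sample)
        if entry is not None:
            print(entry)  # first line printed: taswn:C:Sub_C
```

Like the shell version, this only looks at the first tag after the lemma, so a conjoined form is classified by its first component and a multiword that ends in a determiner would still come out as a preposition.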
I might be able to get TMX via TEI, but I've never tried before. If you
send me a pair of each (word and excel), I'll try it out.

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
