On 30 July 2010 09:42, Francis Tyers <[email protected]> wrote:
> On Fri, 30 July 2010 at 08:40 +0200, Mikel L. Forcada wrote:
>> Hi Fran,
>>
>> > You can also add 'vbmod' and 'vbser'.
>> >
>> I will see what I get by doing that. I need to study the Welsh verb
>> system a bit before I make a decision (I would add 600 more forms).
>> >
>> >> It is not clear to me whether we should have apostrophes in the list and
>> >> how. It depends on the way the tokenization goes before the chunkers are
>> >> applied, and how chunkers treat apostrophes. In this case, I should
>> >> filter the file later on.
>> >>
>> >
>> > There are lots of conjoined forms, and I'm not sure how you deal with
>> > those either, e.g.
>> >
>> > taswn:pe<cnjsub>+bod<vbser><plu><p1><sg>
>> > tasai:pe<cnjsub>+bod<vbser><plu><p3><sg>
>> > tasen:pe<cnjsub>+bod<vbser><plu><p1><pl>
>> > tasen:pe<cnjsub>+bod<vbser><plu><p3><pl>
>> > tasech:pe<cnjsub>+bod<vbser><plu><p2><pl>
>> >
>> They are candidates to start a chunk in my approach, because they
>> start with a subordinating conjunction.
>>
>> taswn:C:Sub_C
>> tasai:C:Sub_C
>> tasen:C:Sub_C
>> tasen:C:Sub_C
>> tasech:C:Sub_C
>>
>> > It might be an idea just to pre-tokenise, but then if you're working on
>> > surface forms I'm not sure how you'd do that.
>> >
>> No tokenisation (beyond spaces, etc.) is performed in a typical
>> OpenMaTrEx run. As you said, it is "surface form" stuff.
>> > Also, how do you deal with spaces, e.g. "dwn i ddim"?
>> >
>> I don't extract this one, but with
>>
>> lt-expand apertium-cy-en.cy.dix \
>>   | egrep ":[[:alpha:] ]+(<pr>|<cnjcoo>|<det>|<cnjsub>|<num>)" \
>>   | awk 'BEGIN{FS="(:([<>]:)?|[<>])"} {print$1":""z"$3"z"}' \
>>   | sed 's/zprz/P/g;s/zcnjcooz/C:Cor_CONJ/;s/zdetz/D/g;s/zcnjsubz/C:Sub_C/g;s/znumz/:Q:Card_NUM/g'
>>
>> I do have multiwords such as "i mewn i'r:P", etc.
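
For readers following the quoted pipeline, here is a rough Python sketch of
what it appears to do: keep only the expanded dictionary entries whose first
tag is a preposition, conjunction, determiner or numeral, and emit
"surface form:marker label". The function name, the regular expression and
the sample analyses for "i mewn i'r", "cath" and "dwy" are illustrative
assumptions, not taken from the dictionary. As quoted, the sed expression
would produce a doubled colon for numerals; the sketch assumes a single
colon was intended.

```python
import re

# First Apertium tag -> marker label, following the sed step quoted above.
# Assumption: numerals get a single colon ("dwy:Q:Card_NUM").
TAG_TO_MARKER = {
    "pr": "P",
    "cnjcoo": "C:Cor_CONJ",
    "det": "D",
    "cnjsub": "C:Sub_C",
    "num": "Q:Card_NUM",
}

# surface ":" optional direction mark (">:" or "<:") lemma "<" first-tag ">"
# The lemma is limited to letters and spaces, like the egrep filter.
ENTRY_RE = re.compile(
    r"^(?P<surface>[^:]+):(?:[<>]:)?(?P<lemma>(?:[^\W\d_]| )+)<(?P<tag>\w+)>"
)


def marker_entries(expanded_lines):
    """Yield 'surface:marker' for lines whose first tag is a marker category."""
    for line in expanded_lines:
        match = ENTRY_RE.match(line.strip())
        if match is None:
            continue  # not in the 'surface:lemma<tag>...' shape
        marker = TAG_TO_MARKER.get(match.group("tag"))
        if marker is None:
            continue  # first tag is not one of the five marker categories
        yield f"{match.group('surface')}:{marker}"


# Hypothetical sample lines in lt-expand's output format.
sample = [
    "taswn:pe<cnjsub>+bod<vbser><plu><p1><sg>",
    "i mewn i'r:i mewn<pr>+y<det><def><sp>",
    "cath:cath<n><f><sg>",
    "dwy:>:dau<num><f><sp>",
]
entries = list(marker_entries(sample))
for entry in entries:
    print(entry)
```

Only the first tag of the first lemma is inspected, which is why a conjoined
form such as "taswn" comes out as a subordinating conjunction and a multiword
such as "i mewn i'r" comes out as a preposition even though it ends in a
determiner.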
>
> But I think this one ends with a determiner, e.g. "into the". Should it
> still be tagged as a preposition?
>
>> BTW, is there a Breton-French corpus that could be used too?
>
> Yes, it isn't so big, but there is one available.
>
> http://elx.dlsi.ua.es/~fran/brfr_OAB_corpus.tgz
>
> If you know anyone who has any experience converting documents to
> parallel corpora (e.g. Microsoft Word documents and Excel
> spreadsheets), then I can send over a lot of raw text to be processed.

I might be able to get TMX via TEI, but I've never tried it before. If
you send me a pair of each (Word and Excel), I'll try it out.


-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
