Tino Didriksen <[email protected]> wrote:

>> On the other hand the format seems simple and it is clear parsing it with
>> any programming language is not that hard. Everyone says they have just
>> come up with some of their own methods, but then there are quite many
>> corner cases with the way output varies, so reinventing how to parse this
>> format again seems a bit unnecessary. I would normally work further with
>> the results in R and Python, so getting the output without information loss
>> into any of these would do.

If you want to use some ready-made tools to get things into Python, you
could use cg-conv and Apertium's streamparser.py. Here's a small session
showing its usage:

```
$ git clone https://github.com/goavki/streamparser
Cloning into 'streamparser'...
remote: Counting objects: 142, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 142 (delta 0), reused 0 (delta 0), pack-reused 139
Receiving objects: 100% (142/142), 33.99 KiB | 280.00 KiB/s, done.
Resolving deltas: 100% (76/76), done.
$ cd streamparser/
$ cat /tmp/kom
"<карын>"
	"кар" Hom1 N Sg Ine @HNOUN #1->0
"<and>"
	"and" CC @Conj
"<so>"
	"so" Adv <guess>
	"so" PreAdv @Thing
"<on>"
	"on" Adv @Other
	"on" Pr @Meh
$ cat /tmp/kom | cg-conv -A
^карын/кар<Hom1><N><Sg><Ine><#1->0><@HNOUN>$^and/and<CC><@Conj>$^so/so<Adv><<guess>>/so<PreAdv><@Thing>$^on/on<Adv><@Other>/on<Pr><@Meh>$$
$ # And now to transform into whatever structure we want in Python, say "form\tsyntags\tmain-pos":
$ cat /tmp/kom | cg-conv -A | python3 -c 'import streamparser
import sys
for blank, lu in streamparser.parse_file(sys.stdin, withText=True):
    print(blank+lu.wordform,end="\t")
    tags = [tag for reading in lu.readings for sub in reading for tag in sub.tags]
    print([t for t in tags if t.startswith("@")], end="\t")
    print([t for t in tags if t in ["N", "Adv", "Pr", "PreAdv"]], end="\n")
'
карын	['@HNOUN']	['N']
and	['@Conj']	[]
so	['@Thing']	['Adv', 'PreAdv']
on	['@Other', '@Meh']	['Adv', 'Pr']
```

I don't know what information "cg-conv -A" loses, but it does keep the
important stuff, e.g. lemma, wordform, readings, subreadings and even
"blanks"/formatting in between cohorts.

best regards,
Kevin Brubeck Unhammer
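
If the results should end up in R as well, a tab-separated file is probably easier to load than the printed Python lists in the session above. Here is a minimal sketch of the same transformation as a small script. The names `MAIN_POS`, `cohort_row`, `write_tsv` and `main`, the `|` separator for multiple values and the header row are my own choices for illustration, not part of streamparser. The `parse_file(sys.stdin, withText=True)` call and the `wordform`/`readings`/`tags` attributes are taken from the session above; the keyword may be spelled differently in other streamparser versions.

```python
import csv
import sys

# Same illustrative set of "main POS" tags as in the session above.
MAIN_POS = {"N", "Adv", "Pr", "PreAdv"}


def cohort_row(wordform, tags):
    """Return [form, syntags, main-pos] for one cohort.

    Multiple values in one column are joined with "|" (an arbitrary choice).
    """
    syntags = [t for t in tags if t.startswith("@")]
    pos = [t for t in tags if t in MAIN_POS]
    return [wordform, "|".join(syntags), "|".join(pos)]


def write_tsv(cohorts, out):
    """Write a header plus one tab-separated row per (wordform, tags) pair."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["form", "syntags", "main_pos"])
    for wordform, tags in cohorts:
        writer.writerow(cohort_row(wordform, tags))


def main():
    # Imported here so the helpers above can be used without streamparser.
    import streamparser

    cohorts = (
        (lu.wordform,
         [tag for reading in lu.readings for sub in reading for tag in sub.tags])
        for _blank, lu in streamparser.parse_file(sys.stdin, withText=True)
    )
    write_tsv(cohorts, sys.stdout)
```

To use it, add a call to `main()` at the bottom, save it next to streamparser.py under some name of your choosing, and pipe the output of `cg-conv -A` into it. The resulting file should be readable with `read.delim` in R or with the `csv` module in Python. Like the one-liner in the session, this flattens all readings of a cohort into one row, so it does lose the information about which tag belongs to which reading.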
