Tino Didriksen <[email protected]> wrote:

>> On the other hand the format seems simple and it is clear parsing it with
>> any programming language is not that hard. Everyone says they have just
>> come up with some of their own methods, but then there are quite many
>> corner cases with the way output varies, so reinventing how to parse this
>> format again seems a bit unnecessary. I would normally work further with
>> the results in R and Python, so getting the output without information loss
>> into any of these would do.

If you want to use some ready-made tools to get things into Python, you
could use cg-conv and Apertium's streamparser.py.

Here's a small session showing its usage:

```
$ git clone https://github.com/goavki/streamparser
Cloning into 'streamparser'...
remote: Counting objects: 142, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 142 (delta 0), reused 0 (delta 0), pack-reused 139
Receiving objects: 100% (142/142), 33.99 KiB | 280.00 KiB/s, done.
Resolving deltas: 100% (76/76), done.
$ cd streamparser/
$ cat /tmp/kom
"<карын>"
        "кар" Hom1 N Sg Ine @HNOUN #1->0
"<and>"
        "and" CC @Conj

"<so>"
        "so" Adv <guess>
        "so" PreAdv @Thing

"<on>"
        "on" Adv @Other
        "on" Pr @Meh
$ cat /tmp/kom | cg-conv -A
^карын/кар<Hom1><N><Sg><Ine><#1->0><@HNOUN>$^and/and<CC><@Conj>$^so/so<Adv><<guess>>/so<PreAdv><@Thing>$^on/on<Adv><@Other>/on<Pr><@Meh>$$
 
$ # And now to transform into whatever structure we want in Python, say "form\tsyntags\tmain-pos":
$ cat /tmp/kom | cg-conv -A | python3 -c 'import streamparser
import sys
for blank, lu in streamparser.parse_file(sys.stdin, withText=True):
  print(blank+lu.wordform,end="\t")
  tags = [tag for reading in lu.readings for sub in reading for tag in sub.tags]
  print([t for t in tags if t.startswith("@")], end="\t")
  print([t for t in tags if t in ["N", "Adv", "Pr", "PreAdv"]], end="\n")
'
карын   ['@HNOUN']      ['N']
and     ['@Conj']       []
so      ['@Thing']      ['Adv', 'PreAdv']
on      ['@Other', '@Meh']      ['Adv', 'Pr']
```
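
If you'd rather keep this as a script than a one-liner, here is a minimal sketch of the same transformation that writes tab-separated output, which both R (read.delim) and pandas (read_csv with sep="\t") can load. The names MAIN_POS, to_row, rows_to_tsv and rows_from_stream are made up for illustration; the only streamparser calls are the ones already used in the session above, and the tag list is just the one from that session.

```python
import csv
import io

# Assumption: the main part-of-speech tags of interest, copied from the session above.
MAIN_POS = {"N", "Adv", "Pr", "PreAdv"}


def to_row(wordform, tags):
    """Build one [form, syntags, main-pos] row from a flat list of tags."""
    syntags = [t for t in tags if t.startswith("@")]
    main_pos = [t for t in tags if t in MAIN_POS]
    return [wordform, " ".join(syntags), " ".join(main_pos)]


def rows_to_tsv(rows):
    """Return the rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["form", "syntags", "main_pos"])
    writer.writerows(rows)
    return buf.getvalue()


def rows_from_stream(stream):
    """Yield one row per lexical unit read from `cg-conv -A` output."""
    # Third-party module, imported here so the helpers above work without it.
    import streamparser
    for blank, lu in streamparser.parse_file(stream, withText=True):
        tags = [tag for reading in lu.readings for sub in reading for tag in sub.tags]
        yield to_row(lu.wordform, tags)
```

To use it, save it under some name (say to_tsv.py, hypothetical), add `import sys` and `sys.stdout.write(rows_to_tsv(rows_from_stream(sys.stdin)))` at the bottom, and run `cat /tmp/kom | cg-conv -A | python3 to_tsv.py`. Joining the tags with spaces keeps each cohort on one line; if you need the readings kept apart, you would emit one row per reading instead.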


I don't know what information, if any, "cg-conv -A" loses, but it does keep
the important stuff, e.g. lemma, wordform, readings, subreadings and even
the blanks/formatting in between cohorts.


best regards,
Kevin Brubeck Unhammer 
