(CC'ing the Constraint Grammar mailing list)

So, from my point of view that's a very simple topic, but unfortunately not in a way that helps you. CG-3 doesn't care about most things in the stream.
As you identified, http://visl.sdu.dk/cg3/chunked/streamformats.html#stream-vislcg is the documentation for the stream format, and that's really it. I've added stream static tags just now, but they don't alter much.

CG-3 mostly does not care what order tags are in or how those tags look. As long as there's a baseform first, the remaining tags are a random bag. Your example

    "кар" Hom1 N Sg Ine @HNOUN #1->0

is to CG-3 the same as

    "кар" @HNOUN #1->0 Sg N Ine Hom1

so what you consider validation is far beyond what I consider validation. Each parsing system has its own tags and tag order, and CG-3 tries to maintain those tags and that order but doesn't really care about them.

This is also why there is no CoNLL-U converter directly in CG-3. CoNLL-U mandates many tag patterns and orders that CG-3 simply doesn't care about or even know about - I can't make a general-purpose converter, because each parsing system wants it differently.

I could quite easily convert to XML or JSON, but I think that would be of limited help. It'd be something like

XML:

    <cohort id="1" parent="0">
      <wordform>...</wordform>
      <static-tags><tag>...</tag><tag>...</tag></static-tags>
      <readings>
        <reading><baseform>...</baseform><tag>...</tag><tag>...</tag><tag>...</tag><mapping>...</mapping></reading>
        <reading><baseform>...</baseform><tag>...</tag><tag>...</tag><mapping>...</mapping></reading>
      </readings>
    </cohort>

JSON:

    {
      "wordform": "...",
      "static_tags": ["...", "..."],
      "readings": [
        {"baseform": "...", "tags": ["...", "...", "..."], "mapping": "..."},
        {"baseform": "...", "tags": ["...", "..."], "mapping": "..."}
      ]
    }

(With everything abbreviated to not waste bytes, and non-cohort input put somewhere in CDATA or a string.)

CG-3 knows what a baseform is and what mapping tags are, but which tags are POS or secondary or semantic and so on is simply not part of the model. It could be something people write into their grammars, but even that is messy.
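To make the point about tag order concrete, here is a minimal Python sketch of a reader that produces roughly the JSON shape outlined above. It is not part of CG-3: the names `parse_stream` and `same_reading` and the output keys are made up for illustration. It also rests on several simplifying assumptions: cohort lines start with `"<`, reading lines start with whitespace followed by a quoted baseform, mapping tags are recognised by the `@` prefix, dependencies look like `#1->0`, and tags never contain spaces. Sub-readings, deleted readings and non-cohort text are simply ignored.

```python
import json
import re

# Assumed (simplified) line shapes of the default VISL CG-3 stream:
#   "<wordform>" optional static tags
#   <whitespace>"baseform" tag tag ...
COHORT_RE = re.compile(r'^"<(.*?)>"(.*)$')
READING_RE = re.compile(r'^\s+"(.*?)"(.*)$')
DEP_RE = re.compile(r'^#(\d+)->(\d+)$')


def parse_stream(text):
    """Parse cohorts into plain dicts; every other line is ignored."""
    cohorts = []
    for line in text.splitlines():
        cohort = COHORT_RE.match(line)
        if cohort:
            cohorts.append({
                "wordform": cohort.group(1),
                "static_tags": cohort.group(2).split(),
                "readings": [],
            })
            continue
        reading = READING_RE.match(line)
        if reading and cohorts:
            entry = {"baseform": reading.group(1), "tags": [], "mapping": []}
            for tag in reading.group(2).split():
                dep = DEP_RE.match(tag)
                if dep:
                    # Dependency relation, stored on the cohort as in the XML sketch.
                    cohorts[-1]["id"] = int(dep.group(1))
                    cohorts[-1]["parent"] = int(dep.group(2))
                elif tag.startswith("@"):
                    # Assumption: "@" marks a mapping tag.
                    entry["mapping"].append(tag)
                else:
                    entry["tags"].append(tag)
            cohorts[-1]["readings"].append(entry)
    return cohorts


def same_reading(a, b):
    """Compare two readings while ignoring the order of their tags."""
    return (a["baseform"] == b["baseform"]
            and set(a["tags"]) == set(b["tags"])
            and set(a["mapping"]) == set(b["mapping"]))


first = parse_stream('"<карын>"\n\t"кар" Hom1 N Sg Ine @HNOUN #1->0\n')
second = parse_stream('"<карын>"\n\t"кар" @HNOUN #1->0 Sg N Ine Hom1\n')

print(json.dumps(first, ensure_ascii=False, indent=2))
print(same_reading(first[0]["readings"][0], second[0]["readings"][0]))
```

The last line prints `True` for the two orderings of the Komi example. Deciding which of the tags is the POS tag would still have to come from the parsing system's own tag inventory, since nothing in the stream marks it.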
So in conclusion, from my point of view, stream validators need to be part of the parsing system they work in, because CG-3 is mostly agnostic. I'm happy to be proven wrong, if someone can come up with a clean way to make it work in general.

-- Tino Didriksen

On 19 January 2018 at 10:55, Niko Partanen <[email protected]> wrote:
> Hi Tino,
>
> I was asking about this last week at the IWCLUL 2018 conference, and was
> advised to contact you. I have added Trond, Francis and Tommi in cc, as I
> was discussing this with them.
>
> My question was whether there is any obvious, well-documented way to parse
> vislcg3 output, or to validate that everything is in order with it. I found
> some documentation of the format here:
>
> http://visl.sdu.dk/cg3/chunked/streamformats.html#stream-vislcg
>
> I see there are various output formats with cg-conv, but some of those
> give warnings about information being lost. So I assume parsing the
> default output is the best way to go. The project I'm involved with is
> currently working with CG rules for Komi-Zyrian, so we would be interested
> in analysing the output, and how it changes, a bit better.
>
> I've often been using Francis's ud-scripts to convert output into CoNLL-U,
> which works fine, but this also demands that the output is disambiguated.
>
> https://github.com/ftyers/ud-scripts
>
> On the other hand, the format seems simple, and it is clear that parsing it
> with any programming language is not that hard. Everyone says they have just
> come up with some methods of their own, but then there are quite a few
> corner cases in the way the output varies, so reinventing how to parse this
> format again seems a bit unnecessary. I would normally work further with
> the results in R and Python, so getting the output without information loss
> into either of these would do. Also, having the output in XML or JSON could
> be an easy way to get onward from there.
>
> Just to give an illustration of a random problem, at times additional tags
> get marked like this in Komi-Zyrian output:
>
> "<карын>"
>         "кар" Hom1 N Sg Ine @HNOUN #1->0
>
> Now the homonymy tag comes first, so a script that assumes the POS tag to
> be first will fail (e.g. in ud-annotatrix). Of course the problem may
> be in the Komi analyser, and the output should not look like this to start
> with, so maybe there are some tools to validate that the output is not
> breaking any rules?
>
> I was told you might be able to help or advise with this issue. All
> help is very much appreciated! The Komi disambiguation is working pretty
> nicely now, so I would be very interested to work further with the results.
>
> Best wishes,
>
> Niko Partanen
