[cg] Re: Recommended way to parse vislcg3 output

Tino Didriksen Mon, 22 Jan 2018 03:35:55 -0800

(CC'ing the Constraint Grammar mailing list)

So, from my point of view that's a very simple topic, but unfortunately not
in a way that helps you. CG-3 doesn't care about most things in the stream.

As you identified,
http://visl.sdu.dk/cg3/chunked/streamformats.html#stream-vislcg is the
documentation for the stream format, and that's really it. I've added
stream static tags just now, but they don't alter much.

CG-3 mostly does not care what order tags are in or how those tags look. As
long as there's a baseform first, the remaining tags are a random bag. Your
example "кар" Hom1 N Sg Ine @HNOUN #1->0 is to CG-3 the same
as "кар" @HNOUN #1->0 Sg N Ine Hom1, so what you consider validation is far
beyond what I consider validation. Each parsing system has its own
tags and tag order, and CG-3 tries to maintain those tags and their order
but doesn't really care about them.
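
To illustrate the "random bag" point, here is a minimal Python sketch, not
anything that ships with CG-3. It assumes a reading line is a quoted baseform
followed by whitespace-separated tags, and that baseforms contain no escaped
quotes. The names TOKEN and parse_reading are made up for this example.

```python
import re

# Assumption: a token is either a double-quoted string or a run of non-spaces.
TOKEN = re.compile(r'"[^"]*"|\S+')

def parse_reading(line):
    """Split one reading line into (baseform, frozenset of tags)."""
    tokens = TOKEN.findall(line.strip())
    if not tokens or not tokens[0].startswith('"'):
        raise ValueError("reading does not start with a baseform: %r" % line)
    baseform = tokens[0][1:-1]  # drop the surrounding quotes
    # A frozenset ignores tag order (and collapses duplicate tags).
    return baseform, frozenset(tokens[1:])

a = parse_reading('\t"кар" Hom1 N Sg Ine @HNOUN #1->0')
b = parse_reading('\t"кар" @HNOUN #1->0 Sg N Ine Hom1')
print(a == b)  # the two orderings compare equal
```

Comparing readings as sets like this is only appropriate if your own system
attaches no meaning to tag order.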

This is also why there is no CoNLL-U converter directly in CG-3. CoNLL-U
mandates many tag patterns and orders that CG-3 simply doesn't care about
or even know about. I can't make a general-purpose converter, because
each parsing system wants it differently.

I could quite easily convert to XML or JSON, but I think that would be of
limited help. It'd be something like

XML:
<cohort id="1" parent="0">
  <wordform>...</wordform>
  <static-tags><tag>...</tag><tag>...</tag></static-tags>
  <readings>
    <reading><baseform>...</baseform><tag>...</tag><tag>...</tag><tag>...</tag><mapping>...</mapping></reading>
    <reading><baseform>...</baseform><tag>...</tag><tag>...</tag><mapping>...</mapping></reading>
  </readings>
</cohort>

JSON:
{
  "wordform": "...",
  "static_tags": ["...", "..."],
  "readings": [
    {"baseform": "...", "tags": ["...", "...", "..."], "mapping": "..."},
    {"baseform": "...", "tags": ["...", "..."], "mapping": "..."}
  ]
}

(With everything abbreviated to not waste bytes, and non-cohort input put
somewhere in CDATA or a string.)
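
As a rough sketch of how the default stream could be turned into the JSON
shape above from the outside, something like the following Python would do.
It is not an existing CG-3 feature. It assumes cohort lines start with "< at
column 0, reading lines are indented and start with a quoted baseform, and
the mapping prefix is the default @. It treats the first @ tag as the mapping
and leaves everything else (including #1->0) in tags. Sub-readings and
non-cohort text are simply skipped here. parse_stream and TOKEN are
hypothetical names.

```python
import json
import re

# Assumption: a token is either a double-quoted string or a run of non-spaces.
TOKEN = re.compile(r'"[^"]*"|\S+')

def parse_stream(lines):
    """Collect cohorts and their readings into JSON-ready dicts."""
    cohorts = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith('"<'):
            # Cohort line: "<wordform>" followed by optional static tags.
            tokens = TOKEN.findall(line)
            cohorts.append({
                "wordform": tokens[0][2:-2],  # strip the "< and >" wrappers
                "static_tags": tokens[1:],
                "readings": [],
            })
        elif cohorts and line[:1] in (" ", "\t") and line.lstrip().startswith('"'):
            # Reading line: quoted baseform followed by tags in any order.
            tokens = TOKEN.findall(line)
            tags = tokens[1:]
            mapping = next((t for t in tags if t.startswith("@")), None)
            cohorts[-1]["readings"].append({
                "baseform": tokens[0][1:-1],
                "tags": [t for t in tags if t != mapping],
                "mapping": mapping,
            })
        # Anything else (plain text, blank lines) is ignored in this sketch.
    return cohorts

sample = ['"<карын>"', '\t"кар" Hom1 N Sg Ine @HNOUN #1->0']
cohorts = parse_stream(sample)
print(json.dumps(cohorts, ensure_ascii=False, indent=2))
```

The tag lists keep their original order, so nothing the upstream analyser
emitted is lost apart from the skipped non-cohort lines.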

CG-3 knows what a baseform is and what mapping tags are, but which tags are
POS or secondary or semantic and so on is simply not part of the model. It
could be something people write into their grammars, but even that is messy.
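
In practice that means any POS lookup has to come from the parsing system,
not from CG-3. A minimal Python sketch of that idea follows. The POS_TAGS
set is invented for illustration and would have to be replaced by whatever
inventory your analyser actually uses. find_pos is a hypothetical helper.

```python
# Hypothetical, system-specific POS inventory; CG-3 itself has no such list.
POS_TAGS = {"N", "V", "A", "Adv", "Pron"}

def find_pos(tags, pos_tags=POS_TAGS):
    """Return the single POS tag in a reading, wherever it appears."""
    found = [t for t in tags if t in pos_tags]
    if len(found) != 1:
        raise ValueError("expected exactly one POS tag, found %r" % found)
    return found[0]

tags = ["Hom1", "N", "Sg", "Ine", "@HNOUN", "#1->0"]
pos = find_pos(tags)
print(pos)
```

Looking the tag up by membership, not by position, means a leading Hom1 no
longer breaks the script.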

So in conclusion, from my point of view, stream validators need to be part
of the parsing system they work in, because CG-3 is mostly agnostic. I'm
happy to be proven wrong, if someone can come up with a clean way to make
it work in general.

-- Tino Didriksen

On 19 January 2018 at 10:55, Niko Partanen wrote:

> Hi Tino,
>
> I was asking about this last week at the IWCLUL 2018 conference, and was
> advised to contact you. I have added Trond, Francis and Tommi in cc, as I
> was discussing this with them.
>
> My question was whether there is any obvious well documented way to parse
> vislcg3 output, or validate that everything is in order with it. I found
> some documentation of the format here:
>
> http://visl.sdu.dk/cg3/chunked/streamformats.html#stream-vislcg
>
> I see there are various output formats with cg-conv, but some of those are
> giving warnings about information being lost. So I assume parsing the
> default output is the best way to go. The project I'm involved with is
> currently working with CG rules for Komi-Zyrian, so we would be interested
> to analyse the output, and how it changes, a bit better.
>
> I've often been using Francis's ud-scripts to convert output into CoNLL-U,
> which works fine, but this also demands that the output is disambiguated.
>
> https://github.com/ftyers/ud-scripts
>
> On the other hand the format seems simple and it is clear parsing it with
> any programming language is not that hard. Everyone says they have just
> come up with some of their own methods, but then there are quite a few
> corner cases in the way the output varies, so reinventing how to parse this
> format again seems a bit unnecessary. I would normally work further with
> the results in R and Python, so getting the output without information loss
> into any of these would do. Also having the output in XML or JSON could be
> an easy way to get onward from there.
>
> Just to give an illustration of a random problem, at times additional tags
> get marked like this in Komi-Zyrian output:
>
> "<карын>"
> "кар" Hom1 N Sg Ine @HNOUN #1->0
>
> Now the homonymy tag comes first, so any script that assumes the POS tag
> to be first will fail (e.g. in ud-annotatrix). Of course the problem may
> be in the Komi analyser and the output should not look like this to start
> with, so maybe there are some tools to validate that the output is not
> breaking any rules?
>
> I was told you might be able to help or advise on this issue. All
> help is very much appreciated! The Komi disambiguation is working pretty
> nicely now, so I would be very interested to work further with the results.
>
> Best wishes,
>
> Niko Partanen
>
