[cg] Re: Recommended way to parse vislcg3 output

Niko Partanen Mon, 22 Jan 2018 04:22:15 -0800

Hi Tino,

Thank you for the clarifying reply and for forwarding it to the list! I was
also thinking that the tags are probably somewhat unrelated to my actual
question about the format itself, and I see your point about validation. In
my opinion XML output could be a welcome addition, since everyone in the
research group I work with is already familiar with it. Of course, if there
is no wider need for this, I'll just work with the current output. I had
also assumed that the tags have a bit more structure from CG-3's point of
view than they do, so this also changes where I need to look to solve my
problem.


I'm also happy to hear other opinions about this!

Best wishes,

Niko

On Mon, Jan 22, 2018 at 12:34 PM, Tino Didriksen <[email protected]>
wrote:

> (CC'ing the Constraint Grammar mailing list)
>
> So, from my point of view that's a very simple topic, but unfortunately
> not in a way that helps you. CG-3 doesn't care about most things in the
> stream.
>
> As you identified,
> http://visl.sdu.dk/cg3/chunked/streamformats.html#stream-vislcg is the
> documentation for the stream format, and that's really it. I've added
> stream static tags just now, but they don't alter much.
>
> CG-3 mostly does not care what order tags are in or how those tags look.
> As long as there's a baseform first, the remaining tags are a random bag.
> Your example "кар" Hom1 N Sg Ine @HNOUN #1->0 is to CG-3 the same
> as "кар" @HNOUN #1->0 Sg N Ine Hom1, so what you consider validation is far
> beyond what I consider validation. Each parsing system has its own
> tags and tag order, and CG-3 tries to maintain those tags and order but
> doesn't really care about them.
>
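To make the "random bag" point concrete, here is a minimal Python sketch that splits a reading line into its baseform and tags and then compares two readings without regard to tag order. The names `parse_reading` and `same_reading` are hypothetical helpers, not part of CG-3, and the sketch assumes a baseform without embedded double quotes and tags without embedded whitespace.

```python
import re

# Hypothetical helper, not part of CG-3. Assumes the baseform contains no
# double quotes and that tags are separated by whitespace.
READING_RE = re.compile(r'^\s*"(?P<baseform>[^"]*)"\s*(?P<tags>.*)$')


def parse_reading(line):
    """Split one reading line into (baseform, list of tags)."""
    match = READING_RE.match(line)
    if match is None:
        raise ValueError(f"not a reading line: {line!r}")
    return match.group("baseform"), match.group("tags").split()


def same_reading(a, b):
    """Compare two readings, ignoring the order of the tags."""
    base_a, tags_a = parse_reading(a)
    base_b, tags_b = parse_reading(b)
    return base_a == base_b and sorted(tags_a) == sorted(tags_b)


first = '\t"кар" Hom1 N Sg Ine @HNOUN #1->0'
second = '\t"кар" @HNOUN #1->0 Sg N Ine Hom1'
print(same_reading(first, second))  # True
```

Sorting the tag lists rather than converting them to sets keeps repeated tags distinguishable, which seemed the safer choice for a comparison helper.
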
> This is also why there is no CoNLL-U converter directly in CG-3. CoNLL-U
> mandates many tag patterns and orders that CG-3 simply doesn't care about
> or even know about - I can't make a general-purpose converter, because
> each parsing system wants it differently.
>
> I could quite easily convert to XML or JSON, but I think how much that
> would help is limited. It'd be something like
>
> XML:
> <cohort id="1" parent="0">
> <wordform>...</wordform>
> <static-tags><tag>...</tag><tag>...</tag></static-tags>
> <readings>
> <reading><baseform>...</baseform><tag>...</tag><tag>...</tag><tag>...</tag><mapping>...</mapping></reading>
> <reading><baseform>...</baseform><tag>...</tag><tag>...</tag><mapping>...</mapping></reading>
> </readings>
> </cohort>
>
> JSON:
> {
>   "wordform": "...",
>   "static_tags": ["...", "..."],
>   "readings": [
>     {"baseform": "...", "tags": ["...", "...", "..."], "mapping": "..."},
>     {"baseform": "...", "tags": ["...", "..."], "mapping": "..."}
>   ]
> }
>
> (With everything abbreviated to not waste bytes, and non-cohort input put
> somewhere in CDATA or a string.)
>
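As a rough illustration of the JSON shape sketched above, the following Python sketch reads the default stream text and builds one dictionary per cohort. It is not a CG-3 feature. The function name `parse_stream` is hypothetical, and several things are assumptions: that wordform lines start with `"<`, that reading lines are indented, that anything after the wordform on its line is a static tag, and that mapping tags start with `@`. Mappings are collected into a list rather than a single string, and dependency tags such as `#1->0` are simply left among the ordinary tags.

```python
import json
import re

# Assumed line shapes: a wordform line such as "<word>" and an indented
# reading line such as <TAB>"baseform" Tag Tag @MAPPING.
WORDFORM_RE = re.compile(r'^"<(?P<wordform>.*)>"(?P<static>.*)$')
READING_RE = re.compile(r'^\s+"(?P<baseform>[^"]*)"(?P<tags>.*)$')


def parse_stream(text):
    """Turn stream text into a list of cohort dictionaries."""
    cohorts = []
    for line in text.splitlines():
        wordform = WORDFORM_RE.match(line)
        reading = READING_RE.match(line)
        if wordform:
            cohorts.append({
                "wordform": wordform.group("wordform"),
                "static_tags": wordform.group("static").split(),
                "readings": [],
            })
        elif reading and cohorts:
            tags = reading.group("tags").split()
            cohorts[-1]["readings"].append({
                "baseform": reading.group("baseform"),
                # Assumption: mapping tags are the ones starting with "@".
                "tags": [t for t in tags if not t.startswith("@")],
                "mapping": [t for t in tags if t.startswith("@")],
            })
        # Blank lines and other non-cohort text are skipped in this sketch.
    return cohorts


sample = '"<карын>"\n\t"кар" Hom1 N Sg Ine @HNOUN #1->0\n'
result = parse_stream(sample)
print(json.dumps(result, ensure_ascii=False, indent=2))
```

Subreadings, deleted readings and quoted tags containing spaces are not handled here, so this is only a starting point for the corner cases mentioned in the thread.
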
> CG-3 knows what a baseform is and what mapping tags are, but which tags
> are POS or secondary or semantic and so on is simply not part of the model.
> It could be something people write into their grammars, but even that is
> messy.
>
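Since CG-3 does not know which tag is the POS tag, one option for a downstream script is to look the POS tag up by membership in a known inventory instead of by position. The sketch below shows that idea in Python. The inventory `POS_TAGS` and the function `find_pos` are hypothetical, and a real project would list the tags its own analyser actually emits.

```python
# Hypothetical POS inventory, for illustration only.
POS_TAGS = {"N", "V", "A", "Adv", "Pron", "Num", "Po", "CC", "CS", "Pcle"}


def find_pos(tags, pos_tags=POS_TAGS):
    """Return the first tag found in the POS inventory, or None."""
    for tag in tags:
        if tag in pos_tags:
            return tag
    return None


tags = ["Hom1", "N", "Sg", "Ine", "@HNOUN", "#1->0"]
print(find_pos(tags))  # N
```

With this approach a leading tag such as `Hom1` no longer matters, at the cost of having to maintain the inventory alongside the analyser.
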
> So in conclusion, from my point of view, stream validators need to be part
> of the parsing system they work in, because CG-3 is mostly agnostic. I'm
> happy to be proven wrong, if someone can come up with a clean way to make
> it work in general.
>
> -- Tino Didriksen
>
>
> On 19 January 2018 at 10:55, Niko Partanen <[email protected]>
> wrote:
>
>> Hi Tino,
>>
>> I was asking about this last week at the IWCLUL 2018 conference, and was
>> advised to contact you. I'm adding Trond, Francis and Tommi in cc, as I
>> was discussing this with them.
>>
>> My question was whether there is any obvious, well-documented way to parse
>> vislcg3 output, or validate that everything is in order with it. I found
>> some documentation of the format here:
>>
>> http://visl.sdu.dk/cg3/chunked/streamformats.html#stream-vislcg
>>
>> I see there are various output formats with cg-conv, but some of those
>> give warnings about information being lost. So I assume parsing the
>> default output is the best option. The project I'm involved with is
>> currently working on CG rules for Komi-Zyrian, so we would be interested
>> in analysing the output, and how it changes, a bit more closely.
>>
>> I've often been using Francis's ud-scripts to convert output into
>> CoNLL-U, which works fine, but this also requires that the output is
>> disambiguated.
>>
>> https://github.com/ftyers/ud-scripts
>>
>> On the other hand, the format seems simple, and it is clear that parsing
>> it with any programming language is not that hard. Everyone says they have
>> just come up with their own methods, but there are quite a few corner
>> cases in the way the output varies, so reinventing how to parse this
>> format yet again seems a bit unnecessary. I would normally work further
>> with the results in R and Python, so getting the output into either of
>> these without information loss would do. Having the output in XML or JSON
>> could also be an easy way to move on from there.
>>
>> Just to give an illustration of a random problem, at times additional
>> tags get marked like this in Komi-Zyrian output:
>>
>> "<карын>"
>> "кар" Hom1 N Sg Ine @HNOUN #1->0
>>
>> Now the homonymy tag comes first, so a script that assumes the POS tag to
>> be first will fail (e.g. in ud-annotatrix). Of course the problem may
>> be in the Komi analyser, and the output should not look like this to start
>> with, so maybe there are some tools to validate that the output is not
>> breaking any rules?
>>
>> I was told you might be able to help or advise with this issue. All
>> help is very much appreciated! The Komi disambiguation is working pretty
>> nicely now, so I would be very interested to work further with the results.
>>
>> Best wishes,
>>
>> Niko Partanen
>>
>
>

