Hi Tino,

Thank you for the clarifying reply and for forwarding it to the list! I
was also thinking that the tags are probably a bit unrelated to my actual
question about the format itself, and I see your point about validation
well.

In my opinion XML output could be a welcome addition, since everyone in
the research group I work with is already familiar with it. Of course, if
there is no wider need for this, then I guess I'll just deal with the
output as it is now. I had also assumed that there is a bit more structure
in the tags from CG-3's point of view than there actually is, so this also
changes where I have to look to solve my problem.

I'm also happy to hear other opinions about this!

Best wishes,
Niko

On Mon, Jan 22, 2018 at 12:34 PM, Tino Didriksen <[email protected]> wrote:

> (CC'ing the Constraint Grammar mailing list)
>
> So, from my point of view that's a very simple topic, but unfortunately
> not in a way that helps you. CG-3 doesn't care about most things in the
> stream.
>
> As you identified,
> http://visl.sdu.dk/cg3/chunked/streamformats.html#stream-vislcg is the
> documentation for the stream format, and that's really it. I've added
> stream static tags just now, but they don't alter much.
>
> CG-3 mostly does not care what order tags are in or how those tags look.
> As long as there's a baseform first, the remaining tags are a random bag.
> Your example "кар" Hom1 N Sg Ine @HNOUN #1->0 is to CG-3 the same
> as "кар" @HNOUN #1->0 Sg N Ine Hom1, so what you consider validation is
> far beyond what I consider validation. Each parsing system has its own
> tags and tag order, and CG-3 tries to maintain those tags and order but
> doesn't really care about them.
>
> This is also why there is no CoNLL-U converter directly in CG-3. CoNLL-U
> mandates many tag patterns and orders that CG-3 simply doesn't care about
> or even know about - I can't make a general-purpose converter, because
> each parsing system wants it differently.
>
> I could quite easily convert to XML or JSON, but how much that would help
> is, I think, limited. It'd be something like
>
> XML:
> <cohort id="1" parent="0">
>   <wordform>...</wordform>
>   <static-tags><tag>...</tag><tag>...</tag></static-tags>
>   <readings>
>     <reading><baseform>...</baseform><tag>...</tag><tag>...</tag><tag>...</tag><mapping>...</mapping></reading>
>     <reading><baseform>...</baseform><tag>...</tag><tag>...</tag><mapping>...</mapping></reading>
>   </readings>
> </cohort>
>
> JSON:
> {
>   "wordform": "...",
>   "static_tags": ["...", "..."],
>   "readings": [
>     {"baseform": "...", "tags": ["...", "...", "..."], "mapping": "..."},
>     {"baseform": "...", "tags": ["...", "..."], "mapping": "..."}
>   ]
> }
>
> (With everything abbreviated to not waste bytes, and non-cohort input put
> somewhere in CDATA or a string.)
>
> CG-3 knows what a baseform is and what mapping tags are, but which tags
> are POS or secondary or semantic and so on is simply not part of the
> model. It could be something people write into their grammars, but even
> that is messy.
>
> So in conclusion, from my point of view, stream validators need to be
> part of the parsing system they work in, because CG-3 is mostly agnostic.
> I'm happy to be proven wrong, if someone can come up with a clean way to
> make it work in general.
>
> -- Tino Didriksen
>
>
> On 19 January 2018 at 10:55, Niko Partanen <[email protected]> wrote:
>
>> Hi Tino,
>>
>> I was asking about this last week at the IWCLUL 2018 conference, and was
>> advised to contact you. I have added Trond, Francis and Tommi in CC, as
>> I was discussing this with them.
>>
>> My question was whether there is any obvious, well-documented way to
>> parse vislcg3 output, or to validate that everything is in order with
>> it. I found some documentation of the format here:
>>
>> http://visl.sdu.dk/cg3/chunked/streamformats.html#stream-vislcg
>>
>> I see there are various output formats with cg-conv, but some of those
>> give warnings about information being lost. So I assume parsing the
>> default output is the best way to go. The project I'm involved with is
>> currently working with CG rules for Komi-Zyrian, so we would be
>> interested in analysing the output, and how it changes, a bit better.
>>
>> I've often been using Francis's ud-scripts to convert the output into
>> CoNLL-U, which works fine, but this also demands that the output is
>> disambiguated.
>>
>> https://github.com/ftyers/ud-scripts
>>
>> On the other hand, the format seems simple, and it is clear that parsing
>> it with any programming language is not that hard. Everyone says they
>> have just come up with their own methods, but there are quite a few
>> corner cases in the way the output varies, so reinventing how to parse
>> this format yet again seems a bit unnecessary. I would normally work
>> further with the results in R and Python, so getting the output into
>> either of these without information loss would do. Having the output in
>> XML or JSON could also be an easy way to get onward from there.
>>
>> Just to give an illustration of a random problem, at times additional
>> tags get marked like this in Komi-Zyrian output:
>>
>> "<карын>"
>>     "кар" Hom1 N Sg Ine @HNOUN #1->0
>>
>> Now the homonymy tag comes first, so a script that assumes the POS tag
>> to be first will fail (e.g. in ud-annotatrix). Of course the problem may
>> be in the Komi analyser, and the output should not look like this to
>> start with, so maybe there are some tools to validate that the output is
>> not breaking any rules?
>>
>> I was told you might be able to help or advise with this issue. All
>> help is very much appreciated! The Komi disambiguation is working pretty
>> nicely now, so I would be very interested to work further with the
>> results.
>>
>> Best wishes,
>>
>> Niko Partanen
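
The points made in this thread (a baseform first, then an unordered bag of tags, with only mapping tags and the dependency relation singled out) can be illustrated with a small Python sketch. This is not a CG-3 tool and not anyone's actual method from the thread. The names parse_stream and same_reading are hypothetical, the regular expressions only cover the simple cohort/reading layout quoted above, mapping tags are assumed to start with "@" as in the example, and dependency tags are assumed to look like "#1->0". Sub-readings, removed readings and tags containing spaces are deliberately not handled.

```python
import json
import re

# Assumed line shapes, based only on the example quoted in the thread.
WORDFORM_RE = re.compile(r'^"<(.*?)>"\s*(.*)$')        # "<wordform>" static tags
READING_RE = re.compile(r'^\s+"(.+?)"(?:\s+(.*))?$')    # indented "baseform" tags
DEP_RE = re.compile(r"^#(\d+)->(\d+)$")                 # assumed form: #self->parent


def parse_stream(text):
    """Parse cohorts into dicts shaped like the JSON outline in the thread."""
    cohorts = []
    for line in text.splitlines():
        wordform = WORDFORM_RE.match(line)
        if wordform:
            cohorts.append({
                "wordform": wordform.group(1),
                "static_tags": wordform.group(2).split(),
                "readings": [],
            })
            continue
        reading = READING_RE.match(line)
        if reading and cohorts:
            entry = {"baseform": reading.group(1), "tags": [], "mapping": []}
            for tag in (reading.group(2) or "").split():
                dep = DEP_RE.match(tag)
                if dep:
                    # Stored on the cohort, following the XML outline's id/parent.
                    cohorts[-1]["id"] = int(dep.group(1))
                    cohorts[-1]["parent"] = int(dep.group(2))
                elif tag.startswith("@"):
                    entry["mapping"].append(tag)
                else:
                    entry["tags"].append(tag)
            cohorts[-1]["readings"].append(entry)
        # Any other line (text between cohorts) is ignored in this sketch.
    return cohorts


def same_reading(a, b):
    """Compare two readings while ignoring tag order."""
    return (
        a["baseform"] == b["baseform"]
        and set(a["tags"]) == set(b["tags"])
        and set(a["mapping"]) == set(b["mapping"])
    )


SAMPLE = '"<карын>"\n\t"кар" Hom1 N Sg Ine @HNOUN #1->0\n'
REORDERED = '"<карын>"\n\t"кар" @HNOUN #1->0 Sg N Ine Hom1\n'

if __name__ == "__main__":
    print(json.dumps(parse_stream(SAMPLE), ensure_ascii=False, indent=2))
```

Comparing the first reading of parse_stream(SAMPLE) with that of parse_stream(REORDERED) through same_reading treats the two orderings as equivalent, which mirrors the "random bag" view of tags. Anything stricter, such as requiring a POS tag in a fixed position, would have to come from the conventions of the particular parsing system.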
