[Apertium-stuff] Fwd: "Interface for creating tagged corpora" GSOC 13

Gema Ramírez-Sánchez Mon, 15 Apr 2013 03:58:48 -0700

Sorry, reply to all...

---------- Forwarded message ----------
From: Gema Ramírez-Sánchez <[email protected]>
Date: Mon, Apr 15, 2013 at 12:56 PM
Subject: Re: [Apertium-stuff] "Interface for creating tagged corpora" GSOC
13
To: oscar ramirez <[email protected]>



Thank you Oscar,

my thoughts:

- interface for uploading a corpus (*Input corpus file UI*):

     - it should have an option for:

            ** uploading a tagged corpus
            ** uploading a non-tagged corpus and tagging it, provided that a
PoS tagger for that language exists
            ** compiling and tagging a corpus for a given language from
Wikipedia, as explained in
http://wiki.apertium.org/wiki/Tagger_training#Creating_a_corpus


- interface for the *supervised corpus tagger UI*: I like the first version;
the second one can be very good for developers and people used to Apertium,
but a little bit scary for newcomers. Another possibility could be
something like:

----------------------------------------------------------
Cierran el orfanato en el que se inspiró Lennon
para/O1-para<pr>/O2-parar<vblex><pri><p3><sg>/O3-parar<vblex><imp><p2><sg>
escribir Strawberry Fields.

NEXT AMBIGUOUS WORD  [CTRL+N]
NEXT WORD [→]
PREVIOUS AMBIGUOUS WORD  [CTRL+P]
PREVIOUS WORD [←]
--------------------------------------------------------

    where

       ** you maintain readability, but you can also examine non-ambiguous
words (so that you see the context) and previously disambiguated words
(just in case you want to check).

       ** you can use the keyboard or the mouse (remember that this will be
repetitive work, so let the user decide)
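
The navigation above can be sketched in Python. This is only a minimal
sketch: it assumes the tagger input is in the Apertium stream format
(^surface/analysis1/analysis2$), the function names are hypothetical, and
the analyses in SAMPLE other than those of "para" are illustrative.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LU_RE = re.compile(r"\^([^/$]*)((?:/[^/$]*)+)\$")


def parse_units(text):
    """Return a list of (surface, [analyses]) tuples found in text."""
    units = []
    for match in LU_RE.finditer(text):
        surface = match.group(1)
        analyses = match.group(2)[1:].split("/")  # drop the leading slash
        units.append((surface, analyses))
    return units


def is_ambiguous(unit):
    """A unit is ambiguous when it has more than one analysis."""
    return len(unit[1]) > 1


def next_ambiguous(units, current):
    """Index of the next ambiguous unit after `current`, or None (CTRL+N)."""
    for i in range(current + 1, len(units)):
        if is_ambiguous(units[i]):
            return i
    return None


def previous_ambiguous(units, current):
    """Index of the previous ambiguous unit before `current`, or None (CTRL+P)."""
    for i in range(current - 1, -1, -1):
        if is_ambiguous(units[i]):
            return i
    return None


def render(units, current):
    """Show the numbered options inline only for the unit under the cursor."""
    out = []
    for i, (surface, analyses) in enumerate(units):
        if i == current and len(analyses) > 1:
            options = "/".join(
                "O%d-%s" % (n, a) for n, a in enumerate(analyses, 1)
            )
            out.append("%s/%s" % (surface, options))
        else:
            out.append(surface)
    return " ".join(out)


# Illustrative input; only the analyses of "para" come from the example above.
SAMPLE = (
    "^inspiró/inspirar<vblex><ifi><p3><sg>$ "
    "^para/para<pr>/parar<vblex><pri><p3><sg>/parar<vblex><imp><p2><sg>$ "
    "^escribir/escribir<vblex><inf>$"
)
```

Keeping the cursor as a plain index means the arrow keys can simply add or
subtract one, while CTRL+N and CTRL+P call the two search functions.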

- I see your effort in figuring out how to tackle the "*user-friendly
interface to train a supervised tagger*" task (we still have to write the
HOWTO in the wiki). The training process is quite similar to the
unsupervised training that you have already performed, so the interface
should enable uploading the necessary files or creating them (see the
required files at
http://manpages.ubuntu.com/manpages/hardy/man1/apertium-tagger.1.html),
launching the training for a given language, and generating a .prob file.
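
As a rough sketch of what the interface could do behind the "train" button:
check that the uploaded files exist and build the command line. The argument
order below (iterations, DIC, CRP, TSX, PROB, HTAG, UNTAG after
--supervised) is an assumption based on the man page linked above and should
be checked against the installed version; the dictionary keys and function
names are hypothetical.

```python
import os

# Hypothetical keys for the files the user has to upload or create.
REQUIRED = ("dic", "crp", "tsx", "tagged", "untagged")


def missing_files(paths):
    """Return the keys in REQUIRED whose file is absent or not provided."""
    return [key for key in REQUIRED if not os.path.isfile(paths.get(key, ""))]


def supervised_command(paths, prob_out, iterations=0):
    """Build the argument list for a supervised training run.

    Assumed synopsis (verify against your apertium-tagger man page):
    apertium-tagger --supervised N DIC CRP TSX PROB HTAG UNTAG
    """
    return [
        "apertium-tagger", "--supervised", str(iterations),
        paths["dic"], paths["crp"], paths["tsx"], prob_out,
        paths["tagged"], paths["untagged"],
    ]
```

The interface would first show missing_files(paths) to the user and, once
that list is empty, pass the result of supervised_command to
subprocess.run.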

- the *TSX management file UI*: it can be a good approach to viewing and
editing a TSX file. We will have to think about how to address enforce,
forbid and prefer rules (look at . BTW, these rules are applied to the
.prob once it has been created; we have to remember this as an option for
the training interface).

I think that, for the previous 2 tasks, it might be useful to take a look at
sections 3.2 "Part-of-speech tagger" and 5.4 "Adding data for the lexical
categorial disambiguator (part-of-speech tagger)" of the Apertium
documentation: http://xixona.dlsi.ua.es/~fran/apertium2-documentation.pdf
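
For the TSX view (label name, closed or not, items), a minimal Python sketch
of reading the file could look like this. It assumes the usual TSX layout
with def-label elements carrying name and closed attributes and tags-item
children; SAMPLE_TSX is a made-up fragment, list_labels is a hypothetical
name, and enforce, forbid and prefer rules are not handled here.

```python
import xml.etree.ElementTree as ET

# Made-up TSX fragment, for illustration only.
SAMPLE_TSX = """<tagger name="example">
  <tagset>
    <def-label name="PREP" closed="true">
      <tags-item tags="pr"/>
    </def-label>
    <def-label name="VLEXPRI">
      <tags-item tags="vblex.pri.*"/>
    </def-label>
  </tagset>
</tagger>
"""


def list_labels(tsx_text):
    """Return one (name, is_closed, [item tags]) row per def-label element."""
    root = ET.fromstring(tsx_text)
    rows = []
    for label in root.iter("def-label"):
        items = [item.get("tags") for item in label.findall("tags-item")]
        rows.append((label.get("name"), label.get("closed") == "true", items))
    return rows
```

Each row maps directly onto one line of the table in the mockup, with the
second element feeding the "closed" column.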

- *Constraint grammar rule manager UI*: I'm not very familiar with this.
Maybe Fran or someone else could help.

- For the last task, *A way to take into account automatically new
multiwords / different tokenisation*, let me try to explain the problem:
imagine that we had two entries in our dictionaries, one for 'strawberry'
and one for 'field', and a hand-tagged corpus containing "Strawberry
Fields". The tagged version would contain two separately tagged words:
strawberry<n><sg> and field<n><pl>. Then someone decides to add 'Strawberry
Fields' to the dictionary as a multiword unit representing a proper
noun. When training a supervised tagger, the training process will tell us:
dictionary output ('Strawberry Fields') and corpus ('Strawberry' and
'Fields') differ, please correct this inconsistency. So we have to go and
modify the tagged corpus to contain this multiword unit so that the
training process can go on.

That said, I don't quite see how we could automate the addition of
multiwords...
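
One partial step might be to at least flag the places where a single
dictionary token corresponds to several consecutive corpus tokens, so the
interface can propose the merge and the user confirms it. The sketch below
works on surface forms only, does not decide the tags of the merged unit,
and uses a hypothetical function name.

```python
def suggest_merges(corpus_tokens, dictionary_tokens):
    """Find corpus spans that the dictionary now analyses as one multiword.

    Returns a list of (start, end, multiword) where
    corpus_tokens[start:end] joined by spaces equals one dictionary token.
    Stops at the first disagreement that is not a simple merge, leaving it
    for manual review.
    """
    suggestions = []
    i = 0  # position in corpus_tokens
    for token in dictionary_tokens:
        if i < len(corpus_tokens) and corpus_tokens[i] == token:
            i += 1
            continue
        parts = token.split(" ")
        if corpus_tokens[i:i + len(parts)] == parts:
            suggestions.append((i, i + len(parts), token))
            i += len(parts)
        else:
            break  # a real inconsistency, not just a new multiword
    return suggestions
```

For the example above, the corpus tokens "Strawberry" and "Fields" would be
reported as one span to merge, and the user would still have to pick the
proper-noun analysis by hand.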

- About the schedule: I would redo the schedule, as I think there is a
more logical path in which each task helps you move on to the next one: UI
to train a supervised tagger (to have the whole picture), UI to manually
disambiguate tagged corpora, UIs for TSX and constraint grammar management,
and UI to test .prob performance.


Again, thank you for the effort!

Gema.











On Mon, Apr 15, 2013 at 12:35 AM, oscar ramirez <[email protected]> wrote:

> Hello,
>
> My application for the idea "Interface for creating tagged corpora" for
> GSoC 2013 is ready. I have written a wiki page with all the interfaces I
> have already implemented (as mockups), the timetable for the GSoC period
> and the time before it, and my bio:
>
>
> http://wiki.apertium.org/wiki/User:Tuxskar/Application_for_%22Interface_for_creating_tagged_corpora%22_GSOC_2013#On_the_GSOC_period
>
> Gema, Mikel and Fran, I think we can discuss it whenever you can :D
>
>
> On Sat, Apr 13, 2013 at 2:37 AM, oscar ramirez <[email protected]> wrote:
>
>> Hello,
>>
>> As Tino said, the problem associated with PCRE is actually a problem with
>> the regexp-trules.txt file. I thought the problem might be caused by my
>> using the en-es language pair, so today I tried to follow the guide
>> again, this time with the es-ca pair, and I got the same error.
>>
>> It seems to be a problem with the command:
>> $ apertium-xtract-regex-trules trules.xml > regexp-trules.txt
>>
>> The guide I'm following is this one:
>> http://wiki.apertium.org/wiki/Target-language_tagger_training
>>
>> After running this command, things seem to work only if you escape the #
>> character and use just the t1x file. I don't know whether escaping that
>> character manually is normal or whether I'm doing something wrong.
>>
>> Gema, about the mockups: I have been talking with Fran about some doubts
>> I had, and I have early versions of the input file UI, the corpus tagger
>> UI, the .prob performance measure UI and the TSX file manager UI.
>>
>> Instead of making them with a mockup tool, I have implemented dummy
>> interfaces with Glade and launched them with Python, because I think it
>> is better to have them as they will look in the end.
>>
>> The first one is really simple; it is the *Input corpus file UI*:
>>
>> http://imgur.com/PiZgmwa
>>
>> How it works:
>>
>>    - You can select a corpus file
>>    - You can modify the current corpus (or enter a new corpus from
>>    scratch)
>>    - Once you finish editing, you click on the apply button and it
>>    shows the corpus tagger UI
>>
>>
>> Here you can see the 2 versions of the interface for manual supervision,
>> the *corpus tagger UI*:
>>
>>    - First version: you have the corpus and go word by word, choosing
>>    the correct analysis according to its properties:
>>    http://imgur.com/hOXb9WK
>>
>>
>>    - Second version (proposed by Fran): you have just the tagged corpus
>>    to be disambiguated, with the ambiguous words highlighted; you choose
>>    the correct analysis by clicking on them, and when the mouse hovers
>>    over a word you get more info about the preferences:
>>    http://imgur.com/5G2mqOl
>>
>> The difference between them is that the first lets you load a corpus
>> file and modify it in the text view using the buttons, while the second
>> one lets you resolve the ambiguity directly in the text view.
>>
>> Both designs have 2 buttons, one to finish the supervision and the other
>> to launch the .prob performance measure.
>>
>> For the *performance measure* .prob file UI I have this in mind:
>> http://imgur.com/RnSiZ7G
>>
>> How it works:
>>
>>    - First of all, we enter the base tagged corpus in the left text
>>    view (if you come from the previous window, it is already inserted)
>>    - Now you choose your .prob file
>>    - Somehow (maybe by manual selection, or automatically) some part of
>>    the tagged corpus is selected to train the .prob file
>>    - Once everything is set up, you click on the "measure" button to
>>    measure the performance
>>    - When the measurement is done, you get the accuracy and the
>>    information about the performance at the bottom left
>>    - After the measurement it also shows the output generated using the
>>    .prob file, so you can see where the system has failed at tagging
>>
>> For the TSX management file UI I have the following idea (this is still
>> under discussion with Fran, because it may not be as useful as we
>> expected):
>>
>> http://imgur.com/WZFEZLP
>>
>> How it works:
>>
>>    - First you choose a TSX file
>>    - It shows all the tags in it, with the name and every item
>>    - The second column says whether it is closed or not
>>    - You can also add new labels and items (for items, you first have to
>>    select the parent label)
>>    - Once you finish editing, you click on the save button
>>
>> I still have to refine the constraint grammar interface with Fran, as
>> well as how to manage multiwords. I expect to have them before Monday,
>> but I think we can start discussing these interfaces now and continue
>> with the others when I have them.
>>
>> Regards, and sorry about the length of the email :P
>>
>
>


-- 
Gema Ramírez
---------------------
Prompsit LE
Traduce, extrae, analiza: http://aplica.prompsit.com



