Re: [Apertium-stuff] "Interface for creating tagged corpora" GSOC 13

Gema Ramírez-Sánchez Mon, 22 Apr 2013 06:39:06 -0700

Hi Oscar,

thanks for the changes, see my comments inline:



On Thu, Apr 18, 2013 at 5:37 PM, oscar ramirez <[email protected]> wrote:

> Hello Gema,
>
> Sorry for the delay in replying; I have been busy with an exam at the
> university. I have gone through your thoughts and made some changes to
> the UIs. You can check the new UIs and functionality on the wiki entry
> <http://wiki.apertium.org/wiki/User:Tuxskar/Application_for_%22Interface_for_creating_tagged_corpora%22_GSOC_2013#On_the_GSOC_period>,
> but the main changes are these:
>
>
>    - I have added a new text view to the TSX file UI for directly inserting
>    new categories and forbid and enforce rules into the file; it still
>    shows the contents of the current TSX file
>
>
Good.


>
>    - For inserting the corpus file, I have added the option to select a
>    Wikipedia dump file, a plain-text corpus or an already tagged corpus (maybe
>    we could have just 2 file browsers and look at the file extension). I have
>    also used some Wikipedia dump files to generate a tagged corpus, and we
>    should think about this, because it takes some time to transform a
>    compressed file (tar.gz or bz) into a tagged corpus; maybe we could add a
>    progress bar that runs until the transformation (decompression and corpus
>    creation) is finished
>
It will take time, no problem. If the user knows that, it is OK. It is a useful
feature and we can probably pre-download corpora for Apertium's stable
languages and make them available on the wiki or in every language package.
We just need a 30,000-word corpus, so we will see which is the best
option. The idea is to encourage users to help with disambiguation, so
let's automatise as much as possible every step of the way.
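
For the progress bar, here is a minimal Python sketch of the idea, assuming
the dump is a single .bz2 file (a tar.gz would need the tarfile module
instead). The names decompress_with_progress and on_progress are hypothetical;
in the Glade UI the callback could update a progress bar widget. Progress is
only estimated, from how much of the compressed file has been read so far:

```python
import bz2
import os


def decompress_with_progress(src_path, dst_path, on_progress, chunk_size=1 << 20):
    """Decompress a .bz2 file chunk by chunk, reporting a fraction in [0, 1]."""
    total = os.path.getsize(src_path)
    # Open the raw file ourselves so we can ask how far into the
    # compressed data the decompressor has read.
    with open(src_path, "rb") as raw, \
            bz2.open(raw, "rb") as src, \
            open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            if total:
                on_progress(raw.tell() / total)
    # Always finish with a full bar, even if the last estimate fell short.
    on_progress(1.0)
```

Calling decompress_with_progress("dump.xml.bz2", "dump.xml", print) would just
print the fractions; the corpus creation step could report progress in the
same way.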


>
>    - I have already changed the corpus tagger UI as you suggested,
>    and there is no problem adding the shortcuts :)
>
>
Perfect! Mikel told me about a much more user-friendly representation for
tags to make users feel comfortable with the information; we will have to
think about it during the interface design.
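
Just to illustrate what a friendlier representation could look like, a small
Python sketch. The label table is my own partial guess at readable names and
the function name describe is hypothetical; the real wording would be decided
during the design. Unknown tags are shown unchanged:

```python
import re

# Assumed, partial mapping from tags to readable labels (illustrative only).
TAG_LABELS = {
    "n": "noun",
    "pr": "preposition",
    "vblex": "verb",
    "pri": "present indicative",
    "imp": "imperative",
    "p2": "2nd person",
    "p3": "3rd person",
    "sg": "singular",
    "pl": "plural",
}


def describe(analysis):
    """Turn 'lemma<tag1><tag2>' into 'lemma (label1, label2)'."""
    lemma = analysis.split("<", 1)[0]
    tags = re.findall(r"<([^<>]+)>", analysis)
    # Fall back to the raw tag when we have no label for it.
    labels = [TAG_LABELS.get(tag, tag) for tag in tags]
    return f"{lemma} ({', '.join(labels)})" if labels else lemma


print(describe("parar<vblex><pri><p3><sg>"))
# prints: parar (verb, present indicative, 3rd person, singular)
```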

>
>    - About the multi-words: I thought of selecting the proper word (the
>    multi-word one or the simple one) and then updating the language dictionary,
>    replacing the old word with the new one. But I think there is no easy way
>    to manage the dictionary, because all the dictionaries are really big and
>    it could be really slow to open the file, read the entry and modify it. I'm
>    still thinking about it
>
Yes, that would be difficult, and the situation will often be the opposite:
we will have to modify the tagged text, not the dictionary.
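
As a rough Python sketch of what modifying the tagged text could mean,
assuming the tagged corpus is held in memory as a list of (surface form,
analysis) pairs. That representation, the function name merge_multiword and
the <np> analysis below are only illustrative assumptions, not a decided
design:

```python
def merge_multiword(tokens, parts, merged_analysis):
    """Replace each run of tokens whose surface forms equal `parts`
    with a single multiword token carrying `merged_analysis`."""
    parts = list(parts)
    n = len(parts)
    out = []
    i = 0
    while i < len(tokens):
        window = [surface for surface, _ in tokens[i:i + n]]
        if n and window == parts:
            # Collapse the matching run into one multiword token.
            out.append((" ".join(parts), merged_analysis))
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


corpus = [
    ("Strawberry", "strawberry<n><sg>"),
    ("Fields", "field<n><pl>"),
]
print(merge_multiword(corpus, ["Strawberry", "Fields"], "Strawberry Fields<np>"))
# prints: [('Strawberry Fields', 'Strawberry Fields<np>')]
```

The matching here is exact and case-sensitive; a real tool would probably
need to ask the user to confirm each replacement.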


>
>    - About the schedule: I have changed it, but now it could be harder to
>    have the first deliverable finished enough (it will be programmed but only
>    lightly tested and documented). I have changed the workload a bit and
>    adjusted the other tasks to fit the new schedule :D
>
I think it is OK.


>
>    -
>
> Note: About the document you linked in the email, could you send me the
> diagram that should be on page 70 of the document, or upload it somewhere?
> It seems to have a LaTeX error and to be cut off at the bottom, and I
> think it is very useful :D
>
>
Weird, I cannot see it either and pdflatex is not generating the image. I'm
attaching a png version for you.


> Regards and thank you. Let me know if you find a solution for managing
> multi-words or if you think it is viable to change the dictionaries
>

Well, I'll go for a solution to modify the corpus but I still have to ask
and think about it.


Best,

Gema.



>
>
> On Mon, Apr 15, 2013 at 12:58 PM, Gema Ramírez-Sánchez <[email protected]>
> wrote:
>
>> Sorry, reply to all...
>>
>>
>> ---------- Forwarded message ----------
>> From: Gema Ramírez-Sánchez <[email protected]>
>> Date: Mon, Apr 15, 2013 at 12:56 PM
>> Subject: Re: [Apertium-stuff] "Interface for creating tagged corpora"
>> GSOC 13
>> To: oscar ramirez <[email protected]>
>>
>>
>> Thank you Oscar,
>>
>> my thoughts:
>>
>> - interface for uploading a *Input corpus file UI*:
>>
>>      - it should have an option for:
>>
>>             ** uploading a tagged corpus
>>             ** uploading a non-tagged corpus and tagging it provided that
>> a PoS tagger for that language exists
>>             ** compiling and tagging a corpus for a given language from
>> the wikipedia as explained in
>> http://wiki.apertium.org/wiki/Tagger_training#Creating_a_corpus
>>
>>
>> - interface for *supervised corpus tagger UI*: I like the first
>> version; the second one can be very good for developers and people used to
>> Apertium but a little bit scary for newcomers. Another possibility could be
>> something like:
>>
>> ----------------------------------------------------------
>> Cierran el orfanato en el que se inspiró Lennon *para*/O1-para*<pr>*/
>> O2-parar*<vblex><pri><p3><sg>*/O3-parar*<vblex><imp><p2><sg>*  escribir
>> Strawberry Fields.
>>
>> NEXT AMBIGUOUS WORD  [CTRL+N]
>> NEXT WORD [→]
>> PREVIOUS AMBIGUOUS WORD  [CTRL+P]
>> PREVIOUS WORD [←]
>> --------------------------------------------------------
>>
>>     where
>>
>>        **you maintain readability but you can also examine non-ambiguous
>> words (so that you see the context) and previously disambiguated words (just
>> in case you want to check).
>>
>>        **you can use the keyboard or the mouse (remember that it will be
>> repetitive work, so let the user decide)
>>
>> - I see your effort in figuring out how to tackle the "*user-friendly
>> interface to train a supervised tagger*" task (we still have to write
>> the HOWTO in the wiki). The training process is quite similar to the
>> unsupervised one that you already performed, so the interface should enable
>> uploading the necessary files or creating them (see required files at
>> http://manpages.ubuntu.com/manpages/hardy/man1/apertium-tagger.1.html),
>> launching the training for a given language and generating a .prob file.
>>
>> - the *TSX management file UI*: it can be a good approach to viewing and
>> editing a TSX. We will have to think about how to address enforce, forbid and
>> prefer rules (look at . BTW, these rules are applied to the .prob once it
>> has been created; we have to remember this as an option for the
>> training interface).
>>
>> I think that, for the previous 2 tasks it might be useful to take a look
>> at sections: 3.2 Part-of-speech tagger and 5.4 Adding data for the lexical
>> categorial disambiguator (part-of-speech tagger) of the Apertium
>> documentation: http://xixona.dlsi.ua.es/~fran/apertium2-documentation.pdf
>>
>> - *Constraint grammar rule manager UI*: I'm not very familiar with this.
>> Maybe Fran or someone else could help.
>>
>> - For the last task *A way to take into account automatically new
>> multiwords / different tokenisation *let me try to explain the problem:
>> imagine that we had two entries in our dictionaries, one for 'strawberry' and
>> one for 'field', and a hand-tagged corpus containing "Strawberry Fields".
>> The tagged version would contain two separate tagged words:
>> strawberry<n><sg> and field<n><pl>. Then someone decides to add 'Strawberry
>> Fields' to the dictionary as a multiple-word unit representing a proper
>> noun. When training a supervised tagger, the training process will tell us:
>> dictionary output ('Strawberry Fields') and corpus ('Strawberry' and
>> 'Fields') differ, please correct this inconsistency. So we have to go and
>> modify the tagged corpus to contain this multiple-word unit so that the
>> training process can go on.
>>
>> That said, I don't quite see how we could automatise the addition of
>> multiwords...
>>
>> - About the schedule: I would redo the schedule, as I think that there is
>> a more logical path in which each task helps you move on to the next: UI to
>> train a supervised tagger (to have the whole picture), UI to manually
>> disambiguate tagged corpora, UIs for TSX and constraint grammar management
>> and UI to test .prob performance.
>>
>>
>> Again, thank you for the effort!
>>
>> Gema.
>>
>>
>>
>> On Mon, Apr 15, 2013 at 12:35 AM, oscar ramirez <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> My application for the idea "Interface for creating tagged corpora" for
>>> GSoC 2013 is ready. I have written a wiki page with all the interfaces I
>>> have already implemented (as mockups), the timetable for GSoC and the
>>> period before it, and my bio
>>>
>>>
>>> http://wiki.apertium.org/wiki/User:Tuxskar/Application_for_%22Interface_for_creating_tagged_corpora%22_GSOC_2013#On_the_GSOC_period
>>>
>>> Gema, Mikel and Fran, I think we can discuss it whenever you can :D
>>>
>>>
>>> On Sat, Apr 13, 2013 at 2:37 AM, oscar ramirez <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> As Tino said, the problem associated with PCRE is actually a problem
>>>> with the regexp-trules.txt file. I thought that the problem might be
>>>> caused by my using the en-es language pair, so today I tried to follow
>>>> the guide again, this time using the es-ca pair, and I got the same
>>>> error.
>>>>
>>>> It seems to be a problem with the command:
>>>> $ apertium-xtract-regex-trules trules.xml > regexp-trules.txt
>>>>
>>>> The guide I'm following is this one:
>>>> http://wiki.apertium.org/wiki/Target-language_tagger_training
>>>>
>>>> After running this command, it seems to work only if you escape the #
>>>> character and use just the t1x file. I don't know whether escaping that
>>>> character manually is normal or whether I'm doing something wrong
>>>>
>>>> Gema, about the mockups: I have been talking with Fran about some doubts
>>>> I had, and I have early versions of the input file UI, corpus tagger UI,
>>>> performance measure .prob UI and the TSX files manager UI.
>>>>
>>>> Instead of making them with a mockup tool, I have implemented dummy
>>>> interfaces with Glade, launched with Python, because I think it could be
>>>> better to have them as they will be in the end
>>>>
>>>> The first one is really simple; it is the *Input corpus file UI*:
>>>>
>>>> http://imgur.com/PiZgmwa
>>>>
>>>> How it works:
>>>>
>>>>    - You can select a corpus file
>>>>    - You can modify the current corpus (or directly enter a new corpus
>>>>    from scratch)
>>>>    - Once you finish editing, you click on the apply button and it
>>>>    shows the corpus tagger UI
>>>>
>>>>
>>>> Here you can see the 2 versions of the interface for manual
>>>> supervision, the *corpus tagger UI*:
>>>>
>>>>    - First version: you have the corpus and go word by word, choosing
>>>>    the correct word depending on the properties:
>>>>    http://imgur.com/hOXb9WK
>>>>
>>>>
>>>>    - Second version (proposed by Fran): you have just the tagged
>>>>    corpus to be disambiguated, with the ambiguous words highlighted; you
>>>>    choose the correct one by clicking on it, and when the mouse hovers over
>>>>    a word you get more info about the preferences:
>>>>    http://imgur.com/5G2mqOl
>>>>
>>>> The difference between them is that the first lets you load a
>>>> corpus file and modify it in the text view using the buttons, and the
>>>> second one lets you resolve the ambiguity in the text view
>>>>
>>>> Both designs have 2 buttons, one to finish the supervision and the other
>>>> to launch the performance measure for .prob files
>>>>
>>>> For the *performance measure* .prob file UI I have this in mind:
>>>> http://imgur.com/RnSiZ7G
>>>>
>>>> How it works:
>>>>
>>>>    - First of all, we enter the base tagged corpus in the left text
>>>>    view (if you come from the previous window, it is already inserted)
>>>>    - Now you choose your .prob file
>>>>    - Somehow (maybe by selection or automatically) some part of the
>>>>    tagged corpus is selected to train the .prob file
>>>>    - Once everything is set up, you click on the "measure" button to
>>>>    measure the performance
>>>>    - When the measurement is done, you get the accuracy and the
>>>>    information about the performance at the bottom left
>>>>    - After the measurement, it also shows the output generated using the
>>>>    .prob file, so you can see where the system has failed at tagging
>>>>
>>>> For the TSX management file I have the following idea (this is still under
>>>> discussion with Fran as well, because maybe it isn't as useful as we
>>>> expected):
>>>>
>>>> http://imgur.com/WZFEZLP
>>>>
>>>> How it works:
>>>>
>>>>    - First you choose a TSX file
>>>>    - It shows all the tags in it, with the name and every item
>>>>    - The second column says whether it is closed or not
>>>>    - You can also add new labels and items (for items, you first
>>>>    have to select the parent label)
>>>>    - Once you finish editing, you click on the save button
>>>>
>>>> I still have to refine the constraint grammar interface with Fran, as
>>>> well as how to manage multi-words. I expect to have them before Monday,
>>>> but I think we can start discussing these interfaces now and continue
>>>> with the others when I have them
>>>>
>>>> Regards, and sorry about the length of the email :P
>>>>
>>>
>>>
>>
>>
>> --
>> Gema Ramírez
>> ---------------------
>> Prompsit LE
>> Traduce, extrae, analiza: http://aplica.prompsit.com
>>
>
>


-- 
Gema Ramírez
---------------------
Prompsit LE
Traduce, extrae, analiza: http://aplica.prompsit.com
