Re: [Apertium-stuff] "Interface for creating tagged corpora" GSOC 13

oscar ramirez Thu, 18 Apr 2013 09:38:12 -0700

Hello Gema,

Sorry for the delay in replying; I have been busy with a university exam.
I have gone through your thoughts and made some changes to the UIs. You
can check the new UIs and functionality on the wiki entry
<http://wiki.apertium.org/wiki/User:Tuxskar/Application_for_%22Interface_for_creating_tagged_corpora%22_GSOC_2013#On_the_GSOC_period>,
but the main changes are:



   - I have added a new text view to the TSX file UI so that new
   categories and forbid and enforce rules can be inserted directly into
   the file. It still shows the contents of the current TSX file.
   - In the corpus input UI I have added the option to select a Wikipedia
   dump file, a plain-text corpus or an already tagged corpus (maybe we
   could have just 2 file browsers and decide by looking at the file
   extension). I have also used some Wikipedia dump files to generate a
   tagged corpus, and we should think about this: transforming a
   compressed file (tar.gz or bz) into a tagged corpus takes some time,
   so maybe we should add a progress bar that runs until the
   transformation (decompression and creation of the corpus) is finished.
   - I have already changed the corpus tagger UI as you suggested, and
   adding the shortcuts is no problem :)
   - About multi-words: my idea was to select the proper word (the
   multi-word one or the simple one) and then update the language
   dictionary, replacing the old word with the new one. But I think there
   is no easy way to manage the dictionary, because the dictionaries are
   really big and opening the file, reading the entry and modifying it
   could be really slow. I'm still thinking about it.
   - About the schedule: I have changed it, but now it could be harder to
   have the first deliverable sufficiently finished (it will be
   programmed, but with little testing and documentation). I have also
   changed the workload a bit and adjusted the other tasks to fit the
   new schedule :D
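
For the progress bar idea, here is a minimal sketch of how the decompression
step could report progress. It assumes a bz2-compressed dump and measures
progress by compressed bytes consumed; the function name and the on_progress
callback are hypothetical (in a GTK UI the callback could, for example, update
a progress bar's fraction).

```python
import bz2
import io


def decompress_with_progress(src, dst, total_size, on_progress,
                             chunk_size=64 * 1024):
    """Stream-decompress bz2 data from src to dst, reporting progress.

    src and dst are binary file objects, total_size is the compressed size
    in bytes, and on_progress receives a float between 0 and 1.
    """
    decompressor = bz2.BZ2Decompressor()
    consumed = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        data = chunk
        while data:
            dst.write(decompressor.decompress(data))
            if decompressor.eof:
                # A file may hold several concatenated bz2 streams:
                # continue with a fresh decompressor on the leftover bytes.
                data = decompressor.unused_data
                decompressor = bz2.BZ2Decompressor()
            else:
                data = b""
        consumed += len(chunk)
        on_progress(consumed / total_size if total_size else 1.0)


# Small in-memory demonstration (no real dump file involved).
payload = b"hello corpus\n" * 1000
compressed = bz2.compress(payload)
source, target, fractions = io.BytesIO(compressed), io.BytesIO(), []
decompress_with_progress(source, target, len(compressed),
                         fractions.append, chunk_size=16)
```

Reading in small chunks keeps memory use flat and gives the UI a chance to
refresh between chunks; the creation of the tagged corpus would need its own
progress estimate.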

Note: about the document you linked in the email, could you send me the
diagram that should be on page 70, or upload it somewhere? It seems
there is a LaTeX error and the bottom part of the diagram is cut off,
and I think it would be very useful :D

Regards and thank you. Let me know if you find a solution for managing
multi-words or if you think it is viable to change the dictionaries.


On Mon, Apr 15, 2013 at 12:58 PM, Gema Ramírez-Sánchez
<[email protected]> wrote:

> Sorry, reply to all...
>
>
> ---------- Forwarded message ----------
> From: Gema Ramírez-Sánchez <[email protected]>
> Date: Mon, Apr 15, 2013 at 12:56 PM
> Subject: Re: [Apertium-stuff] "Interface for creating tagged corpora" GSOC
> 13
> To: oscar ramirez <[email protected]>
>
>
> Thank you Oscar,
>
> my thoughts:
>
> - interface for uploading a *Input corpus file UI*:
>
>      - it should have an option for:
>
>             ** uploading a tagged corpus
>             ** uploading a non-tagged corpus and tagging it provided that
> a PoS tagger for that language exists
>             ** compiling and tagging a corpus for a given language from
> the wikipedia as explained in
> http://wiki.apertium.org/wiki/Tagger_training#Creating_a_corpus
>
>
> - interface for *supervised corpus tagger UI*: I like the first
> version; the second one can be very good for developers and people used to
> Apertium but a little bit scary for newcomers. Another possibility could be
> something like:
>
> ----------------------------------------------------------
> Cierran el orfanato en el que se inspiró Lennon *para*/O1-para*<pr>*/
> O2-parar*<vblex><pri><p3><sg>*/O3-parar*<vblex><imp><p2><sg>*  escribir
> Strawberry Fields.
>
> NEXT AMBIGUOUS WORD  [CTRL+N]
> NEXT WORD [→]
> PREVIOUS AMBIGUOUS WORD  [CTRL+P]
> PREVIOUS WORD [←]
> --------------------------------------------------------
>
>     where
>
>        **you maintain readability but can also examine non-ambiguous
> words (so that you see the context) and previously disambiguated words (just
> in case you want to check).
>
>        **you can use the keyboard or the mouse (it will be repetitive
> work, so let the user decide)
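
As a rough illustration of the display proposed above, the following sketch
turns one lexical unit in the Apertium stream format (^surface/analysis/...$)
into the "O1-.../O2-..." form. The function name is hypothetical, and escaped
characters inside the unit are not handled.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2/...$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def render_options(unit):
    """Return the surface form, followed by numbered options if ambiguous."""
    match = LEXICAL_UNIT.fullmatch(unit)
    if match is None:
        raise ValueError("not a lexical unit: %r" % unit)
    surface = match.group(1)
    analyses = match.group(2).split("/")
    if len(analyses) == 1:
        # Unambiguous words are shown as plain text to keep readability.
        return surface
    options = "/".join(
        "O%d-%s" % (number, analysis)
        for number, analysis in enumerate(analyses, 1)
    )
    return surface + "/" + options


ambiguous = render_options(
    "^para/para<pr>/parar<vblex><pri><p3><sg>/parar<vblex><imp><p2><sg>$"
)
plain = render_options("^escribir/escribir<vblex><inf>$")
```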
>
> - I see your effort in figuring out how to tackle the "*user-friendly
> interface to train a supervised tagger*" task (we still have to write the
> HOWTO in the wiki). The training process is quite similar to the
> unsupervised one that you have already performed, so the interface should
> enable uploading the necessary files or creating them (see required files at
> http://manpages.ubuntu.com/manpages/hardy/man1/apertium-tagger.1.html),
> launching the training for a given language and generating a .prob.
>
> - the *TSX management file UI*: it can be a good approach for viewing and
> editing a TSX. We will have to think about how to address enforce, forbid
> and prefer rules (look at . BTW, these rules are applied to the .prob once
> it has been created; we have to remember this as an option for the training
> interface).
>
> I think that, for the previous 2 tasks, it might be useful to take a look
> at sections 3.2 Part-of-speech tagger and 5.4 Adding data for the lexical
> categorial disambiguator (part-of-speech tagger) of the Apertium
> documentation: http://xixona.dlsi.ua.es/~fran/apertium2-documentation.pdf
>
> - *Constraint grammar rule manager UI*: I'm not very familiar with this.
> Maybe Fran or someone else could help.
>
> - For the last task, *A way to take into account automatically new
> multiwords / different tokenisation*, let me try to explain the problem:
> imagine that we had two entries in our dictionaries, one for 'strawberry'
> and one for 'field', and a hand-tagged corpus containing "Strawberry
> Fields". The tagged version would contain two separate tagged words:
> strawberry<n><sg> and field<n><pl>. Then someone decides to add 'Strawberry
> Fields' to the dictionary as a multiple-word unit representing a proper
> noun. When training a supervised tagger, the training process will tell us:
> dictionary output ('Strawberry Fields') and corpus ('Strawberry' and
> 'Fields') differ, please correct this inconsistency. So we have to go and
> modify the tagged corpus to contain this multiple-word unit so that the
> training process can go on.
>
> That said, I don't quite see how we could automate the addition of
> multiwords...
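
One possible direction, as a rough sketch only: if the tagged corpus is held
as a list of (surface, analysis) pairs and the new multiwords are known,
adjacent tokens could be merged before training. The data layout, the
function name and the <np><al> analysis used here are assumptions for
illustration, not how apertium-tagger represents anything, and a human would
still need to confirm each merge.

```python
def merge_multiwords(tokens, multiwords):
    """Merge adjacent tokens that form a known multiword.

    tokens: list of (surface, analysis) pairs from the tagged corpus.
    multiwords: dict mapping a tuple of surface forms to the analysis of
    the multiword unit.
    """
    longest = max((len(key) for key in multiwords), default=0)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so longer units win.
        for size in range(longest, 1, -1):
            key = tuple(surface for surface, _ in tokens[i:i + size])
            if len(key) == size and key in multiwords:
                merged.append((" ".join(key), multiwords[key]))
                i += size
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged


corpus = [
    ("Strawberry", "strawberry<n><sg>"),
    ("Fields", "field<n><pl>"),
    ("forever", "forever<adv>"),
]
units = {("Strawberry", "Fields"): "Strawberry Fields<np><al>"}
result = merge_multiwords(corpus, units)
```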
>
> - About the schedule: I would redo the schedule, as I think there is a
> more logical path in which each task helps you move on to the next one: UI
> to train a supervised tagger (to have the whole picture), UI to manually
> disambiguate tagged corpora, UIs for TSX and constraint grammar management
> and UI to test .prob performance.
>
>
> Again, thank you for the effort!
>
> Gema.
>
>
>
>
>
>
>
>
>
>
>
> On Mon, Apr 15, 2013 at 12:35 AM, oscar ramirez <[email protected]> wrote:
>
>> Hello,
>>
>> My application for the idea "Interface for creating tagged corpora" for
>> GSoC 2013 is ready. I have written a wiki page with all the interfaces I
>> have already implemented (as mockups), the timetable for GSoC and the
>> period before it, and my bio:
>>
>>
>> http://wiki.apertium.org/wiki/User:Tuxskar/Application_for_%22Interface_for_creating_tagged_corpora%22_GSOC_2013#On_the_GSOC_period
>>
>> Gema, Mikel and Fran, I think we can discuss it whenever you can :D
>>
>>
>> On Sat, Apr 13, 2013 at 2:37 AM, oscar ramirez <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> As Tino said, the problem associated with PCRE is actually a problem
>>> with the regexp-trules.txt file. I thought the problem might be that I
>>> was using the en-es language pair, so today I tried following the guide
>>> again, this time with the es-ca pair, and I got the same error.
>>>
>>> It seems to be a problem with the command:
>>> $ apertium-xtract-regex-trules trules.xml > regexp-trules.txt
>>>
>>> This is the guide I'm following:
>>> http://wiki.apertium.org/wiki/Target-language_tagger_training
>>>
>>> After running this command, it only works if you escape the # character
>>> and use just the t1x file. I don't know whether escaping that character
>>> manually is normal or whether I'm doing something wrong.
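
For reference, the manual escaping described above could be automated with
something like the sketch below. It only assumes that every unescaped # in
regexp-trules.txt needs a backslash in front of it, as observed here; whether
that is the right fix for the underlying problem is not confirmed. The
function name is hypothetical.

```python
import re

# A '#' that is not already preceded by a backslash.
# Limitation: an escaped backslash followed by '#' is left untouched.
UNESCAPED_HASH = re.compile(r"(?<!\\)#")


def escape_hashes(line):
    """Put a backslash in front of every unescaped '#' in the line."""
    return UNESCAPED_HASH.sub(r"\\#", line)


once = escape_hashes("a#b#c")
twice = escape_hashes(once)  # already escaped input is left unchanged
```

The function can be applied line by line to the generated file before it is
passed on to the next step of the guide.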
>>>
>>> Gema, about the mockups: I have been talking with Fran about some doubts
>>> I had, and I have early versions of the input file UI, the corpus tagger
>>> UI, the .prob performance measure UI and the TSX file manager UI.
>>>
>>> Instead of making them with a mockup tool, I have implemented dummy
>>> interfaces with Glade, launched from Python, because I think it is
>>> better to have them as they will look in the end.
>>>
>>> The first one is really simple. It is the *Input corpus file UI*:
>>>
>>> http://imgur.com/PiZgmwa
>>>
>>> How it works:
>>>
>>>    - You can select a corpus file
>>>    - You can modify the current corpus (or enter a new corpus from
>>>    scratch)
>>>    - Once you finish editing, you click the apply button and the corpus
>>>    tagger UI is shown
>>>
>>>
>>> Here you can see the 2 versions of the interface for manual
>>> supervision, the *corpus tagger UI*:
>>>
>>>    - First version: you have the corpus and go word by word, choosing
>>>    the correct word according to its properties:
>>>    http://imgur.com/hOXb9WK
>>>
>>>
>>>    - Second version (proposed by Fran): you have just the tagged corpus
>>>    to be disambiguated, with the ambiguous words highlighted; you choose
>>>    the correct one by clicking on it, and hovering the mouse over a word
>>>    shows more information about the preferences:
>>>    http://imgur.com/5G2mqOl
>>>
>>> The difference between them is that the first lets you load a corpus
>>> file and modify it in the text view using the buttons, while the second
>>> lets you resolve the ambiguity directly in the text view.
>>>
>>> Both designs have 2 buttons, one to finish the supervision and the other
>>> to launch the .prob performance measure.
>>>
>>> For the *performance measure* .prob file UI I have this in mind:
>>> http://imgur.com/RnSiZ7G
>>>
>>> How it works:
>>>
>>>    - First, we load the base tagged corpus in the left text view (if
>>>    you came from the previous window, it is already inserted)
>>>    - Then you choose your .prob file
>>>    - Somehow (maybe by manual selection or automatically) some part of
>>>    the tagged corpus is selected to train the .prob file
>>>    - Once everything is set up, you click the "measure" button to
>>>    measure the performance
>>>    - When the measurement is done, the accuracy and the performance
>>>    information are shown at the bottom left
>>>    - After the measurement it also shows the output generated using the
>>>    .prob file, so you can see where the system has failed in tagging
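
The accuracy figure in the steps above could be computed along these lines.
This is a minimal sketch assuming the reference (hand-tagged) corpus and the
tagger output are already aligned token by token; the function name and the
sample analyses are made up for illustration.

```python
def tagging_accuracy(reference, predicted):
    """Compare two aligned lists of analyses.

    Returns (accuracy, errors), where errors is a list of
    (position, expected, got) tuples for the UI to highlight.
    """
    if len(reference) != len(predicted):
        raise ValueError("corpora are not aligned token by token")
    if not reference:
        return 0.0, []
    errors = [
        (position, expected, got)
        for position, (expected, got) in enumerate(zip(reference, predicted))
        if expected != got
    ]
    return 1 - len(errors) / len(reference), errors


hand_tagged = ["el<det>", "orfanato<n>", "para<pr>", "escribir<vblex>"]
tagger_output = ["el<det>", "orfanato<n>", "parar<vblex>", "escribir<vblex>"]
accuracy, mistakes = tagging_accuracy(hand_tagged, tagger_output)
```

Returning the mismatching positions alongside the score is what would let
the right-hand view show where the tagger failed.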
>>>
>>> For the TSX management file I have the following idea (this is still
>>> under discussion with Fran as well, because it may not be as useful as
>>> we expected):
>>>
>>> http://imgur.com/WZFEZLP
>>>
>>> How it works:
>>>
>>>    - First you choose a TSX file
>>>    - It shows all the tags in it, with the name and every item
>>>    - The second column says whether the tag is closed or not
>>>    - You can also add new labels and items (for items, you first have
>>>    to select the parent label)
>>>    - Once you finish editing, you click the save button
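
To fill the table described above, the labels could be read from the TSX
file roughly as follows. The sketch assumes the usual TSX layout of def-label
elements with a name, an optional closed attribute and tags-item children;
the sample content and the function name are invented for the example.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample, not taken from a real language pair.
SAMPLE_TSX = """
<tagger name="sample">
  <tagset>
    <def-label name="PREP" closed="true">
      <tags-item tags="pr"/>
    </def-label>
    <def-label name="NOUN">
      <tags-item tags="n.*"/>
    </def-label>
  </tagset>
</tagger>
"""


def list_labels(tsx_text):
    """Return one (name, closed, items) row per def-label element."""
    root = ET.fromstring(tsx_text.strip())
    rows = []
    for label in root.iter("def-label"):
        items = [item.get("tags") for item in label.findall("tags-item")]
        closed = label.get("closed") == "true"
        rows.append((label.get("name"), closed, items))
    return rows


rows = list_labels(SAMPLE_TSX)
```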
>>>
>>> I still have to refine the constraint grammar interface and the
>>> handling of multi-words with Fran. I expect to have them before Monday,
>>> but I think we can start discussing these interfaces and continue with
>>> the others when I have them.
>>>
>>> Regards, and sorry about the length of the email :P
>>>
>>
>>
>
>
> --
> Gema Ramírez
> ---------------------
> Prompsit LE
> Traduce, extrae, analiza: http://aplica.prompsit.com
>
>
>
>


_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
