Re: Google Summer of Code 2014 - Phonetic Matching Project

Rupert Westenthaler Tue, 11 Mar 2014 04:49:35 -0700

Hi Alain,

great to see your interest in Stanbol and Phonetic Linking. Let me
give you some more context and try to answer your questions.

On Tue, Mar 11, 2014 at 3:27 AM, Alain Boulay <aj_bou...@laurentian.ca> wrote:
> This is what I understand so far (from your website, STANBOL-1291
> Phonetic Linking):
>
> "The main question to be answered is if the phonetic matching (step 4) can
> correctly link Entities even if the writings in the text transcript are
> incorrect."

STANBOL-1291 (Phonetic Linking) has two major parts:

(1) Speech to text: Currently Apache Stanbol does not include such an
engine. STANBOL-1007 suggests implementing such an engine based on CMU
Sphinx.
(2) Phonetic linking: link entities based on the phonetics (and not the
writing) of their labels. Here the issue suggests using/extending the
FST linking engine (STANBOL-1128) so that it can use the Solr
PhoneticFilterFactory.

First you will need to decide whether you would like to cover both or
only (2) in your GSoC proposal. I had the impression that you are more
interested in (2). So if you would like to exclude (1) you could
manually use CMU Sphinx to generate transcripts and send those to
Stanbol.
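
A minimal sketch of sending such a prepared transcript to Stanbol, in
Python. It assumes a Stanbol launcher running locally with the Enhancer
endpoint at http://localhost:8080/enhancer (adjust for your setup); the
function names are made up for illustration.

```python
import urllib.request

# Assumption: default local launcher URL; change it for your installation.
DEFAULT_ENDPOINT = "http://localhost:8080/enhancer"


def build_enhancer_request(transcript, endpoint=DEFAULT_ENDPOINT):
    """Build (but do not send) a POST request carrying a plain-text transcript."""
    return urllib.request.Request(
        endpoint,
        data=transcript.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=UTF-8",
            # Ask for the enhancement results as RDF in Turtle syntax.
            "Accept": "text/turtle",
        },
        method="POST",
    )


def enhance(transcript, endpoint=DEFAULT_ENDPOINT):
    """Send the transcript and return the response body as text."""
    request = build_enhancer_request(transcript, endpoint)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling enhance("...") with a transcript produced by CMU Sphinx would
then return the enhancements for that text, provided the server is
running.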

>
> Perhaps 'soft computing methods' are the best way to answer this question:
> Neural Networks, Bayesian, Fuzzy Sets or Rough Sets, because these methods
> would score well even if the 'writings in the text transcript are
> incorrect'.
>
> I can address this question on many levels given my experience:
> - Computational Linguistics - Experience in coding Artificial Neural
> Networks that will learn phonetic speech. This also applies to text
> recognition and the generation of grammatical rules from the language
> input. I saw that the speech-to-text engine (Stanbol) uses Sphinx, which is
> built using Bayesian approaches (now you have got me really excited!). I
> would be very interested in working with the STANBOL engine to produce tests
> or measures of how well it is linking entities based on its performance as
> an NLP engine along the lines of pattern matching. My experience in working
> with these kinds of networks is with a Neural Net simulator (T Learn) and
> coding MATLAB neural nets.
>

So you suggest implementing your own approach for (2)? Interesting ... I am
not an expert in Computational Linguistics, so most likely you would
need to teach me in that area.

When writing your proposal please provide some information on how you
would train those 'soft computing methods' for vocabularies. Please
also provide information on how these methods would scale in relation
to the number of entities. To provide some context: DBpedia has ~5
million entities; Freebase has ~40 million; typical custom
vocabularies are below 10,000, with some having up to 250,000 entities.

As an example, with the FST linking + PhoneticFilterFactory method:
Training means indexing the labels of the vocabulary using the Solr
PhoneticFilterFactory. After indexing one needs to build an FST model
over the field. This approach would easily scale to vocabularies with
40 million entities.
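
To make the indexing idea concrete, here is a minimal Python sketch. It
is not the Stanbol or Solr implementation: it assumes Soundex as the
phonetic encoder (one of the encoders PhoneticFilterFactory can be
configured with), stands in a plain dictionary for the FST model, and
uses made-up entity URIs. All function names are hypothetical.

```python
from collections import defaultdict

# American Soundex digit for each consonant group.
CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in group:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex code of a word ('' if it has no A-Z letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out = [letters[0]]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return ("".join(out) + "000")[:4]


def phonetic_key(text):
    """Encode every whitespace-separated token of a label or text window."""
    return tuple(soundex(token) for token in text.split())


def build_index(vocabulary):
    """'Training': map the phonetic key of each label to its entity ids."""
    index = defaultdict(set)
    for entity_id, label in vocabulary.items():
        index[phonetic_key(label)].add(entity_id)
    return index


def link(transcript, index, max_tokens=3):
    """Look up every window of up to max_tokens tokens in the phonetic index."""
    tokens = transcript.split()
    matches = []
    for start in range(len(tokens)):
        for length in range(max_tokens, 0, -1):
            window = tokens[start:start + length]
            if len(window) < length:
                continue  # window runs past the end of the transcript
            for entity_id in sorted(index.get(phonetic_key(" ".join(window)), ())):
                matches.append((" ".join(window), entity_id))
    return matches


# Hypothetical two-entity vocabulary and a misspelled transcript.
vocabulary = {
    "urn:example:rupert": "Rupert Westenthaler",
    "urn:example:bischofshofen": "Bischofshofen",
}
index = build_index(vocabulary)
print(link("I met Ruppert Westentaler in Bishofshofen", index))
```

Both misspelled mentions are linked because their tokens share the
Soundex codes of the indexed labels. Soundex is coarse, so on a large
vocabulary many unrelated labels will collide; choosing the encoder is
part of the evaluation.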

> - Text Quality - this would require some kind of examination between a
> trusted sample of the original data and the output text. Experimental
> statistical methods would provide measures, and empirical computational
> methods may provide means of improvement. However, you know the needs and
> if I may have your insight or advice regarding the parameters, I am sure
> that I can produce an excellent proposal.
>

For validation with prepared transcripts the Stanbol Benchmarking Tool
could be used. It uses a very simple syntax (see [1] slide 6).
However, this tool is currently not able to execute automated tests and
provide summaries over multiple tests (see the open issue STANBOL-652
[2]). If you know other tools that can be used for evaluation, feel
free to suggest them.
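
As an illustration of what a summary over multiple tests could look
like, here is a small Python sketch. It is not the Benchmarking Tool's
syntax or output; it assumes that each test case is reduced to a set of
expected entity ids and a set of ids actually linked, and it computes
micro-averaged precision, recall and F1. The names are hypothetical.

```python
def summarize(cases):
    """Micro-averaged precision/recall/F1 over (expected, found) id sets."""
    tp = fp = fn = 0
    for expected, found in cases:
        tp += len(expected & found)   # correctly linked entities
        fp += len(found - expected)   # linked, but not expected
        fn += len(expected - found)   # expected, but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Two made-up test cases: one missed entity, one spurious link.
cases = [
    ({"urn:example:a", "urn:example:b"}, {"urn:example:a"}),
    ({"urn:example:c"}, {"urn:example:c", "urn:example:d"}),
]
print(summarize(cases))
```

Running the same transcripts through different linking approaches and
comparing these numbers would give a first, rough comparison.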

> - I am PASSIONATE about coding Neural Nets and Bayesian nets regarding
> language processing, and have been a student member of academic labs that
> focus on human cognition and language processing (psycholinguistics). I
> also have a very strong interest in becoming active in semantic web
> development. I currently study Human-Computer Interaction at Laurentian,
> and so speech interfaces are really exciting to me. I would really love
> to have a chance to code with you for the summer (and afterwards too!)
> because it would bring me the kind of experience that I can't get here at
> the university.
>
> I am most interested in answers to these kinds of questions (which are
> specific to my application)
>
>
>    - a list of deliverables, quantifiable results for the Apache community,
>    (I am not sure what deliverables will meet your needs, can you suggest?)

The EnhancementEngine implementing the phonetic linking, possibly
several if your proposal includes testing multiple alternatives.
Documentation of the engine, to be included on the Stanbol webpage
under [3]. If you have a blog you should also post updates about
the progress. A presentation of the approach would also be good to
have.

>    - a detailed description / design document, (I am interested in
>    following standards put forward by Apache for design documentation -
>    perhaps I would be able to see guidelines to give me an idea of what to
>    submit)

This kind of documentation is done in JIRA issues. Discussions take
place on the dev mailing list. Development takes place in public
repositories.

>    - an approach, (I would want to meet expectations in the approach)

I am not an expert in that field, so I cannot help with the approach.
However, if you choose a different one I might find the time to
implement the FST linking + PhoneticFilterFactory method, so that one
can compare results between the two.

>    - an approximate schedule and

For a GSoC proposal the mid-term evaluation is the most important date
as your potential mentor will need to evaluate your work at this
point.

>    - something of a background text. (does this mean literature search,
>    citations? I can provide these, or what ever else is needed)

Last year I was looking at some of the cited literature while
reviewing proposals.

Finally, note that for the inclusion of your results all used
dependencies need to be compatible with the Apache Software License (see
[4] for more information). So please check that the frameworks you plan
to use have a license that is compatible with the ASL.

best
Rupert Westenthaler

>
> Thanks very much for your advice!! I will submit a proposal as soon as I
> receive your reply!
>
> AJ Boulay, MSc

[1] http://www.slideshare.net/bdelacretaz/bertrand-stanbolbenchmarksapril2011
[2] https://issues.apache.org/jira/browse/STANBOL-652
[3] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/list
[4] https://www.apache.org/legal/3party.html#categories

-- 
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
