--- INTERSPEECH 2014 - SINGAPORE
--- September 14-18, 2014
--- http://www.INTERSPEECH2014.org



The INTERSPEECH 2014 Organising Committee is pleased to announce
the following 8 tutorials, presented by distinguished speakers,
which will be offered at the conference on Sunday, 14 September 2014.
All tutorials are three (3) hours long and require an additional
registration fee (separate from the conference registration fee).

    • Non-speech acoustic event detection and classification
    • Contribution of MRI to Exploring and Modeling Speech Production
    • Computational Models for Audiovisual Emotion Perception
    • The Art and Science of Speech Feature Engineering
    • Recent Advances in Speaker Diarization
    • Multimodal Speech Recognition with the AusTalk 3D
      Audio-Visual Corpus
    • Semantic Web and Linked Big Data Resources for
      Spoken Language Processing
    • Speech and Audio for Multimedia Semantics


----------------------------------------------------------------------------------------------------
ISCSLP Tutorials @ INTERSPEECH 2014
----------------------------------------------------------------------------------------------------

Additionally, the ISCSLP 2014 Organising Committee welcomes
INTERSPEECH 2014 delegates to join the 4 ISCSLP tutorials,
which will be offered on Saturday, 13 September 2014.

    • Adaptation Techniques for Statistical Speech Recognition
    • Emotion and Mental State Recognition: Features, Models, System
      Applications and Beyond
    • Unsupervised Speech and Language Processing via Topic Models
    • Deep Learning for Speech Generation and Synthesis


More information available at:
http://www.interspeech2014.org/public.php?page=tutorial.html


----------------------------------------------------------------------------------------------------
Tutorial Descriptions
----------------------------------------------------------------------------------------------------

T1: Non-speech acoustic event detection and classification

    The research in audio signal processing has been dominated by
    speech research, but most of the sounds in our real-life
    environments are actually non-speech events such as cars passing
    by, wind, warning beeps, and animal sounds. These acoustic events
    contain much information about the environment and physical
    events that take place in it, enabling novel application areas such
    as safety, health monitoring and investigation of biodiversity.
    But while recent years have seen widespread adoption of
    applications such as speech recognition and song recognition,
    generic computer audition is still in its infancy.

    Non-speech acoustic events differ from speech in several
    fundamental ways, but many of the core algorithms used by speech
    researchers can be leveraged for generic audio analysis. The
    tutorial is a comprehensive review of the field of acoustic event
    detection as it currently stands. Its goal is to foster interest
    in the community, highlight the challenges and opportunities, and
    provide a starting point for new researchers. We will discuss what
    acoustic event detection entails and its commonalities with and
    differences from speech processing, such as the large variation in
    sounds and the possible overlap between sounds. We will then
    discuss basic experimental and algorithm design, including
    descriptions of available databases and machine learning methods,
    before moving on to more advanced topics such as methods to deal
    with temporally overlapping sounds and modelling the relations
    between sounds. We will finish with a discussion of avenues for
    future research.
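
    As a rough illustration of this framing, here is a minimal sketch
    (Python with numpy and scikit-learn, on synthetic stand-in data)
    that treats detection of possibly overlapping events as per-frame
    multi-label classification, with one binary detector per
    hypothetical event class:

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.multioutput import MultiOutputClassifier

        rng = np.random.default_rng(0)

        # Synthetic stand-ins for per-frame acoustic features (e.g. MFCCs
        # or mel-band energies) and multi-label targets, one column per
        # hypothetical event class ("car", "beep", "bird").
        n_frames, n_features, n_events = 2000, 40, 3
        X = rng.normal(size=(n_frames, n_features))
        Y = (rng.random(size=(n_frames, n_events)) < 0.2).astype(int)

        # One binary detector per event class; a frame may activate several
        # detectors at once, which is how temporal overlap is represented.
        clf = MultiOutputClassifier(LogisticRegression(max_iter=1000))
        clf.fit(X, Y)
        activity = clf.predict(X[:5])   # per-frame event-activity matrix
        print(activity)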

    Organizers: Tuomas Virtanen and Jort F. Gemmeke


T2: Contribution of MRI to Exploring and Modeling Speech Production

    Magnetic resonance imaging (MRI) offers a remarkable view into the
    human body in various ways, not only with static imaging but also
    with motion imaging. MRI has become a powerful technique for speech
    research, used to study the finer anatomy of the speech organs and
    to visualize true vocal tracts in three dimensions. Inherent
    problems such as slow image acquisition during speech tasks and
    insufficient signal-to-noise ratio for microscopic observation have
    driven researchers to search for task-specific imaging techniques.
    Recent advances in 3-Tesla technology suggest more practical
    solutions for broader applications of MRI by overcoming previous
    technical limitations. In this joint two-part tutorial, we
    summarize our previous efforts to accumulate scientific knowledge
    with MRI and to advance speech modeling studies for future
    development. Part I, given by Kiyoshi Honda,
    introduces how to visualize the speech organs and vocal tracts by
    presenting techniques and data for finer static
    imaging, synchronized motion imaging, surface marker tracking,
    real-time imaging, and vocal-tract mechanical modeling. Part II,
    presented by Jianwu Dang, focuses on applications of MRI for
    phonetics of Mandarin vowels, acoustics of the vocal tracts
    with side branches, analysis and simulation in search of talker
    characteristics, physiological modeling of the articulatory system,
    and motor control paradigm for speech articulation.

    Organizers: Kiyoshi HONDA and Jianwu DANG


T3: Computational Models for Audiovisual Emotion Perception

    In this tutorial we will explore engineering approaches to
    understanding human emotion perception, focusing on both modeling
    and application. We will highlight current and historical trends
    in emotion perception modeling, covering both psychological and
    engineering-driven theories of perception (statistical analyses,
    data-driven computational modeling, and implicit sensing). The
    importance of this topic can be appreciated from an engineering
    viewpoint (any system that either models human behavior or
    interacts with human partners must understand emotion perception,
    as it fundamentally underlies and modulates our communication) as
    well as from a psychological perspective (emotion perception is
    used in the diagnosis of many mental health conditions and is
    tracked in therapeutic interventions). Research in emotion
    perception seeks to identify
    models that describe the felt sense of ‘typical’ emotion expression
    – i.e., an observer/evaluator’s attribution
    of the emotional state of the speaker. This felt sense is a
    function of the methods through which individuals integrate the
    presented multimodal emotional information.
    We will cover psychological theories of emotion, engineering models
    of emotion, and experimental approaches to measure emotion. We will
    demonstrate how these modeling
    strategies can be used as a component of emotion classification
    frameworks and how they can be used to inform the design of
    emotional behaviors.

    Organizers: Emily Mower Provost and Carlos Busso


T4: The Art and Science of Speech Feature Engineering

    With significant advances in mobile technology and audio sensing
    devices, there is a fundamental need to describe vast amounts of
    audio data in terms of representative lower-dimensional descriptors
    for efficient automatic processing. The extraction of these signal
    representations, also called features, constitutes the first step
    in processing a speech signal. The art and science of feature
    engineering lies in addressing two inherent challenges: extracting
    sufficient information from the speech signal for the task at hand,
    and suppressing unwanted redundancies for computational efficiency
    and robustness. The area of speech feature extraction combines a
    wide variety of disciplines such as signal processing, machine
    learning, psychophysics, information theory, linguistics and
    physiology. It has a rich history spanning more than five decades
    and has seen tremendous advances in the last few years, which have
    propelled the transition of speech technology from controlled
    environments to millions of end-user applications.

    In this tutorial, we review the evolution of speech feature
    processing methods, summarize the recent advances of the last two
    decades and provide insights into the future of feature
    engineering. This includes discussions of the spectral
    representation methods developed in the past, techniques motivated
    by the human auditory system for robust speech processing,
    data-driven unsupervised features such as i-vectors, and recent
    advances in deep neural network based techniques. With experimental
    results, we will also illustrate the impact of these features on
    various state-of-the-art speech processing systems. The future of
    speech signal processing will need to address various robustness
    issues in complex acoustic environments while deriving useful
    information from big data.
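
    As a small concrete example of the feature extraction step
    described above, the following sketch (Python with the librosa and
    numpy libraries; the input file name is a hypothetical placeholder)
    computes 13 MFCCs per frame and appends delta and delta-delta
    coefficients, one of the classic spectral representations:

        import numpy as np
        import librosa

        # Load an utterance (hypothetical path) and compute 13 MFCCs per
        # 25 ms frame with a 10 ms hop.
        y, sr = librosa.load("utterance.wav", sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=int(0.025 * sr),
                                    hop_length=int(0.010 * sr))

        # Append first- and second-order deltas, a common way to also
        # capture the temporal dynamics of the spectrum.
        feats = np.vstack([mfcc,
                           librosa.feature.delta(mfcc),
                           librosa.feature.delta(mfcc, order=2)])
        print(feats.shape)   # (39, number_of_frames)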

    Organizers: Sriram Ganapathy and Samuel Thomas


T5: Recent Advances in Speaker Diarization

    The tutorial will start with an introduction to speaker diarization
    giving a general overview of the subject. Afterwards, we will cover
    the basic background, including feature extraction and common
    modeling techniques such as GMMs and HMMs. Then, we will discuss
    voice activity detection, the first processing step usually
    performed in speaker diarization, and subsequently describe the
    classic approaches to speaker diarization that are widely used
    today. We will then introduce state-of-the-art techniques in
    speaker recognition required to understand modern speaker
    diarization techniques. Next, we will describe approaches for
    speaker diarization using advanced representation methods
    (supervectors, speaker factors, i-vectors), as well as supervised
    and unsupervised learning techniques used for speaker diarization.
    We will also discuss issues such as coping with an unknown number
    of speakers, detecting and dealing with overlapping speech,
    diarization confidence estimation, and online speaker diarization.
    Finally, we will discuss two recent lines of work: exploiting a
    priori acoustic information (for example, processing a meeting
    when some of the participants are known to the system in advance
    and training data is available for them), and modeling
    speaker-turn dynamics. If time
    permits, we will also discuss concepts
    such as multi-modal diarization and using TDOA (time difference of
    arrival) for diarization of meetings.
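
    To make the basic pipeline concrete, here is a minimal sketch
    (Python with librosa and scikit-learn; the file name, segment
    length and number of speakers are illustrative assumptions) that
    applies a crude energy-based voice activity detector, summarizes
    fixed-length segments by MFCC statistics, and clusters them into
    speakers:

        import numpy as np
        import librosa
        from sklearn.cluster import AgglomerativeClustering

        # Load a recording (hypothetical path) and drop non-speech with a
        # simple energy-based voice activity detector.
        y, sr = librosa.load("meeting.wav", sr=16000)
        intervals = librosa.effects.split(y, top_db=30)   # crude energy VAD

        seg_len = int(1.0 * sr)                           # 1-second segments
        segments, times = [], []
        for start, end in intervals:
            for s in range(start, end - seg_len + 1, seg_len):
                mfcc = librosa.feature.mfcc(y=y[s:s + seg_len], sr=sr,
                                            n_mfcc=20)
                # Represent each segment by its mean and std MFCC vector,
                # a very rough stand-in for supervectors or i-vectors.
                segments.append(np.hstack([mfcc.mean(axis=1),
                                           mfcc.std(axis=1)]))
                times.append(s / sr)

        # Cluster segments into a fixed number of speakers (assumed known
        # here; estimating it is one of the open issues discussed above).
        labels = AgglomerativeClustering(n_clusters=2).fit_predict(
            np.array(segments))
        for t, spk in zip(times, labels):
            print(f"{t:7.2f}s  speaker {spk}")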

    Organizer: Hagai Aronowitz


T6: Multimodal Speech Recognition with the AusTalk 3D Audio-Visual
    Corpus

    This tutorial will provide attendees with a brief overview of
    3D-based audio-visual speech recognition (AVSR) research.
    Attendees will learn how to use the newly developed 3D audio-visual
    data corpus we derived from the AusTalk corpus
    (https://austalk.edu.au/) for audio-visual speech/speaker
    recognition. In addition, we plan to present results obtained with
    this corpus, which show a significant increase in recognition
    accuracy when depth-level and grey-level visual features are
    integrated. In the first part of the tutorial, we will review
    recent work published in the last decade, so that attendees
    can obtain an overview of the fundamental concepts
    and challenges in this field. In the second part of the tutorial,
    we will briefly describe the recording protocol and contents of the
    3D data corpus, and show attendees how to use
    this corpus for their own research. In the third part of this
    tutorial, we will present our results using the 3D data corpus. The
    experimental results show that, compared with the
    conventional AVSR based on the audio and grey-level visual
    features, the integration of grey and depth visual information can
    boost AVSR accuracy significantly. Moreover, we will explain
    experimentally why adding depth information benefits standard AVSR
    systems. Ultimately, we hope this tutorial will inspire more
    researchers in the community to contribute to this exciting area
    of research.

    Organizers: Roberto Togneri, Mohammed Bennamoun and Chao (Luke) Sui


T7: Semantic Web and Linked Big Data Resources for Spoken Language
    Processing

    State-of-the-art statistical spoken language processing typically
    requires significant manual effort to construct domain-specific
    schemas (ontologies) as well as manual effort to annotate training
    data against these schemas. At the same time, a recent surge of
    activity and progress on semantic web-related
    concepts from the large search-engine companies represents a
    potential alternative to the manually intensive design of spoken
    language processing systems. Standards such as schema.org have been
    established for schemas (ontologies) that webmasters can use to
    semantically and uniformly mark up their web pages.
    Search engines like Bing, Google, and Yandex have adopted these
    standards and are leveraging them to create semantic search engines
    at the scale of the web. As a result, open linked data resources
    and semantic graphs covering various domains (such as Freebase [3])
    have grown massively every year and contain far more information
    than any single resource anywhere on the Web.
    Furthermore, these resources contain links to text data (such as
    Wikipedia pages) related to the knowledge in the graph.

    Recently, several studies on spoken language processing have
    started exploiting these massive linked data resources for language
    modeling and spoken language understanding. This tutorial will
    include a brief introduction to the semantic web and the linked
    data structure, available resources, and querying languages.
    An overview of related work on information extraction and language
    processing will be presented, where the main focus will be on
    methods for learning spoken language
    understanding models from these resources.
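
    As a small taste of the querying languages involved, the sketch
    below (Python with the requests library; the public DBpedia
    endpoint, its predefined dbo: prefix and the example classes are
    assumptions of this sketch) retrieves a few entities from a linked
    data graph with SPARQL:

        import requests

        # A SPARQL query against a public linked data endpoint (DBpedia).
        query = """
        SELECT ?person ?birthPlace WHERE {
          ?person a dbo:MusicalArtist ;
                  dbo:birthPlace ?birthPlace .
        } LIMIT 5
        """
        resp = requests.get(
            "https://dbpedia.org/sparql",
            params={"query": query,
                    "format": "application/sparql-results+json"},
            timeout=30,
        )
        for row in resp.json()["results"]["bindings"]:
            print(row["person"]["value"], "->", row["birthPlace"]["value"])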

    Organizers: Dilek Hakkani-Tür and Larry Heck


T8: Speech and Audio for Multimedia Semantics

    Internet media sharing sites and the one-click upload capability of
    smartphones are producing a deluge of multimedia content. While
    visual features are often dominant in such material, acoustic
    information, and speech in particular, often complements the
    visual content.
    By facilitating access to large amounts of data, the text-based
    Internet gave a huge boost to the field of natural language
    processing. The vast amount of consumer-produced video becoming
    available now will do the same for video processing, eventually
    enabling semantic understanding of multimedia material, with
    implications for human computer interaction, robotics, etc.

    Large-scale multi-modal analysis of audio-visual material is now
    central to a number of multi-site research projects around the
    world. While each of these projects has slightly different targets,
    they face largely the same challenges: how to robustly and
    efficiently process large amounts of data, how to represent and
    then fuse information across modalities, how to train classifiers
    and segmenters on unlabeled data, how to include human feedback,
    etc.

    In this tutorial, we will present the state of the art in
    large-scale video, speech, and non-speech audio processing, and
    show how these approaches are being applied to tasks
    such as content based video retrieval (CBVR) and multimedia event
    detection (MED). We will introduce the most important tools and
    techniques, and show how the combination of
    information across modalities can be used to induce semantics on
    multimedia material through ranking of information and fusion.
    Finally, we will discuss opportunities
    for research that the INTERSPEECH community specifically will find
    interesting and fertile.

    Organizers: Florian Metze and Koichi Shinoda


----------------------------------------------------------------------------------------------------
ISCSLP Tutorials @ INTERSPEECH 2014 Descriptions
----------------------------------------------------------------------------------------------------

ISCSLP-T1: Adaptation Techniques for Statistical Speech Recognition

    Adaptation is a technique to make better use of existing models for
    test data from new acoustic or linguistic conditions. It is an
    important and challenging research area of statistical speech
    recognition. This tutorial gives a systematic review of fundamental
    theories as well as an introduction to state-of-the-art adaptation
    techniques, covering both acoustic and language model adaptation.
    Following a simple example of acoustic model adaptation, basic
    concepts, procedures and categories of adaptation will be
    introduced. Then, a number of advanced adaptation techniques will
    be discussed, such as discriminative adaptation, Deep Neural
    Network adaptation, adaptive training, and the relationship to
    noise robustness. After the detailed review of acoustic model
    adaptation, an introduction to language model adaptation, such as
    topic adaptation, will also be given. The whole tutorial is then
    summarised and future research directions will be discussed.
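
    For readers who want a concrete anchor, here is a minimal sketch of
    one basic acoustic model adaptation technique of the kind reviewed
    here, MAP adaptation of GMM means (Python with numpy and
    scikit-learn; the random data, model size and relevance factor are
    illustrative):

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def map_adapt_means(gmm, X, tau=10.0):
            """MAP-adapt the means of a trained GMM to adaptation data X.

            tau is the relevance factor controlling how strongly the new
            data pulls each mean away from its unadapted (prior) value."""
            gamma = gmm.predict_proba(X)            # responsibilities (T, K)
            n_k = gamma.sum(axis=0)                 # soft counts per component
            f_k = gamma.T @ X                       # first-order stats (K, D)
            alpha = n_k / (n_k + tau)               # adaptation weight per component
            ml_means = f_k / np.maximum(n_k, 1e-8)[:, None]
            return alpha[:, None] * ml_means + (1.0 - alpha)[:, None] * gmm.means_

        # Toy usage: random features stand in for frames from an old
        # (training) condition and a shifted new (adaptation) condition.
        rng = np.random.default_rng(0)
        X_train = rng.normal(size=(2000, 13))
        X_adapt = rng.normal(loc=0.5, size=(200, 13))

        gmm = GaussianMixture(n_components=8, covariance_type="diag",
                              random_state=0).fit(X_train)
        gmm.means_ = map_adapt_means(gmm, X_adapt, tau=16.0)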

    Organizer: Kai Yu


ISCSLP-T2: Emotion and Mental State Recognition: Features, Models,
           System Applications and Beyond

    Emotion recognition is the ability to identify what you are feeling
    from moment to moment and to understand the connection between your
    feelings and your expressions. In today’s world, human-computer
    interaction (HCI) interfaces undoubtedly play an important role in
    our daily life. Toward harmonious HCI interfaces, the automated
    analysis and recognition of human emotion have attracted increasing
    attention from researchers across many disciplines. A
    specific area of current interest that also has key implications
    for HCI is the estimation of cognitive load (mental workload),
    research into which is still at an early stage. Technologies for
    processing daily activities including speech, text and music have
    expanded the interaction modalities between humans and computer-
    supported communicational artifacts.

    In this tutorial, we will present theoretical and practical work
    offering new and broad views of the latest research in emotional
    awareness from audio and speech. We will discuss several topics
    spanning a variety of theoretical backgrounds and applications,
    ranging from salient emotional features, emotional-cognitive
    models, and compensation methods for speaker and linguistic
    variability, to machine learning approaches applicable to emotion
    recognition. For each topic, we
    will review the state of the art by introducing current methods and
    presenting several applications. In particular, the application to
    cognitive load estimation will be discussed, from its
    psychophysiological origins to system design considerations.
    Eventually, technologies developed in different areas will be
    combined for future applications, so in addition to a survey of
    future research challenges, we will envision a few scenarios in
    which affective computing can make a difference.

    Organizers: Chung-Hsien Wu, Hsin-Min Wang, Julien Epps and
                Vidhyasaharan Sethu


ISCSLP-T3: Unsupervised Speech and Language Processing via Topic Models

    In this tutorial, we will present state-of-the-art machine learning
    approaches for speech and language processing, with a highlight on
    unsupervised methods for structural learning from unlabeled
    sequential patterns. In general, speech and language processing
    involves extensive knowledge of statistical models, and we need to
    design flexible, scalable and robust systems that cope with
    heterogeneous and non-stationary environments in the era of big
    data. This tutorial starts with an introduction to unsupervised
    speech and language processing based on factor analysis and
    independent component analysis. Unsupervised learning is then
    generalized to a latent variable model known as the topic model.
    The evolution of topic models from latent semantic analysis to the
    hierarchical Dirichlet process, from non-Bayesian parametric models
    to Bayesian nonparametric models, and from single-layer models to
    hierarchical tree models will be surveyed in an organized fashion.
    Inference approaches based on variational Bayes and Gibbs sampling
    will be introduced. We will also present several case studies on
    topic modeling for speech and language applications, including
    language modeling, document modeling, retrieval, segmentation and
    summarization. Finally, we will point out new trends in topic
    models for speech and language processing.
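
    As a tiny, self-contained example of the topic models surveyed
    here, the sketch below (Python with scikit-learn; the toy documents
    and the number of topics are illustrative) fits a latent Dirichlet
    allocation model with variational inference and prints the top
    words per topic:

        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        docs = [
            "speech recognition acoustic model training",
            "language model adaptation for dialogue systems",
            "music signal processing and audio retrieval",
            "audio event detection in noisy environments",
        ]

        # Topic models operate on bag-of-words count matrices.
        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(docs)

        # Fit a 2-topic LDA model (scikit-learn uses variational Bayes).
        lda = LatentDirichletAllocation(n_components=2, random_state=0)
        doc_topics = lda.fit_transform(X)   # per-document topic proportions

        terms = np.array(vectorizer.get_feature_names_out())
        for k, topic in enumerate(lda.components_):
            top = terms[topic.argsort()[::-1][:5]]
            print(f"topic {k}: {' '.join(top)}")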

    Organizer: Jen-Tzung Chien


ISCSLP-T4: Deep Learning for Speech Generation and Synthesis

    Deep learning, which can represent high-level abstractions in data
    with an architecture of multiple non-linear transformation, has
    made a huge impact on automatic speech recognition (ASR)
    research, products and services. However, deep learning for speech
    generation and synthesis (i.e., text-to-speech), the inverse
    process of speech recognition (i.e., speech-to-text), has not yet
    generated similar momentum. Recently, motivated by the success of
    Deep Neural Networks in speech recognition, several neural network
    based approaches have been applied successfully to improving the
    performance of statistical parametric speech generation and
    synthesis. In this
    tutorial, we focus on deep learning approaches to the problems in
    speech generation and synthesis, especially on Text-to-Speech (TTS)
    synthesis and voice conversion.

    First, we review the current mainstream of statistical parametric
    speech generation and synthesis, i.e., GMM-HMM based speech
    synthesis and GMM-based voice conversion, with emphasis on
    analyzing the major factors responsible for the quality problems
    in GMM-based synthesis and conversion and the intrinsic limitations
    of decision-tree based contextual state clustering and state-based
    statistical distribution modeling. We then present the latest deep
    learning algorithms for feature parameter trajectory generation, in
    contrast to deep learning for recognition or classification. We
    cover common technologies in Deep Neural Networks (DNN) and
    improved DNNs: Mixture Density Networks (MDN), Recurrent Neural
    Networks (RNN) with Bidirectional Long Short-Term Memory (BLSTM),
    and Conditional RBMs (CRBM). Finally, we share our research
    insights and hands-on experience in building speech generation and
    synthesis systems based upon deep learning algorithms.
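
    To illustrate the basic idea of DNN-based parametric synthesis
    described above, here is a minimal sketch (Python with numpy and
    scikit-learn; the feature dimensions and data are random stand-ins)
    that regresses per-frame acoustic parameters from linguistic
    context features with a feed-forward network, the mapping that
    decision-tree state clustering provides in GMM-HMM synthesis:

        import numpy as np
        from sklearn.neural_network import MLPRegressor

        rng = np.random.default_rng(0)

        # Random stand-ins for per-frame linguistic context features
        # (phone identity, position, prosodic context, ...) and the
        # acoustic parameters a vocoder needs (e.g., mel-cepstrum + F0).
        n_frames, n_linguistic, n_acoustic = 2000, 300, 60
        X = rng.normal(size=(n_frames, n_linguistic))
        Y = rng.normal(size=(n_frames, n_acoustic))

        # A plain feed-forward DNN regression from linguistic to acoustic
        # features; few iterations here just to keep the sketch quick.
        dnn = MLPRegressor(hidden_layer_sizes=(256, 256),
                           activation="relu", max_iter=20)
        dnn.fit(X, Y)

        # At synthesis time, predict acoustic trajectories for new
        # linguistic input and hand them to a vocoder (not shown).
        Y_hat = dnn.predict(X[:10])
        print(Y_hat.shape)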

    Organizers: Yao Qian and Frank K. Soong







