[gna-register] [task #1162] Submission of Language-Independent Text Breaker Module

Vee Satayamas Sat, 15 Jan 2005 20:33:36 +0100

This is an automated notification sent by Gna!.
It relates to:
                task #1162, project Gna! Administration


==============================================================================
 OVERVIEW of task #1162:
==============================================================================

URL:
  <http://gna.org/task/?func=detailitem&item_id=1162>

                 Summary: Submission of Language-Independent Text Breaker
Module
                 Project: Gna! Administration
            Submitted by: vee
            Submitted on: Sun 01/16/2005 at 02:33
                  Status: None
         Approval Status: None
         Should Start On: 
   Should be Finished on: 
                Category: Project Approval
                Priority: 5 - Normal
                 Privacy: Public
             Assigned to: None
        Percent Complete: 0%
             Open/Closed: Open
                  Effort: 0.00

    _______________________________________________________



Site Admin. Approval/Edition URL:
 <https://gna.org/admin/groupedit.php?group_id=890>


###### ORIGINAL SUBMISSION DETAILS ######

System Group Name:
-----------------
  textbreak


Full Name:
----------
  Language-Independent Text Breaker Module
  

Type:
-----
  Programs


License:
-------- 
  GNU Lesser General Public License


Other License: 
--------------
  


Description:
------------
  A text breaker plays an important role in text layout engines such as
Pango. However, these engines currently rely on language-dependent breakers,
for example pango-libthai.



The Language-Independent Text Breaker Module is a program that can break
(i.e. segment) multilingual text into units, such as words and sentences,
using a single engine.



Other projects provide excellent, usable multilingual text breakers. However,
they have the following three weak points.



1. They are large, so embedding them in other applications is hard.



2. Their segmentation methods are too specific, which hurts precision,
because natural language segmentation is not a trivial task.



3. They are implemented in C++, which is not as portable as pure C and
brings incompatibility problems.



The Language-Independent Text Breaker will be implemented using the paper ``A
Formalism for Universal Segmentation of Text'' as its core guide. We will
reuse components and experience from a language-specific text breaker, Thai
Word Segmentation (aka Thai word breaking), http://thaiwordseg.sourceforge.net.
It will be implemented in pure C under the LGPL/MPL. We will map every
possible segmentation onto a multi-level directed acyclic graph in which the
possible segments are edges; the solution is then found by computing the
shortest path. This representation is common to all languages. What differs
between languages is how edges are added to the graph. We provide three
distinct methods for adding edges: dictionary, rules and statistical methods.
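
The graph idea above can be illustrated with a minimal sketch. It is not
the project's code: every name in it (SegGraph, add_dictionary_edges,
add_fallback_edges, segment) is hypothetical, the edge costs (1 for a
dictionary word, 10 for a lone byte) are arbitrary assumptions, text is
treated as plain bytes, and the graph has a single level rather than the
multi-level graph described above. Because every edge points forward in the
text, visiting nodes in offset order is already a topological order, so the
shortest path can be found in one pass.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define MAX_LEN 64  /* longest input (in bytes) this sketch accepts */

/* Nodes are byte offsets 0..len. cost[i][j] is the cost of the segment
 * text[i..j); 0 means "no such edge". */
typedef struct {
    size_t len;
    int cost[MAX_LEN + 1][MAX_LEN + 1];
} SegGraph;

/* Keep the cheapest edge when several methods propose the same segment. */
static void add_edge(SegGraph *g, size_t from, size_t to, int cost)
{
    if (g->cost[from][to] == 0 || cost < g->cost[from][to])
        g->cost[from][to] = cost;
}

/* Dictionary method: each dictionary word found in the text is a cheap edge. */
static void add_dictionary_edges(SegGraph *g, const char *text,
                                 const char *const *dict, size_t ndict)
{
    for (size_t i = 0; i < g->len; i++)
        for (size_t d = 0; d < ndict; d++) {
            size_t wl = strlen(dict[d]);
            if (wl > 0 && i + wl <= g->len &&
                strncmp(text + i, dict[d], wl) == 0)
                add_edge(g, i, i + wl, 1);
        }
}

/* Toy rule method: any single byte may stand alone, at a high cost,
 * so that a path through the graph always exists. */
static void add_fallback_edges(SegGraph *g)
{
    for (size_t i = 0; i < g->len; i++)
        add_edge(g, i, i + 1, 10);
}

/* Shortest path from node 0 to node len. Writes the end offset of each
 * segment to ends[] and returns the segment count, or -1 on failure. */
static int shortest_segmentation(const SegGraph *g, size_t *ends,
                                 size_t max_ends)
{
    int dist[MAX_LEN + 1];
    size_t prev[MAX_LEN + 1];
    size_t n = 0, k, j;

    dist[0] = 0;
    for (j = 1; j <= g->len; j++) {
        dist[j] = INT_MAX;
        for (size_t i = 0; i < j; i++)
            if (g->cost[i][j] != 0 && dist[i] != INT_MAX &&
                dist[i] + g->cost[i][j] < dist[j]) {
                dist[j] = dist[i] + g->cost[i][j];
                prev[j] = i;
            }
    }
    if (dist[g->len] == INT_MAX)
        return -1;
    for (j = g->len; j > 0; j = prev[j])   /* count segments on the path */
        n++;
    if (n > max_ends)
        return -1;
    k = n;
    for (j = g->len; j > 0; j = prev[j])   /* fill ends[] back to front */
        ends[--k] = j;
    return (int)n;
}

/* Build the graph with both edge sources, then solve it. */
static int segment(const char *text, const char *const *dict, size_t ndict,
                   size_t *ends, size_t max_ends)
{
    SegGraph g;
    size_t len = strlen(text);

    if (len > MAX_LEN)
        return -1;
    memset(&g, 0, sizeof g);
    g.len = len;
    add_dictionary_edges(&g, text, dict, ndict);
    add_fallback_edges(&g);
    return shortest_segmentation(&g, ends, max_ends);
}
```

Under these assumptions, calling segment() on "thecatsat" with a dictionary
containing "the", "cat" and "sat" yields three segments ending at offsets 3,
6 and 9. Only the two add_*_edges functions know anything about the
language; the graph and the shortest-path search stay the same, which is the
property the proposal relies on.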



In brief, we will build a Language-Independent Text Breaker Module that can
segment text strings in any language with a single engine; that is, our
segmentation method will be flexible enough to cover them. Our implementation
will follow Julien Quint's work, A Formalism for Universal Segmentation of
Text. Moreover, it will be implemented in pure C under the LGPL/MPL. Finally,
it will be small enough to embed in other programs.


Other Software Required:
------------------------
  glib2


Other Comments:
---------------
  

#########################################







==============================================================================

This item URL is:
  <http://gna.org/task/?func=detailitem&item_id=1162>

_______________________________________________
  Message sent via/by Gna!
  http://gna.org/
