Re: [ilug-cal] dictionary project

Raja Guha Sat, 28 Sep 2002 12:34:10 -0700

A modest proposal
-----------------

I have been following this thread for some time and, seeing the youthful 
enthusiasm, find it necessary to add my 2p worth.


I would strongly urge you to talk to existing printed dictionary 
publishers and try to convince one of them to give or sell their 
Linotype files to you.  There might be dictionaries that are out of 
print or w/ expired copyrights.  It should not be too much work to write 
a script to convert Linotype to Unicode - if such does not already exist.

Some stats:
-----------
An illiterate speaker has a vocabulary of 2,000 words.

An average high school grad. has a vocabulary of 5,000 words.

An educated person (i.e. most computer users in India) has a vocabulary 
of 10,000 words.

A typical English pocket dictionary contains 60,000 words.

Since you can expect your users to abandon the dictionary after 2 or 3 
attempts if it fails to list the words they seek, you will need to create 
a dictionary that has a minimum of 20,000 to 30,000 words before you can 
foist it on the public.

Some simple estimates:
---------------------
Assuming 1 hour per word for research and typing (that's a pure guess; I 
believe it may be optimistic), the 20,000-word minimum comes to 20,000 
hours. Assuming 200 spare-time hours per man-year (remember, a man-year 
for a regular day job is about 2000 hours), we are looking at 100 
man-years minimum.  That is 100 persons dedicatedly working for a little 
over 1 hour *every* other day for a whole year. Or 50 for 2 years. Or 1 
for 100 years.
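
The arithmetic above can be sketched in a few lines of Python. The inputs are the figures from this post (1 hour per word, 200 spare-time hours per man-year, 20,000 to 30,000 words); the function names are made up for illustration, and the numbers remain guesses.

```python
# Back-of-the-envelope effort estimate using the figures quoted in the post.
HOURS_PER_WORD = 1              # the "pure guess" for research and typing
SPARE_HOURS_PER_MAN_YEAR = 200  # volunteer time, not a 2000-hour day job


def man_years(words, hours_per_word=HOURS_PER_WORD,
              hours_per_man_year=SPARE_HOURS_PER_MAN_YEAR):
    """Total effort in spare-time man-years for a dictionary of `words` entries."""
    return words * hours_per_word / hours_per_man_year


def calendar_years(words, volunteers):
    """Elapsed years if `volunteers` people each put in one spare-time man-year per year."""
    return man_years(words) / volunteers


if __name__ == "__main__":
    for words in (20_000, 30_000):
        print(words, "words:", man_years(words), "man-years")
    print("50 volunteers, 20,000 words:", calendar_years(20_000, 50), "years")
```

Changing the hours-per-word guess scales everything linearly, so halving it still leaves 50 man-years for the smallest useful dictionary.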

Now, do you guys have that kind of dedication/manpower/time?

For that kind of effort you are better off trying to adapt an open-source 
OCR and scanning in an old dictionary!

Then again you might want to do the project without actually thinking of 
  reaching

raja




Kaushik Ghose wrote:
> Hi Joy,
> Yee hah !
> Ok, let's get the ball rolling, to address your issues and some more...
> 
> 1. Timeline
> OK, I'm no good at timelines, but I'd say 1 month to write/find the code
> to run the dictionary page and to debug it. I have a leetle idea of CGI
> scripts; I could read up about CGI (this is busy season here at the lab
> now, so no promises) and write the code to generate the pages. If there
> are Java/PHP experts out there who could do the thing _now_, all the
> better.
> Volunteers !
> 
> 2. Format/storage
> We should plan on one "central" format from which we could quickly convert
> to others. I would say 90% of the job is to agree on a standard.
> I would propose XML. I have little knowledge of the details, but I assume
> this is the best for structured data (like this dic) that needs to be
> available cross-platform and, at a pinch, human readable.
> 
> so I would propose the basic word unit to be as:
> 
> <word bangla="...">
>  <bangla pronounciation="...">
>  <bangla meaning="...">
>  <bangla synonym="...">
>  <bangla synonym="...">
>  ....
>  <bangla note="...">   (this may be etymology etc. etc.)
>  <english translation="...">
>  <english pronounciation="..">
>  <english meaning="...">
>  <english synonym="...">
>  <english synonym="...">
>  ....
>  <english note="...">
> </word>
> 
> So the primary key would be bangla, but we'd be able to do a reverse dic
> by generating an XML file ordered by the <english translation="...">
> field to make an english->bangla dict.
> 
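
A minimal sketch of the reverse-dictionary idea, using Python's standard xml.etree.ElementTree. The <dictionary> wrapper element, the self-closed child elements and the two sample words are assumptions made for illustration; the proposal leaves those details open.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the spirit of the proposed format. Child elements are
# self-closed and wrapped in a single root so that the text parses as XML.
SAMPLE = """
<dictionary>
  <word bangla="জল">
    <bangla meaning="..."/>
    <english translation="water"/>
  </word>
  <word bangla="আকাশ">
    <bangla meaning="..."/>
    <english translation="sky"/>
  </word>
</dictionary>
"""


def reverse_index(xml_text):
    """Return (english, bangla) pairs sorted by the English translation."""
    root = ET.fromstring(xml_text)
    pairs = []
    for word in root.findall("word"):
        bangla = word.get("bangla")
        for eng in word.findall("english"):
            translation = eng.get("translation")
            if translation is not None:  # skip <english note=...> and the like
                pairs.append((translation, bangla))
    return sorted(pairs)


if __name__ == "__main__":
    for english, bangla in reverse_index(SAMPLE):
        print(english, "->", bangla)
```

Building the reverse listing on demand like this means only the bangla-keyed file has to be maintained by hand.
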
> We could have several XML files each corresponding to one bangla letter,
> but memory and processing power is cheap nowadays...
> 
> Anyone who's conversant with XML, is this valid ?
> 
> Additions, subtractions, possible pitfalls ?
> 
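
On the validity question: as written, the child elements in the example are opened but never closed, so an XML parser would reject the entry as not well-formed. Self-closing them (<bangla meaning="..."/>) is one way to repair that. Strictly, "valid" also means checked against a DTD or schema, which has not been written yet; the sketch below only tests well-formedness with Python's standard library, and the two one-line strings are shortened stand-ins for the full entry.

```python
import xml.etree.ElementTree as ET

# Shortened stand-ins for the proposed entry: child tags left open, as posted,
# versus the same tags self-closed.
AS_POSTED = '<word bangla="..."><bangla meaning="..."><english translation="..."></word>'
SELF_CLOSED = '<word bangla="..."><bangla meaning="..."/><english translation="..."/></word>'


def is_well_formed(xml_text):
    """True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print("as posted:  ", is_well_formed(AS_POSTED))
    print("self-closed:", is_well_formed(SELF_CLOSED))
```
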
> 
> 3. Fonts/ webpage/ website
> We should generate the pages in Unicode, and have a link to Sayamindu's
> font, in case the user hasn't yet got an OTF font that supports bangla.
> 
> So we'd have the main interface simply as
> 
> translate ________ [go]
> synonyms _________ [go]
> look for ________ [go]
> [browse dictionary]
> [contribute]
> 
> translate will look for correspondence in the XML file for a
> bangla/english word and display all the gory details in all the fields
> 
> synonyms will go through the "synonyms" field and return the list of
> synonyms along with their full "translate" data (for each)
> 
> look for will go through the "meaning" field and return words that contain
> this keyword in the "meaning" field
> 
> [browse] will act like a paper dictionary, you'll have a clickable
> alphabar which'll take you to the respective pages
> 
> [contribute] will take you to a page that'll present a new bangla word and
> you have to fill up the fields as best you can.
> This will then go into the dictionary.
> 
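
The three lookups described above could be sketched as follows. This is an in-memory illustration only: the Entry class, the sample words and the exact matching rules (exact headword match for translate, case-insensitive substring match for look for) are assumptions, not part of the proposal.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One dictionary word; a hypothetical in-memory form of a <word> record."""
    bangla: str
    english: str
    meaning: str = ""
    synonyms: list = field(default_factory=list)


# Made-up sample data for illustration.
ENTRIES = [
    Entry("জল", "water", "a clear liquid that falls as rain", ["পানি"]),
    Entry("পানি", "water", "a clear liquid for drinking", ["জল"]),
    Entry("আকাশ", "sky", "the space above the earth"),
]


def translate(query, entries=ENTRIES):
    """Entries whose bangla or english headword equals the query."""
    return [e for e in entries if query in (e.bangla, e.english)]


def synonyms(query, entries=ENTRIES):
    """Full entries for every synonym listed under the matching words."""
    wanted = {s for e in translate(query, entries) for s in e.synonyms}
    return [e for e in entries if e.bangla in wanted]


def look_for(keyword, entries=ENTRIES):
    """Entries whose meaning text contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [e for e in entries if needle in e.meaning.lower()]
```

A linear scan like this is fine for a few tens of thousands of entries; an index keyed on headword would only matter if the word list grew much larger.
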
> 4. CD
> Yes, basically the same code that runs the website could be bundled along
> with the database of words, all the user then needs is a browser with
> unicode/otf support.
> 
> We could also write a small app to be a standalone dict. 
> 
> The first option would be quicker, less buggy and more cross-platform.
> The second would free the user from having to hunt down something that
> provides true OTF/Unicode support.
> 
> Let's do both... 
> 
> 5. OpenOffice
> Yes, I'm not conversant with their format, but XML is good...
> 
> 
> -kaushik
> 
> 
> 
> --
> To unsubscribe, send mail to [EMAIL PROTECTED] with the body
> "unsubscribe ilug-cal" and an empty subject line.
> FAQ: http://www.ilug-cal.org/help/faq_list.html
> 
> 


-- 
Raja Guha
---------
Paranoids are people, too; they have their own problems.  It's easy to
criticize, but if everybody hated you, you'd be paranoid too.
                -- D. J. Hicks


