Re: [Ankur-core] XML standard for Ankur's Abhidhan
On May 13, 2009, at 10:57 PM, Deepayan Sarkar wrote:

> On 5/12/09, Salahuddin Pasha salahuddi...@gmail.com wrote:
>
>> Dear all,
>>
>> I have been working on XML support for অভিধান (Abhidhan), to enable various applications and tools to use our dictionary. Basic work is already done, but we need to define a standard XML format (XML DTD or XML Schema).
>>
>> Any suggestions or comments?
>
> Back in 2003, the bengalinux dictionary list had a discussion on this. Nothing ever came out of it, and when Golam first started on anubadok, his emphasis was more specialized. In any case, that discussion may provide some suggestions. You can get it from the list archives, and I'm also attaching a cleaned up and edited version of the thread here:
>
> ...
>
> From: Kaushik Ghose kgh...@wa... - 2003-05-16 15:07
>
> <?xml version="1.0"?>
> <!ELEMENT dictionary (entry*)>
> <!ELEMENT entry (word, info*)>
> <!ELEMENT word (#CDATA)>
> <!ELEMENT info (refer?, pron?, synonym?, antonym?, meaning?, grammar?)>
> <!ATTLIST info
>     pos (n|adj|v|adv) "n"
>     plural (true|false) "false"
>     origin CDATA #DEFAULT
>     date CDATA>
> <!ELEMENT refer (#CDATA)>
> <!ELEMENT pron (#CDATA)>
> <!ELEMENT synonym (#CDATA)>
> <!ATTLIST synonym lang CDATA #DEFAULT "bn">
> <!ELEMENT antonym (#CDATA)>
> <!ATTLIST antonym lang CDATA #DEFAULT "bn">
> <!ELEMENT meaning (#CDATA)>
> <!ATTLIST meaning lang CDATA #DEFAULT "bn">
> <!ELEMENT grammar (derivative?)>
> <!ELEMENT derivative (#CDATA)>
> <!ATTLIST derivative
>     form (the|of) "the"
>     num (singular|plural) "singular">
>
> Also, to answer Deepayan's question: by date I was thinking of date of origin, first use, etc.
>
> Will potter with Qt right now. I'm going to hardcode the DTD structure; I can't think of a simple way of creating an editor that will parse the DTD and configure the GUI on the fly. Fixed boxes for all the elements will be quicker for a DTD of this size.
>
> PS. Try the perl tool at http://www.sagehill.net/livedtd/download.html
>
> -kg
>
> </thread>

Dear Deepayan bhai,

Thank you for your mail. Here is an example of the present, updated format:

    <?xml version="1.0" encoding="utf-8"?>
    <dictionary>
      <search_results>
        <dict_entry>
          <bdict_id>68218</bdict_id>
          <en_word>apple</en_word>
          <pos_tag>Proper noun, singular</pos_tag>
          <penn_tag>NP</penn_tag>
          <bn_pronunciation></bn_pronunciation>
          <en_leema></en_leema>
          <bn_word>অ্যাপল</bn_word>
          <explanation></explanation>
          <example>উদাঃ</example>
          <status>EDITED</status>
        </dict_entry>
      </search_results>
    </dictionary>

Going by Deepayan bhai's mail, I think we still need to add the following fields. We will add them in a later version, as we do not have enough information for these fields now.

    origin="deshi"
    <synonyms>...</synonyms>
    <antonyms>...</antonyms>

    <entry>
      <info pos="noun" plural="false" origin="deshi">
        <synonyms>...</synonyms>
        <antonyms>...</antonyms>
      </info>
    </entry>

    <grammar>
      <derivative form="the">chhaanaaTaa,chhaanaaTi</derivative>
      <derivative form="of" num="singular">chhaanaaTir</derivative>
      <derivative form="of" num="plural">chhaanaader</derivative>
    </grammar>

Another question is which would be better for us: using the grammar tag and storing the information in nested tags, or the plain layout of the present updated example?

Regards,
Salahuddin

___
Bengalinux-core mailing list
Bengalinux-core@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bengalinux-core
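One way to weigh the flat layout against the nested grammar layout is to look at what a consumer has to do to read each. The following is only a sketch using Python's standard xml.etree.ElementTree; the function names are made up for illustration, and placing grammar inside info follows the 2003 DTD draft rather than anything decided in this thread.

```python
import xml.etree.ElementTree as ET

# Flat layout, as in the updated dict_entry example.
FLAT = """<dict_entry>
  <en_word>apple</en_word>
  <penn_tag>NP</penn_tag>
  <bn_word>অ্যাপল</bn_word>
</dict_entry>"""

# Nested layout, with derivatives grouped under <grammar>.
NESTED = """<entry>
  <word>chhaanaa</word>
  <info pos="noun" origin="deshi">
    <grammar>
      <derivative form="the">chhaanaaTaa,chhaanaaTi</derivative>
      <derivative form="of" num="singular">chhaanaaTir</derivative>
      <derivative form="of" num="plural">chhaanaader</derivative>
    </grammar>
  </info>
</entry>"""


def read_flat(xml_text):
    """Flat layout: every field is a direct child, so one dict is enough."""
    entry = ET.fromstring(xml_text)
    return {child.tag: (child.text or "") for child in entry}


def read_derivatives(xml_text):
    """Nested layout: derivatives are keyed by their form/num attributes."""
    entry = ET.fromstring(xml_text)
    result = {}
    for d in entry.iter("derivative"):
        # No DTD is supplied here, so an omitted num is read as None
        # rather than any default declared in a DTD.
        key = (d.get("form"), d.get("num"))
        result[key] = [w.strip() for w in (d.text or "").split(",")]
    return result
```

The flat form is trivial to read but has no natural place for several derivatives of one word; the nested form costs a few more lines in the reader and keeps form and number machine-readable.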
Re: [Ankur-core] XML standard for Ankur's Abhidhan
You might also find it helpful to look at the Apertium dictionary format, which is also standard XML. Here is the link to the svn for the Nepalese language (it's the closest language to Bengali we have in Apertium so far, and the Bengali pair is far from finished :( ): http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-bn-en/

I have been working to find some standard tag sets for the Bengali language. So far I am also getting by with the Penn Treebank tagset, but in the future I might need to extend it to meet my project requirements. *However, I believe the Penn Treebank tagset to be sufficient for a general purpose dictionary format.*

The attached file contains the Penn Treebank tagset and also the bilingual dictionary format from Apertium.

What I'd like to propose is that, instead of using <pos_tag>Verb, non-3rd person singular present</pos_tag>, you could create some definitions like verb, person, number and tense, and then use them as properties of the specific entry. It would be easier to parse in the future.

On Wed, May 13, 2009 at 8:02 AM, Golam Mortuza Hossain gmhoss...@gmail.com wrote:

> Hi,
>
> On Tue, May 12, 2009 at 5:13 PM, Salahuddin Pasha salahuddi...@gmail.com wrote:
>
>> Basic work is already done, but we need to define a standard XML format (XML DTD or XML Schema).
>>
>> Example: test XML output.
>>
>> <?xml version="1.0" encoding="utf-8"?>
>> <dictionary>
>>   <search_results>
>>     <dict_entry id="1">
>>       <en_word>read</en_word>
>>       <pos_tag>Noun, singular or mass</pos_tag>
>
> Thanks a lot for your work.
>
> I would suggest that you also try to have an entry for the PennTag for parts of speech (pos), like NN, VV etc. So something like
>
> <penn_tag>NN</penn_tag>
>
> This would be needed if the Anubadok Online interface needs to update its database using your XML gateway to the Ankur dictionary database.
>
> Cheers,
> Golam

--
Regards,
Abu Zaher Md. Faridee
http://zaher14.blogspot.com/
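The proposal to replace a free-text pos_tag with separate properties could look roughly like this. It is a sketch only: the property names pos, person, number and tense come from the message, while the PENN_FEATURES table, its values, the extra "form" and "proper" keys and the make_entry helper are hypothetical choices, written with Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Penn Treebank tags to separate properties.
PENN_FEATURES = {
    "NN": {"pos": "noun", "number": "singular"},
    "NNP": {"pos": "noun", "number": "singular", "proper": "true"},
    "VB": {"pos": "verb", "form": "base"},
    "VBP": {
        "pos": "verb",
        "person": "non-3rd",
        "number": "singular",
        "tense": "present",
    },
}


def make_entry(en_word, bn_word, penn_tag):
    """Build a dict_entry whose pos_tag carries attributes instead of prose."""
    entry = ET.Element("dict_entry")
    ET.SubElement(entry, "en_word").text = en_word
    # Copy the dict so entries never share one attribute mapping.
    ET.SubElement(entry, "pos_tag", dict(PENN_FEATURES[penn_tag]))
    ET.SubElement(entry, "bn_word").text = bn_word
    return entry
```

A consumer can then test `pos_tag.get("tense") == "present"` directly instead of parsing the string "Verb, non-3rd person singular present".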
Re: [Ankur-core] XML standard for Ankur's Abhidhan
On 5/12/09, Salahuddin Pasha salahuddi...@gmail.com wrote:

> Dear all,
>
> I have been working on XML support for অভিধান (Abhidhan), to enable various applications and tools to use our dictionary. Basic work is already done, but we need to define a standard XML format (XML DTD or XML Schema).
>
> Any suggestions or comments?

Back in 2003, the bengalinux dictionary list had a discussion on this. Nothing ever came out of it, and when Golam first started on anubadok, his emphasis was more specialized. In any case, that discussion may provide some suggestions. You can get it from the list archives, and I'm also attaching a cleaned up and edited version of the thread here:

<thread from May 2003>

[Ankur-dictionary] dictionary.dtd
From: Kaushik Ghose kgh...@wa... - 2003-05-14 04:17

Hi, here is the descriptor file. I'm new to XML and DTDs, so please go over the semantics as well as the syntax and see if this serves our purpose...

    <?xml version="1.0"?>
    <!ELEMENT entry*(word_bn, info_bn*)>
    <!ELEMENT word_bn (#CDATA)>
    <!ELEMENT info_bn (english, pronounciation_bn, meaning_bn)>
    <!ELEMENT english (#CDATA)>
    <!ELEMENT pronounciation_bn (#CDATA)>
    <!ELEMENT meaning_bn (#CDATA)>

thanks
-kg

From: Kaushik Ghose kgh...@wa... - 2003-05-14 05:12

OK, a small correction. Qt's DOM class seems to parse this correctly.

dictionary.dtd

    <?xml version="1.0"?>
    <!ELEMENT dictionary (entry*)>
    <!ELEMENT entry (word_bn, info_bn*)>
    <!ELEMENT word_bn (#CDATA)>
    <!ELEMENT info_bn (english?, pronounciation_bn?, meaning_bn?)>
    <!ELEMENT english (#CDATA)>
    <!ELEMENT pronounciation_bn (#CDATA)>
    <!ELEMENT meaning_bn (#CDATA)>

test.xml

    <?xml version="1.0"?>
    <!DOCTYPE entry SYSTEM "dictionary.dtd">
    <dictionary>
      <entry>
        <word_bn>? ???</word_bn>
        <info_bn>
          <english>seedling</english>
          <pronounciation_bn>ankur</pronounciation_bn>
          <meaning_bn>??? ??? ?? ??</meaning_bn>
        </info_bn>
      </entry>
      <entry>
        <word_bn>? ?</word_bn>
        <info_bn>
          <english>bangla</english>
          <pronounciation_bn>bangla</pronounciation_bn>
          <meaning_bn>??? ? ,? ??? ? ?</meaning_bn>
        </info_bn>
        <info_bn>
          <english>bengali</english>
        </info_bn>
      </entry>
    </dictionary>

thanks
-kg

From: Deepayan Sarkar deepa...@st... - 2003-05-14 07:03

Ha! A friend of mine once corrected me on this, now I can correct someone else :) 'pronounciation' should be spelled 'pronunciation'.

I'm not an expert on DTDs (though I know someone who knows much more, whom I can ask after we make some progress). I find it very difficult to understand DTDs, and much easier to understand examples of what the final thing would look like. Let's work that way, and we can write out the DTD once we decide on the 'look'.

I don't know if you know this, but there's something called attributes which might be useful, for instance with multiple meanings as different parts of speech. Here's an example (I'm using slightly different tags); 'pos' is part of speech, 'plural' is whether the word has a plural form, etc.:

    <entry>
      <word>chhaanaa</word>
      <info pos="noun" plural="false" origin="deshi">
        <meaning>dudh theke toiri ek dhoroner ...</meaning>
        <synonyms>...</synonyms>
        <antonyms>...</antonyms>  ## ???
        <translation lang="en">cottage cheese (?)</translation>
        <pronunciation>chhaanaa</pronunciation>
      </info>
      <info pos="noun" origin="tatbhabo">  # it's probably not, but...
        <meaning>shishu, bachchaa</meaning>
        <translation lang="en">child, young</translation>  # comma separated
        <translation lang="hn">bachcha</translation>  # hindi is hn ? not sure
        <pronunciation>chhaanaa</pronunciation>
        <derivative form="the">chhaanaaTaa, chhaanaaTi</derivative>
        <derivative form="of" num="singular">chhaanaaTir</derivative>
        <derivative form="of" num="plural">chhaanaader</derivative>
      </info>
    </entry>

(I've used romanized Bengali in place of what should be Bengali, but you get the idea.)

I think we should handle derivative words here (and not have separate entries for them.
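To see what the attribute-based entry buys a consumer, here is a minimal sketch that pulls translations per sense and per language out of a trimmed copy of the chhaanaa example. It assumes Python's standard xml.etree.ElementTree; the translations function is a hypothetical helper, and splitting on commas follows the "comma separated" remark in the example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example entry: two senses, each an <info> block.
ENTRY = """<entry>
  <word>chhaanaa</word>
  <info pos="noun" plural="false" origin="deshi">
    <translation lang="en">cottage cheese (?)</translation>
  </info>
  <info pos="noun" origin="tatbhabo">
    <translation lang="en">child, young</translation>
    <translation lang="hn">bachcha</translation>
  </info>
</entry>"""


def translations(entry_xml, lang):
    """Return one list of translations per <info> sense for one language."""
    entry = ET.fromstring(entry_xml)
    senses = []
    for info in entry.findall("info"):
        words = []
        for t in info.findall("translation"):
            if t.get("lang") == lang:
                # Translations are comma separated inside one element.
                words.extend(w.strip() for w in (t.text or "").split(","))
        senses.append(words)
    return senses
```

Because each sense is its own info element, a sense with no translation in the requested language simply yields an empty list rather than being merged with its neighbour.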
[Ankur-core] XML standard for Ankur's Abhidhan
Dear all,

I have been working on XML support for অভিধান (Abhidhan), to enable various applications and tools to use our dictionary. Basic work is already done, but we need to define a standard XML format (XML DTD or XML Schema).

Any suggestions or comments?

Example: test XML output.

    <?xml version="1.0" encoding="utf-8"?>
    <dictionary>
      <search_results>
        <dict_entry id="1">
          <en_word>read</en_word>
          <pos_tag>Noun, singular or mass</pos_tag>
          <bn_word>পড়া</bn_word>
        </dict_entry>
        <dict_entry id="2">
          <en_word>read</en_word>
          <pos_tag>Verb, base form</pos_tag>
          <bn_word>পড়া</bn_word>
        </dict_entry>
        <dict_entry id="3">
          <en_word>read</en_word>
          <bn_pronunciation>উচ্চাঃ রীড</bn_pronunciation>
          <pos_tag>Verb, non-3rd person singular present</pos_tag>
          <bn_word>পাঠ করা</bn_word>
        </dict_entry>
      </search_results>
    </dictionary>

Regards,
Salahuddin
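For reference, output of this shape can be produced without hand-writing tags. The sketch below assumes Python 3.8 or later and the standard xml.etree.ElementTree module; build_results and its row format are hypothetical and say nothing about how Abhidhan actually generates its XML.

```python
import xml.etree.ElementTree as ET


def build_results(rows):
    """Serialize (id, en_word, pos_tag, bn_word) tuples as UTF-8 XML bytes."""
    root = ET.Element("dictionary")
    results = ET.SubElement(root, "search_results")
    for entry_id, en_word, pos_tag, bn_word in rows:
        entry = ET.SubElement(results, "dict_entry", {"id": str(entry_id)})
        ET.SubElement(entry, "en_word").text = en_word
        ET.SubElement(entry, "pos_tag").text = pos_tag
        ET.SubElement(entry, "bn_word").text = bn_word
    # xml_declaration=True emits the <?xml ...?> header with the encoding.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

Building the tree through the library also takes care of escaping characters such as & and < inside words, which string concatenation would not.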
Re: [Ankur-core] XML standard for Ankur's Abhidhan
Hi,

On Tue, May 12, 2009 at 5:13 PM, Salahuddin Pasha salahuddi...@gmail.com wrote:

> Basic work is already done, but we need to define a standard XML format (XML DTD or XML Schema).
>
> Example: test XML output.
>
> <?xml version="1.0" encoding="utf-8"?>
> <dictionary>
>   <search_results>
>     <dict_entry id="1">
>       <en_word>read</en_word>
>       <pos_tag>Noun, singular or mass</pos_tag>

Thanks a lot for your work.

I would suggest that you also try to have an entry for the PennTag for parts of speech (pos), like NN, VV etc. So something like

    <penn_tag>NN</penn_tag>

This would be needed if the Anubadok Online interface needs to update its database using your XML gateway to the Ankur dictionary database.

Cheers,
Golam
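Adding the suggested penn_tag element could be done as a post-processing step over existing output. This is a sketch under assumptions: it uses Python's standard xml.etree.ElementTree, the add_penn_tags helper and the lookup table are hypothetical, and the codes are the standard Penn Treebank ones (for example NNP for a singular proper noun), which may differ from the tag variant a consumer such as Anubadok expects.

```python
import xml.etree.ElementTree as ET

# Descriptions as used in the example output, mapped to Penn Treebank tags.
DESCRIPTION_TO_PENN = {
    "Noun, singular or mass": "NN",
    "Proper noun, singular": "NNP",
    "Verb, base form": "VB",
    "Verb, non-3rd person singular present": "VBP",
}


def add_penn_tags(root):
    """Insert <penn_tag> right after each <pos_tag> with a known description.

    Returns the number of elements added; entries that already have a
    penn_tag, or whose description is unknown, are left untouched.
    """
    added = 0
    # Materialize the list first, since entries are modified in the loop.
    for entry in list(root.iter("dict_entry")):
        pos = entry.find("pos_tag")
        if pos is None or entry.find("penn_tag") is not None:
            continue
        code = DESCRIPTION_TO_PENN.get((pos.text or "").strip())
        if code is None:
            continue
        tag = ET.Element("penn_tag")
        tag.text = code
        entry.insert(list(entry).index(pos) + 1, tag)
        added += 1
    return added
```

Running it twice on the same tree adds nothing the second time, so it is safe to apply to output that is already partly tagged.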