Re: Please respond – A better way to organize the Quranic Arabic Corpus dictionary for version 0.3?

Mohammad Mostafa Tue, 30 Nov 2010 08:16:30 -0800

Dear Kais:

Thank you for the effort you have put into this project. I think the next 
step is to add the following:


- Phrase usage. Links to usages.
- Holy Quranic Ontology. Showing relationship between words.


If you can build an ontology as part of this effort, the project will become 
the best core for all Arabic NLP researchers to start with and extend. Please 
keep up the good work.


  ----- Original Message ----- 
  From: Kais Dukes 
  To: comp-quran@comp.leeds.ac.uk ; yasirja...@gmail.com ; tku....@gmail.com 
  Sent: Tuesday, November 30, 2010 5:41 PM
  Subject: Please respond – A better way to organize the Quranic Arabic Corpus 
dictionary for version 0.3?


  PLEASE HIT "REPLY ALL" WHEN RESPONDING TO THIS E-MAIL – THANKS!



  Hello All,




  Hopefully sometime over the next few weeks, there will be an updated version 
of the Quranic Arabic Corpus (version 0.3 - see below). I am hoping to get 
people's feedback on this upcoming release, but also on a specific idea. 



  My question is – do you think we can better organize the Quranic Arabic 
Corpus dictionary? To be honest, this is more of a concordance. Please see:



  http://corpus.quran.com/qurandictionary.jsp



  At the moment, the way the dictionary page works is that you specify a root, 
and then you get back a list of words. The word list for a specific root is 
organized by form, then by part-of-speech (noun or verb) and then by person, 
gender and number. If you click on a specific word form, you get taken to that 
verse in the Quran.



  Although this was a good starting point, I would be keen to better organize 
this to be more like a dictionary. How about the following suggestion. We still 
keep the top-level as root, but we then make the next subdivision to be lemma. 
Under different lemmas we can show different forms of inflection.
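
  As a rough illustration of the proposed layout, the grouping could be 
sketched as below. This is only a sketch under stated assumptions: the tuple 
layout, the function names build_dictionary and render, and the sample 
entries are all hypothetical placeholders, not the corpus data format or the 
website's actual code.

```python
from collections import defaultdict

def build_dictionary(entries):
    """Group (root, lemma, form, location) tuples as root -> lemma -> form -> [locations].

    The tuple layout is an assumption made for this sketch only.
    """
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for root, lemma, form, location in entries:
        index[root][lemma][form].append(location)
    return index

def render(index):
    """Return indented text lines: root at the top level, then lemma, then inflected forms."""
    lines = []
    for root in sorted(index):
        lines.append(f"Root: {root}")
        for lemma in sorted(index[root]):
            lines.append(f"  Lemma: {lemma}")
            for form in sorted(index[root][lemma]):
                locations = ", ".join(index[root][lemma][form])
                lines.append(f"    {form}: {locations}")
    return lines

# Placeholder entries, not real corpus data.
SAMPLE = [
    ("root-1", "lemma-a", "form-x", "1:1:1"),
    ("root-1", "lemma-a", "form-y", "1:2:1"),
    ("root-1", "lemma-b", "form-z", "2:1:1"),
    ("root-1", "lemma-a", "form-x", "3:1:1"),
]

if __name__ == "__main__":
    print("\n".join(render(build_dictionary(SAMPLE))))
```

  Keeping the locations at the innermost level means each inflected form can 
still link through to its verses, as the current page does.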



  Also what about website navigation and hyperlinks for the dictionary, any 
ideas?




  I’m really keen to improve the dictionary - the audience I have in mind is 
everyday users of the website who are mostly people wanting to learn Arabic 
specifically with the intent of understanding the original text of the Quran.



  It would also be great to get feedback on the web pages which show lists of 
lemmas and verbs, e.g.



  http://corpus.quran.com/verbs.jsp

  http://corpus.quran.com/lemmas.jsp



  Please note that I’m not looking to add any new information to the corpus at 
the moment, just a reorganization of the data to make things more readable and 
accessible for our average user.



  PLEASE HIT "REPLY ALL" WHEN RESPONDING TO THIS E-MAIL – THANKS!



  ========================================




  RELEASE NOTES -   Quranic Arabic Corpus version 0.3



  The Quranic Arabic Corpus (http://corpus.quran.com) is an international 
collaborative linguistic project initiated at the University of Leeds that aims 
to bridge the gap between the traditional Arabic grammar of i'rab and 
techniques from modern computational linguistics. This open source resource 
includes word-by-word part-of-speech tagging for the Quran, morphological 
segmentation and a formal representation of Quranic Arabic syntax using 
dependency graphs. Version 0.3 of the corpus includes a number of significant 
improvements over the previous 0.2 release:



  Increased coverage for the syntactic treebank. The treebank now covers 30% of 
the Quran by word count (hence the version 0.3 release number). The syntactic 
treebank provides annotation using dependency grammar for chapters 1-5 and 
59-114, covering 23,292 out of 77,430 words in the Quran. The treebank also 
includes a revised set of non-terminal phrase tags for nominal sentences 
(jumlah ismiyah), verbal sentences (jumlah fi'liyah), and conditional sentences 
(jumlah shartiyah).
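
  The 30% figure follows from the two word counts quoted above. A minimal 
check, where the constant and function names are chosen just for this sketch:

```python
# Word counts as quoted in the release notes above.
TREEBANK_WORDS = 23_292
TOTAL_WORDS = 77_430

def coverage_percent(covered, total):
    """Return the share of `total` words that are `covered`, as a percentage."""
    return 100.0 * covered / total

if __name__ == "__main__":
    # Prints the treebank coverage rounded to one decimal place.
    print(f"{coverage_percent(TREEBANK_WORDS, TOTAL_WORDS):.1f}%")
```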



  Improved accuracy for tagging and morphological analysis covering 100% of the 
Quranic text. Following online collaboration by volunteer annotators, the 
part-of-speech tags and morphological analyses for over 500 words have been 
reviewed in detail and cross checked against traditional sources of Arabic 
grammar, resulting in further improvements to the accuracy of the annotated 
resource.



  More consistent morphological segmentation. Each of the 77,430 words in the 
Quran has been automatically segmented, resulting in 128,068 distinct 
morphemes. In accordance with traditional Arabic grammar, each morpheme has 
been separately tagged for part-of-speech and multiple morphological features 
including noun case and verb mood, gender, number and person. The improved 
segmentation used in version 0.3 of the corpus is more consistent with i'rab. 
For example, the suffixed nun of emphasis (nun l-tawkeed) is now correctly 
analysed as a separate morphological segment.



  High-resolution vector graphics for the Quranic script are now used to display 
Arabic words in dependency graphs, replacing the previous use of glyph-based 
fonts. The script is now based on electronic scans developed by the Quran 
Printing Complex. This has resulted in improved typographic accuracy for the 
Arabic words displayed in the syntactic treebank, most notably for ligatures, 
verse pause marks, and diacritic alignment. Previously a TrueType font was used 
to render Arabic words in dependency graphs, which did not always accurately 
represent the intricacies of the Quranic Uthmani script.



  An extended tagset with finer grained part-of-speech tags including INT - 
particle of interpretation (ḥarf tafseer), CIRC - for the circumstantial usage 
of the particle waw (waw l-haliyah), COM - for the comitative usage of the 
particle waw (waw l-ma'iyah) and RSLT - for the result usage of the particle 
fa. In addition, for better consistency with traditional Arabic grammar, the 
NUM tag has been replaced for numerical words with ADJ (adjective) or N (noun) 
tags, depending on syntactic function and context.



  Better natural language generation for automatic summaries of linguistic 
annotation. For example, when a first person object pronoun suffix is 
represented only by a terminal kasrah diacritic (instead of the more usual ya 
suffix), this is now correctly mentioned in the word-by-word annotation 
displayed online.



  Links to updated academic publications on the Quranic Arabic Corpus: 2 LREC 
papers, INFOS 2010 paper, a FAL book chapter, and a submission to LRE Journal, 
together with a link to an online review of the Quranic Arabic Corpus at 
Examiner.com. The full versions of these papers are now available as PDF 
downloads from the Quranic Arabic Corpus website. These publications and 
articles explain in detail the original research contributions of the Quranic 
Arabic Corpus project.



  Improved online documentation for the corpus, and additional sections in the 
online annotation guidelines, most notably a new detailed section on the 
different types of verb forms in Quranic Arabic morphology.



  Enhanced morphological search for the Quran, including the ability to search 
on additional part-of-speech tags and linguistic features.



  Version 0.3 of the reviewed morphologically annotated data is freely 
available for download from the Quranic Arabic Corpus website.



  The Quranic Arabic Corpus is an open source project. Contributions or 
questions about the research are more than welcome. Please direct any 
correspondence to Kais Dukes, PhD researcher at the School of Computing, 
University of Leeds:



  web: www.kaisdukes.com

  e-mail: s...@leeds.ac.uk




  END RELEASE NOTES





  ========================================
