Hello Eric,

Some very exciting news … well at least exciting to me :-) Please accept my
apologies for not being very responsive on e-mail recently, but I had locked
myself in my study most evenings after coming home from work to concentrate
on something that I have found most interesting. For the past 12 months,
development of the Quranic Arabic Dependency Treebank (
http://corpus.quran.com/treebank.jsp) has been slow, and has involved me going
through the following steps repeatedly:


1. Use a hand-written rule-based parser to produce an initial draft
syntactic analysis of a verse of the Quran, e.g. see:
http://corpus.quran.com/treebank.jsp?chapter=67


2. Correct the output of the parser and add the resulting proofread verse to
the treebank.


3. Potentially improve the parser’s accuracy by reviewing its rules against
the new larger set of data in the Treebank.


Improving the hand-written parser has been a costly exercise, involving the
addition of new grammar rules and refining these many times over. However,
the parser has performed well. Run against the current draft treebank
covering approx. 20% of the Quran, the rule-based parser is 78.79% accurate
in terms of its automatic grammatical analysis using traditional Arabic
dependency grammar:


*Rule-based parser ... F-measure 78.79%* (precision=90.13%, recall=69.99%)
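
For readers checking the figures: the F-measure quoted above is consistent with the balanced F-measure (F1), i.e. the harmonic mean of precision and recall. The sketch below assumes that standard definition; the function name `f_measure` is illustrative only and is not taken from the parser's code.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: this is the standard F1 definition; the email does not
    state which F-measure variant was used.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the rule-based parser (in percent).
print(f"{f_measure(90.13, 69.99):.2f}")  # 78.79
```

The low recall (69.99%) is what pulls the combined score down, even though precision is above 90%.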


Over the last few weeks I have been looking into moving away from the
rule-based parser and starting to use a probabilistic parser, trained
statistically via machine learning. This new parser reads the
existing treebank and "learns" how to perform syntactic analysis for the
rest of the Quran automatically. I am very excited to announce
that I have found a way to recast the problem of syntactic analysis in
traditional Arabic grammar as a statistical classification problem
(following a similar idea to Nivre’s dependency parsing algorithm). The
results for the new parser using machine learning are:


*Statistical parser ... F-measure 87.87%* (precision=90.02%, recall=85.82%)
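
To illustrate what "parsing as classification" can look like in general, here is a minimal sketch of an arc-standard transition-based dependency parser in the style of Nivre's work. It is not the parser described in this email, whose details are not given here. The names `parse`, `make_oracle` and the action labels are hypothetical, and the `choose` function stands in for a trained classifier: in a real system it would be a model trained on (parser state, action) pairs derived from the treebank.

```python
from typing import Callable, Dict, List, Tuple

Arc = Tuple[int, int]  # (head index, dependent index); index 0 is an artificial root


def parse(n_words: int, choose: Callable[[List[int], List[int]], str]) -> List[Arc]:
    """Arc-standard parsing: repeatedly apply the action picked by `choose`."""
    stack: List[int] = [0]
    buffer: List[int] = list(range(1, n_words + 1))
    arcs: List[Arc] = []
    while buffer or len(stack) > 1:
        action = choose(stack, buffer)
        if action == "LEFT" and len(stack) > 2:
            dep = stack.pop(-2)            # second item becomes a dependent of the top
            arcs.append((stack[-1], dep))
        elif action == "RIGHT" and len(stack) > 1:
            dep = stack.pop()              # top becomes a dependent of the item below
            arcs.append((stack[-1], dep))
        elif buffer:
            stack.append(buffer.pop(0))    # SHIFT (also used if the action is illegal)
        else:
            dep = stack.pop()              # nothing left to shift: attach to finish
            arcs.append((stack[-1], dep))
    return arcs


def make_oracle(heads: Dict[int, int]) -> Callable[[List[int], List[int]], str]:
    """Stand-in for a trained classifier: reads actions off gold-standard heads."""
    def choose(stack: List[int], buffer: List[int]) -> str:
        if len(stack) >= 2:
            top, below = stack[-1], stack[-2]
            if below != 0 and heads[below] == top:
                return "LEFT"
            # Attach the top only once all of its own dependents have been found.
            if heads[top] == below and all(heads[b] != top for b in buffer):
                return "RIGHT"
        return "SHIFT"
    return choose


# Toy three-word sentence: word 2 heads words 1 and 3, and hangs off the root.
print(parse(3, make_oracle({1: 2, 2: 0, 3: 2})))  # [(2, 1), (2, 3), (0, 2)]
```

The point of the sketch is that each step is a single decision among a small set of actions, which is why an off-the-shelf classifier can be trained to make it.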


Not only is this a big jump in accuracy (from 79% to 88%), but the parser
also takes only 15 seconds to train on the existing Treebank, compared to
many months of development time spent refining the hand-crafted constraint
dependency rules of the rule-based parser. I am very excited about this! Immediately, what
comes to mind is:


1) We are now using a data-driven statistical parser based on machine
learning, with accuracy comparable to state-of-the-art statistical parsers
for dependency grammar.


2) The improved accuracy of the new parser means that continuing to develop
the syntactic treebank will be quicker, since the parser's output is now
much more accurate. From reviewing the new syntactic analyses, they also
appear to be more consistent.


3) Completion of the treebank should also now move faster because I have to
spend less effort on the time-consuming task of building a rule-based
parser by hand, and I can spend more time on ensuring accuracy by
proofreading the automatic syntactic analyses.


4) This should lead to a stronger journal paper submission on statistical
dependency parsing of Quranic Arabic. In fact, I am so excited about this
that I am keen to start working on this paper as soon as I have got the FAL
submission out of the way.


5) I now intend to rework the PhD project plan to include this updated
information.


Looking forward to hearing from you! I hope it's okay that I have CC'd the
comp-quran mailing list; I would be keen to hear from others who have an
interest in, or experience with, statistical parsing. Any comments are most
welcome.


Kind Regards,


-- Kais
