Hello Eric,
Some very exciting news … well, at least exciting to me :-) Please accept my apologies for not being very responsive on e-mail recently. I have locked myself in my study most evenings after coming home from work to concentrate on something that I have found most interesting.

For the past 12 months, development of the Quranic Arabic Dependency Treebank (http://corpus.quran.com/treebank.jsp) has been slow, and has involved me going through the following steps repeatedly:

1. Use a hand-written rule-based parser to produce an initial draft syntactic analysis of a verse of the Quran, e.g. see: http://corpus.quran.com/treebank.jsp?chapter=67
2. Correct the output of the parser and add the resulting proofread verse to the treebank.
3. Potentially improve the parser’s accuracy by reviewing its rules against the new, larger set of data in the treebank.

Improving the hand-written parser has been a costly exercise, involving the addition of new grammar rules and refining these many times over. However, the parser has performed well. Run against the current draft treebank, which covers approx. 20% of the Quran, the rule-based parser achieves an F-measure of 78.79% for its automatic grammatical analysis using traditional Arabic dependency grammar:

    Rule-based parser ... F-measure 78.79% (precision=90.13%, recall=69.99%)

Over the last few weeks I have been looking into moving away from the rule-based parser and starting to use a probabilistic parser, trained statistically via machine learning. This new parser reads the existing treebank and "learns" how to perform syntactic analysis for the rest of the Quran automatically. I am very excited to announce that I have found a way to recast the problem of syntactic analysis in traditional Arabic grammar as a statistical classification problem (following a similar idea to Nivre’s dependency parsing algorithm). The results for the new parser using machine learning are:

    Statistical parser ... F-measure 87.87% (precision=90.02%, recall=85.82%)

Not only is this a big jump in accuracy (from 79% to 88%), but the parser also takes only 15 seconds to train on the existing treebank, compared to the many months of development time spent refining hand-crafted constraint dependency rules for the rule-based parser. I am very excited about this! Immediately, what comes to mind is:

1) We are now using a data-driven statistical parser based on machine learning, with accuracy comparable to state-of-the-art statistical parsers for dependency grammar.
2) The improved accuracy of the new parser means that continuing to develop the syntactic treebank will be quicker, since the output is now much more accurate. From reviewing the new syntactic analyses, they also appear to be more consistent.
3) Completion of the treebank should also now move faster, because I have to spend less effort on the time-consuming task of building a rule-based parser by hand and can spend more time on ensuring accuracy by proofreading the automatic syntactic analyses.
4) This should lead to a stronger journal paper submission on statistical dependency parsing of Quranic Arabic. In fact, I am so excited about this that I am keen to start working on this paper as soon as I have got the FAL submission out of the way.
5) I now intend to rework the PhD project plan to include this updated information.

Looking forward to hearing from you! I hope it's okay that I have CC'd the comp-quran mailing list. I would be keen to hear from others who have an interest in, or experience with, statistical parsing. Any comments are most welcome.

Kind Regards,

-- Kais