Re: Statistical Dependency Parsing of Quranic Arabic

2010-09-16 Thread Waleed Oransa
Hello Kais,

This is a very good enhancement, with good results. May I know what the
difference is between your parser and other statistical parsers such as
MADA?
Also, is yours available for download, or do you plan to release it in the
future?

I am working on MT and automatic tashkeel for Arabic, and I am interested in
using your parser in this research.

Best regards,
Waleed

On Sun, Sep 12, 2010 at 12:54 PM, Kais Dukes k...@kaisdukes.com wrote:

 Hello Eric,


 Some very exciting news … well at least exciting to me :-) Please accept my
 apologies for not being very responsive on e-mail recently, but I had locked
 myself in my study most evenings after coming home from work to concentrate
 on something that I have found most interesting. For the past 12 months,
 development of the Quranic Arabic Dependency Treebank (
 http://corpus.quran.com/treebank.jsp) has been slow, and has involved me
 going through the following steps repeatedly:


 1. Use a hand-written rule based parser to produce an initial draft
 syntactic analysis of a verse of the Quran, e.g. see:
 http://corpus.quran.com/treebank.jsp?chapter=67


 2. Correct the output of the parser and add the resulting proofread verse
 to the treebank.


 3. Potentially improve the parser’s accuracy by reviewing its rules against
 the new, larger set of data in the treebank.


 Improving the hand-written parser has been a costly exercise, involving the
 addition of new grammar rules and refining these many times over. However,
 the parser has performed well. Run against the current draft treebank
 covering approx. 20% of the Quran, the rule-based parser achieves an
 F-measure of 78.79% for its automatic grammatical analysis using traditional
 Arabic dependency grammar:


 *Rule-based parser ... F-measure 78.79%* (precision=90.13%, recall=69.99%)


 Over the last few weeks I have been looking into moving away from the
 rule-based parser and starting to use a probabilistic parser, trained
 statistically via machine learning. This new parser reads the existing
 treebank and learns how to perform syntactic analysis for the rest of the
 Quran automatically. I am very excited to announce that I have found a way
 to recast the problem of syntactic analysis in traditional Arabic grammar as
 a statistical classification problem (following a similar idea to Nivre’s
 dependency parsing algorithm). The results for the new parser using machine
 learning are:


 *Statistical parser ... F-measure 87.87%* (precision=90.02%,
 recall=85.82%)
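
 The two F-measure figures can be checked from the reported precision and
 recall. The email does not say which F-measure is used, so the minimal sketch
 below assumes the balanced F1 score (the harmonic mean of precision and
 recall); the function and variable names are illustrative only.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): the harmonic mean of precision and recall.

    Assumption: the scores quoted in the email are F1; the email itself
    only says "F-measure".
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as quoted in the email.
rule_based = f_measure(90.13, 69.99)
statistical = f_measure(90.02, 85.82)

print(f"Rule-based parser:  F-measure {rule_based:.2f}%")
print(f"Statistical parser: F-measure {statistical:.2f}%")
```

 Under that assumption the computed values agree with the quoted 78.79% and
 87.87%. Precision is almost unchanged between the two parsers, so the
 improvement comes almost entirely from recall (69.99% to 85.82%).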


 Not only is this a big jump in accuracy (from 79% to 88%), but the parser
 also takes only 15 seconds to train on the existing treebank, compared to
 many months of development time spent refining hand-crafted constraint
 dependency rules for the rule-based parser. I am very excited about this!
 Immediately, what comes to mind is:


 1) We are now using a data-driven statistical parser based on machine
 learning, with accuracy comparable to state-of-the-art statistical parsers
 for dependency grammar.


 2) The improved accuracy of the new parser means that continuing to develop
 the syntactic treebank will be quicker, since the resulting output is now
 much more accurate. On review, the new syntactic analyses also appear to be
 more consistent.


 3) Completion of the treebank should also now move faster because I have to
 spend less effort on the time-consuming task of building a rule-based
 parser by hand, and I can spend more time on ensuring accuracy by
 proofreading the automatic syntactic analyses.


 4) This should lead to a stronger journal paper submission on statistical
 dependency parsing of Quranic Arabic. In fact, I am so excited about this
 that I am keen to start working on this paper as soon as I have got the FAL
 submission out of the way.


 5) I now intend to rework the PhD project plan to include this updated
 information.


 Looking forward to hearing from you! I hope it's okay that I have CC'd the
 comp-quran mailing list; I would be keen to hear from others who have an
 interest in, or experience with, statistical parsing. Any comments are most
 welcome.


 Kind Regards,


 -- Kais




Re: Statistical Dependency Parsing of Quranic Arabic

2010-09-16 Thread Kais Dukes
Hi Waleed,

At the moment the work I am doing on parsing the Quran statistically is very
much still in the experimental stages, although I can say that I have now
replaced the previous rule-based parser with this new statistical parser to
construct the treebank. With regard to your questions:

1) The parser is similar to MaltParser (it uses a shift/reduce stack/queue
algorithm), but has been designed with a particular grammatical formalism in
mind (see http://corpus.quran.com/documentation/dependencygraph.jsp), i.e.
it supports non-terminal phrase nodes directly.

2) Nothing is available for download right now. This is currently an
internal project to assist with construction of the Quranic Arabic Treebank,
BUT hopefully I will make it available for download at some stage. I later
plan to investigate applying the parsing algorithm to other datasets /
languages to see the results.

3) I would be interested to know your results from using MADA, MaltParser,
MSTParser, ... or other dependency parsers. How are your own research and
results coming along?
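
To make the shift/reduce stack/queue idea in point 1 concrete, here is a
minimal sketch of a generic arc-standard transition parser of the kind
associated with Nivre's work. It is not the Quranic treebank parser: it has no
non-terminal phrase nodes, and all names (parse, make_oracle, the action
labels) are hypothetical. The action chooser here is a gold-tree oracle; in a
statistical parser it would be replaced by a classifier trained on
(configuration, action) pairs taken from the treebank.

```python
from collections import deque


def parse(n_words, choose_action):
    """Arc-standard shift/reduce parse over word indices 0..n_words-1.

    Returns a list of (head, dependent) arcs. choose_action is any callable
    that maps the current configuration to "SHIFT", "LEFT" or "RIGHT".
    """
    stack, queue, arcs = [], deque(range(n_words)), []
    while queue or len(stack) > 1:
        action = choose_action(stack, queue, arcs)
        if action == "SHIFT" and queue:
            stack.append(queue.popleft())
        elif action == "LEFT" and len(stack) >= 2:
            # Top of stack becomes the head of the item below it.
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT" and len(stack) >= 2:
            # Item below the top becomes the head of the top.
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"invalid action {action!r} in this configuration")
    return arcs


def make_oracle(heads):
    """Build a chooser that reads the correct action off a gold tree.

    heads[i] is the head index of word i, or -1 for the root. Assumes the
    gold tree is projective.
    """
    def oracle(stack, queue, arcs):
        if len(stack) >= 2:
            top, below = stack[-1], stack[-2]
            if heads[below] == top:
                return "LEFT"
            attached = {dep for _, dep in arcs}
            top_done = all(d in attached
                           for d, h in enumerate(heads) if h == top)
            if heads[top] == below and top_done:
                return "RIGHT"
        return "SHIFT"
    return oracle


# Toy example (English, for readability): "reads" is the root,
# "she" and "books" both depend on it.
words = ["she", "reads", "books"]
gold_heads = [1, -1, 1]
arcs = parse(len(words), make_oracle(gold_heads))
for head, dep in arcs:
    print(f"{words[head]} -> {words[dep]}")
```

Each step is a single decision among a small set of actions, which is what
allows parsing to be treated as a classification problem.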

Looking forward to hearing from you.

Kind Regards,

-- Kais

On Thu, Sep 16, 2010 at 1:39 PM, Waleed Oransa wora...@gmail.com wrote:

 Hello Kais,

 This is a very good enhancement, with good results. May I know what the
 difference is between your parser and other statistical parsers such as
 MADA?
 Also, is yours available for download, or do you plan to release it in the
 future?

 I am working on MT and automatic tashkeel for Arabic, and I am interested in
 using your parser in this research.

 Best regards,
 Waleed

 On Sun, Sep 12, 2010 at 12:54 PM, Kais Dukes k...@kaisdukes.com wrote:

 Hello Eric,


 Some very exciting news … well at least exciting to me :-) Please accept
 my apologies for not being very responsive on e-mail recently, but I had
 locked myself in my study most evenings after coming home from work to
 concentrate on something that I have found most interesting. For the past 12
 months, development of the Quranic Arabic Dependency Treebank (
 http://corpus.quran.com/treebank.jsp) has been slow, and has involved me
 going through the following steps repeatedly:


 1. Use a hand-written rule based parser to produce an initial draft
 syntactic analysis of a verse of the Quran, e.g. see:
 http://corpus.quran.com/treebank.jsp?chapter=67


 2. Correct the output of the parser and add the resulting proofread verse
 to the treebank.


 3. Potentially improve the parser’s accuracy by reviewing its rules against
 the new, larger set of data in the treebank.


 Improving the hand-written parser has been a costly exercise, involving the
 addition of new grammar rules and refining these many times over. However,
 the parser has performed well. Run against the current draft treebank
 covering approx. 20% of the Quran, the rule-based parser achieves an
 F-measure of 78.79% for its automatic grammatical analysis using traditional
 Arabic dependency grammar:


 *Rule-based parser ... F-measure 78.79%* (precision=90.13%,
 recall=69.99%)


 Over the last few weeks I have been looking into moving away from the
 rule-based parser and starting to use a probabilistic parser, trained
 statistically via machine learning. This new parser reads the existing
 treebank and learns how to perform syntactic analysis for the rest of the
 Quran automatically. I am very excited to announce that I have found a way
 to recast the problem of syntactic analysis in traditional Arabic grammar as
 a statistical classification problem (following a similar idea to Nivre’s
 dependency parsing algorithm). The results for the new parser using machine
 learning are:


 *Statistical parser ... F-measure 87.87%* (precision=90.02%,
 recall=85.82%)


 Not only is this a big jump in accuracy (from 79% to 88%), but the parser
 also takes only 15 seconds to train on the existing treebank, compared to
 many months of development time spent refining hand-crafted constraint
 dependency rules for the rule-based parser. I am very excited about this!
 Immediately, what comes to mind is:


 1) We are now using a data-driven statistical parser based on machine
 learning, with accuracy comparable to state-of-the-art statistical parsers
 for dependency grammar.


 2) The improved accuracy of the new parser means that continuing to develop
 the syntactic treebank will be quicker, since the resulting output is now
 much more accurate. On review, the new syntactic analyses also appear to be
 more consistent.


 3) Completion of the treebank should also now move faster because I have
 to spend less effort on the time-consuming task of building a rule-based
 parser by hand, and I can spend more time on ensuring accuracy by
 proofreading the automatic syntactic analyses.


 4) This should lead to a stronger journal paper submission on statistical
 dependency parsing of Quranic Arabic. In fact, I am so excited about this
 that I am keen to start working on this paper as soon as I have got the FAL