Hi Waleed,
At the moment the work I am doing on parsing the Quran statistically is very
much still in the experimental stages, although I can say that I have now
replaced the previous rule-based parser with this new statistical parser to
construct the treebank. With regard to your questions:
1) The parser is similar to MaltParser (it uses a shift/reduce stack/queue
algorithm), but has been designed with a particular grammatical formalism in
mind (see http://corpus.quran.com/documentation/dependencygraph.jsp), i.e.
it supports non-terminal phrase nodes directly.
2) Nothing is available for download right now. This is currently an
internal project to assist with construction of the Quranic Arabic Treebank,
BUT hopefully I will make it available for download at some stage. I later
plan to investigate applying the parsing algorithm to other datasets /
languages to see the results.
3) I would be interested to know your results from using MADA, MaltParser,
MSTParser, ... or other dependency parsers. How are your own research and
results coming along?
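To make the shift/reduce stack/queue idea in point 1 concrete, here is a
minimal Python sketch of a generic arc-standard transition loop. It is an
illustration only and not the parser used for the treebank: the function
names, the English example sentence, and the scripted action sequence that
stands in for a trained classifier are all assumptions, and the sketch does
not handle the non-terminal phrase nodes mentioned above.

```python
from typing import Callable, Iterable, List, Tuple

Arc = Tuple[int, int]  # (head index, dependent index)
Chooser = Callable[[List[int], List[int]], str]


def parse(n_words: int, choose: Chooser) -> List[Arc]:
    """Run a generic arc-standard shift/reduce loop over word indices.

    `choose` is a hypothetical stand-in for a trained classifier: it looks
    at the stack and queue and returns "SHIFT", "LEFT" or "RIGHT".
    """
    stack: List[int] = []
    queue: List[int] = list(range(n_words))
    arcs: List[Arc] = []
    while queue or len(stack) > 1:
        action = choose(stack, queue)
        if action == "SHIFT" and queue:
            # Move the next word from the queue onto the stack.
            stack.append(queue.pop(0))
        elif action == "LEFT" and len(stack) >= 2:
            # Second-from-top becomes a dependent of the top of the stack.
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif action == "RIGHT" and len(stack) >= 2:
            # Top of the stack becomes a dependent of the word below it.
            dep = stack.pop()
            arcs.append((stack[-1], dep))
        else:
            raise ValueError(f"invalid action {action!r} in this state")
    return arcs


def scripted_chooser(actions: Iterable[str]) -> Chooser:
    """Replay a fixed action sequence (assumed here in place of a model)."""
    it = iter(actions)
    return lambda stack, queue: next(it)


# Toy sentence (an assumption for illustration): 0="the", 1="cat", 2="sleeps"
arcs = parse(3, scripted_chooser(["SHIFT", "SHIFT", "LEFT", "SHIFT", "LEFT"]))
print(arcs)  # [(1, 0), (2, 1)]: "cat" heads "the", "sleeps" heads "cat"
```

In a statistical parser of this family, the scripted sequence would be
replaced by a classifier that predicts the next action from features of the
current stack and queue.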
Looking forward to hearing from you.
Kind Regards,
-- Kais
On Thu, Sep 16, 2010 at 1:39 PM, Waleed Oransa wora...@gmail.com wrote:
Hello Kais,
This is a very good enhancement, with very good results. May I know what the
difference is between your parser and other statistical parsers like
MADA?
Also, is yours available for download, or do you plan to make it available
in the future?
I am working on MT and Automatic Tashkeel for Arabic and I am interested in
using your parser in this research.
Best regards,
Waleed
On Sun, Sep 12, 2010 at 12:54 PM, Kais Dukes k...@kaisdukes.com wrote:
Hello Eric,
Some very exciting news … well at least exciting to me :-) Please accept
my apologies for not being very responsive on e-mail recently, but I had
locked myself in my study most evenings after coming home from work to
concentrate on something that I have found most interesting. For the past 12
months, development of the Quranic Arabic Dependency Treebank (
http://corpus.quran.com/treebank.jsp) has been slow and has involved me going
through the following steps repeatedly:
1. Use a hand-written rule-based parser to produce an initial draft
syntactic analysis of a verse of the Quran, e.g. see:
http://corpus.quran.com/treebank.jsp?chapter=67
2. Correct the output of the parser and add the resulting proofread verse
to the treebank.
3. Potentially improve the parser’s accuracy by reviewing its rules
against the new larger set of data in the Treebank.
Improving the hand-written parser has been a costly exercise, involving the
addition of new grammar rules and refining these many times over. However,
the parser has performed well. Run against the current draft treebank
covering approx. 20% of the Quran, the rule-based parser is 78.79% accurate
in terms of its automatic grammatical analysis using traditional Arabic
dependency grammar:
*Rule-based parser ... F-measure 78.79%* (precision=90.13%,
recall=69.99%)
Over the last few weeks I have been looking into moving away from the
rule-based parser and starting to use a probabilistic parser, trained
statistically via machine learning. This new parser reads the existing
treebank and learns how to perform syntactic analysis for the rest of the
Quran automatically. I am very excited to announce that I have found a way
to recast the problem of syntactic analysis in traditional Arabic grammar
as a statistical classification problem (following a similar idea to
Nivre’s dependency parsing algorithm). The results for the new parser using
machine learning are:
*Statistical parser ... F-measure 87.87%* (precision=90.02%,
recall=85.82%)
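As a quick check on the two figures above, the F-measure can be recomputed
from the reported precision and recall. The short Python sketch below assumes
the balanced F1 definition (the harmonic mean of precision and recall); the
function name is only illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall percentages as reported for the two parsers.
rule_based = f_measure(90.13, 69.99)
statistical = f_measure(90.02, 85.82)

print(f"Rule-based parser:  {rule_based:.2f}%")   # 78.79%
print(f"Statistical parser: {statistical:.2f}%")  # 87.87%
```

Both values match the reported F-measures. Precision is almost unchanged
between the two parsers, so the gain comes from recall.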
Not only is this a big jump in accuracy (from 79% to 88%), but the parser
also takes only 15 seconds to train on the existing Treebank, compared to
many months of development time spent refining hand-crafted constraint
dependency rules for the rule-based parser. I am very excited about this!
Immediately, what comes to mind is:
1) We are now using a data-driven statistical parser based on machine
learning, with accuracy comparable to state-of-the-art statistical
parsers for dependency grammar.
2) The improved accuracy of the new parser means that continuing to
develop the syntactic treebank will be quicker, since the resulting output
is now much more accurate. From reviewing the new syntactic analyses, they
also appear to be more consistent.
3) Completion of the treebank should also now move faster because I have
to spend less effort on the time-consuming task of building a rule-based
parser by hand, and I can spend more time on ensuring accuracy by
proofreading the automatic syntactic analyses.
4) This should lead to a stronger journal paper submission on statistical
dependency parsing of Quranic Arabic. In fact, I am so excited about this
that I am keen to start working on this paper as soon as I have got the FAL