Hi Michael,

Today I revised my proposal again and have a new version, which I will submit
on the GSoC web site soon. I am sending it to you first; if you have any advice
or suggestions, just let me know. Thank you. I hope I can finish this project
well and do something, or even more, for Xerces :-)


2010-03-29 



xiaohei.leiyin 



From: Michael Glavassevich 
Sent: 2010-03-29  11:51:01 
To: [email protected] 
Cc: 
Subject: Re: GSoC proposal about "Asynchronous LSParser and parseWithContext " 
 
Hi Yin,

Yin Lei <[email protected]> wrote on 03/25/2010 04:53:16 AM:

> Hi Michael,
>  
> About the function parseWithContext(LSInput input,  Node 
> contextArg,short action), there is a point i am not so clear.
>  
> If LSInput contains following content:
>  
> <?xml version="1.0" encoding="UTF-8"?>
> <element id="1">element_one</element>
> <element id="2">element_two</element>
>  
> For a LSInput, is it well-format or legal ? Or we could just neglect
> XML declation ?

It matches the production [1] for well-formed external parsed entities so I 
would say yes it's allowed. That's a text declaration [2] by the way, not an 
XML declaration.

> If this input is legal,action is ACTION_INSERT_AFTER and contextArg 
> is a DOM element has the following content:
>  
> <contextnode>content here</contextnode>
>  
> Should we return this DOM Node ?
>  
> <contextnode>content here</contextnode>
> <element id="1">element_one</element> 
> <element id="2">element_two</element>

As long as the parent of "contextnode" is an Element or a DocumentFragment that 
is the correct result.
  
> Thank you and expceting your reply

Thanks.

[1] http://www.w3.org/TR/2006/REC-xml-20060816/#NT-extParsedEnt
[2] http://www.w3.org/TR/2006/REC-xml-20060816/#NT-TextDecl

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [email protected]
E-mail: [email protected] 

Google Summer of Code 2010 - Project Proposal

Project

Implement Xerces' Asynchronous LSParser and parseWithContext

Student Name

Yin Lei

Email

[email protected]

Time zone

UTC+8 (China)


Abstract

Apache Xerces2 is a powerful XML parser that implements a collection of standard APIs for XML processing. Although Xerces has a functional DOM Level 3 LSParser, a couple of parts of the specification still need to be implemented. This project will provide an asynchronous version of LSParser, which returns from the parse method immediately and builds the DOM tree on another thread, and will implement the function parseWithContext, which allows a document fragment to be parsed and attached to an existing DOM.

Description

Apache Xerces-J is a high-performance, standards-compliant processor written in Java for parsing, validating, serializing and manipulating XML documents. It provides a complete implementation of the Document Object Model Level 3 Core and Document Object Model Level 3 Load and Save Recommendations, but Xerces' implementation of LSParser has two limitations (http://xerces.apache.org/xerces2-j/dom3.html):

1. It does not support the asynchronous LSParser, which returns from the parse method immediately and builds the DOM tree on another thread.

2. It does not support the function parseWithContext of interface LSParser, which parses an XML fragment from a resource identified by an LSInput and inserts the content into an existing document at the position specified with the context and action arguments.

To address these two limitations, I have been studying the W3C Recommendation covering LSParser. In the meantime I downloaded the Xerces2-J source code, imported it into my Eclipse workspace, read through it repeatedly and considered how to implement these two parts of the specification. I have also discussed the subject with the Xerces developers (you have helped me a lot, thank you, especially Michael Glavassevich). I now have some ideas for a solution and have run some experiments to check them. What follows is only an overall design, and some details are omitted.

1.interface DOMImplementationLS

Class org.apache.xerces.dom.CoreDOMImplementationImpl implements this interface. As described in the W3C Recommendation, an implementation of DOMImplementationLS should supply a function createLSParser that can create a synchronous LSParser as well as an asynchronous one, but at present CoreDOMImplementationImpl's createLSParser can only return the former. I need to fix this.
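
To make the gap concrete, here is a minimal sketch of how a caller would ask for each mode through the standard DOM Level 3 bootstrap API. The class and method names (CreateParserSketch, loadSave, supportsAsync) are made up for illustration; only the org.w3c.dom calls are real API. The sketch assumes an implementation that refuses the asynchronous mode does so with a DOMException, as the specification prescribes.

```java
import org.w3c.dom.DOMException;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSParser;

/** Hypothetical helper for illustration; not part of Xerces. */
class CreateParserSketch {

    /** Looks up a Load and Save implementation through the standard DOM registry. */
    static DOMImplementationLS loadSave() {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            return (DOMImplementationLS) registry.getDOMImplementation("LS");
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("DOM registry could not be created", e);
        }
    }

    /** Returns true only if an asynchronous parser was actually handed out. */
    static boolean supportsAsync(DOMImplementationLS ls) {
        try {
            LSParser parser = ls.createLSParser(DOMImplementationLS.MODE_ASYNCHRONOUS, null);
            return parser.getAsync();
        } catch (DOMException e) {
            // NOT_SUPPORTED_ERR is the specified answer for an unsupported mode.
            return false;
        }
    }
}
```

Once the project is done, supportsAsync should return true for Xerces; the synchronous call with MODE_SYNCHRONOUS already works today.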

2.interface LSParser

Class org.apache.xerces.parsers.DOMParserImpl implements this interface, but it supports only the synchronous mode; its getAsync function simply returns false. Here is my plan for providing an asynchronous version of LSParser.

  • Step one : DOMParserImpl implements interface EventTarget as well as interface LSParser.

    It uses a Vector object (which we call the repository) to store all the event listeners registered on the current LSParser object. Each entry is made up of three parts: type, useCapture and the event handler function. There are only two types of event, load and progress. My next task is to implement the functions addEventListener, dispatchEvent and removeEventListener.

    addEventListener : add a listener entry to the repository. Note that a listener with the same parameters can only be added once.

    dispatchEvent : traverse the repository; if an entry has the same type value as the event and its useCapture value is true, invoke its handleEvent function.

    removeEventListener : traverse the repository; if an entry is the same as the object in the parameters, remove it from the repository.
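
The repository from step one could look roughly like the sketch below. ListenerRepository and its Entry class are hypothetical names, and an ArrayList stands in for the Vector mentioned above. One deliberate difference from the text: the sketch fires listeners registered with useCapture set to false, on the reading that DOM Level 2 Events does not trigger capturing listeners for an event dispatched directly at the target on which they are registered. Treat that choice as an assumption to confirm against the specification.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.events.Event;
import org.w3c.dom.events.EventListener;
import org.w3c.dom.events.EventTarget;

/** Hypothetical listener repository; not Xerces code. */
class ListenerRepository implements EventTarget {

    /** One registration: event type, handler, and capture flag. */
    private static final class Entry {
        final String type;
        final EventListener listener;
        final boolean useCapture;

        Entry(String type, EventListener listener, boolean useCapture) {
            this.type = type;
            this.listener = listener;
            this.useCapture = useCapture;
        }

        boolean matches(String t, EventListener l, boolean c) {
            return type.equals(t) && listener == l && useCapture == c;
        }
    }

    private final List<Entry> repository = new ArrayList<>();

    @Override
    public synchronized void addEventListener(String type, EventListener listener, boolean useCapture) {
        if (type == null || listener == null) {
            return;
        }
        for (Entry e : repository) {
            if (e.matches(type, listener, useCapture)) {
                return; // an identical registration is silently ignored
            }
        }
        repository.add(new Entry(type, listener, useCapture));
    }

    @Override
    public synchronized void removeEventListener(String type, EventListener listener, boolean useCapture) {
        repository.removeIf(e -> e.matches(type, listener, useCapture));
    }

    @Override
    public boolean dispatchEvent(Event evt) {
        List<Entry> snapshot;
        synchronized (this) {
            snapshot = new ArrayList<>(repository); // listeners may add or remove entries while running
        }
        for (Entry e : snapshot) {
            if (!e.useCapture && e.type.equals(evt.getType())) {
                e.listener.handleEvent(evt);
            }
        }
        return true; // this sketch never reports preventDefault
    }

    /** Number of registrations, exposed for inspection in tests. */
    synchronized int size() {
        return repository.size();
    }
}
```

Dispatching over a snapshot lets a handler remove itself without disturbing the iteration.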

  • Step two : implement interface LSLoadEvent

    In the asynchronous LSParser, LSLoadEvent is used to signal that the parse function has finished its parsing job. We can achieve this by calling LSParser's dispatchEvent function with an LSLoadEvent as the parameter.

  • Step three : implement interface LSProgressEvent

    In the asynchronous LSParser, the parse thread triggers an LSProgressEvent when it finishes parsing an entity node; the LSProgressEvent reports the current parse position to the LSParser. If the parser discovers more external resource references, it may also change the totalSize value.
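
Steps two and three could share most of their code. The sketch below puts the common Event plumbing in one base class and adds only the event-specific getters on top. All three class names are hypothetical, and treating both events as non-bubbling and non-cancelable is an assumption of the sketch, not something taken from the Xerces code.

```java
import org.w3c.dom.Document;
import org.w3c.dom.events.Event;
import org.w3c.dom.events.EventTarget;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSLoadEvent;
import org.w3c.dom.ls.LSProgressEvent;

/** Shared Event plumbing for both event types; hypothetical, not Xerces code. */
abstract class LSEventBase implements Event {
    private String type;
    private final EventTarget target;
    private final LSInput input;
    private final long timeStamp = System.currentTimeMillis();

    LSEventBase(String type, EventTarget target, LSInput input) {
        this.type = type;
        this.target = target;
        this.input = input;
    }

    public String getType() { return type; }
    public EventTarget getTarget() { return target; }
    public EventTarget getCurrentTarget() { return target; }
    public short getEventPhase() { return Event.AT_TARGET; }
    public boolean getBubbles() { return false; }     // assumption of this sketch
    public boolean getCancelable() { return false; }  // assumption of this sketch
    public long getTimeStamp() { return timeStamp; }
    public void stopPropagation() { /* nothing to stop: the parser has no ancestors */ }
    public void preventDefault() { /* no default action in this sketch */ }
    public void initEvent(String typeArg, boolean canBubbleArg, boolean cancelableArg) {
        this.type = typeArg; // the two flags are fixed in this sketch
    }
    public LSInput getInput() { return input; }
}

/** Fired once, when the parse thread has finished building the document. */
final class LoadEventSketch extends LSEventBase implements LSLoadEvent {
    private final Document newDocument;

    LoadEventSketch(EventTarget target, LSInput input, Document newDocument) {
        super("load", target, input);
        this.newDocument = newDocument;
    }

    public Document getNewDocument() { return newDocument; }
}

/** Fired repeatedly while parsing, carrying the current position and total size. */
final class ProgressEventSketch extends LSEventBase implements LSProgressEvent {
    private final int position;
    private final int totalSize;

    ProgressEventSketch(EventTarget target, LSInput input, int position, int totalSize) {
        super("progress", target, input);
        this.position = position;
        this.totalSize = totalSize;
    }

    public int getPosition() { return position; }
    public int getTotalSize() { return totalSize; }
}
```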

  • Step four : implement asynchronous mechanism

    DOMParserImpl has an attribute that marks its mode, synchronous or asynchronous, which can be read through the function getAsync. If the parser is asynchronous, then when the LSParser instance's parse() function is called, it sets the busy value to true, starts a Thread to parse the XML document in the LSInput, and returns null. When the parse thread finishes its job, it sets the busy value to false, creates an LSLoadEvent instance with type value load, and calls dispatchEvent(Event evt). If the user has registered any event listener for the load event, dispatchEvent runs the work defined in that listener's handleEvent function.
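
The control flow of step four can be tried out today by running the existing synchronous parser on a worker thread. The sketch below does that; AsyncParseSketch is a hypothetical wrapper, a plain Consumer callback stands in for the load event and its listener, and error reporting is left out. It is not how DOMParserImpl would be changed, only an illustration of the busy flag, the immediate null return, and the completion callback.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;

/** Hypothetical wrapper, not Xerces code: a synchronous LSParser run on a worker thread. */
class AsyncParseSketch {
    private final DOMImplementationLS ls;
    private final AtomicBoolean busy = new AtomicBoolean(false);

    AsyncParseSketch() {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            ls = (DOMImplementationLS) registry.getDOMImplementation("LS");
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("DOM registry could not be created", e);
        }
    }

    boolean getBusy() {
        return busy.get();
    }

    LSInput inputFor(String xml) {
        LSInput input = ls.createLSInput();
        input.setStringData(xml);
        return input;
    }

    /** Starts parsing and returns null at once; onLoad stands in for the load event. */
    Document parse(LSInput input, Consumer<Document> onLoad) {
        if (!busy.compareAndSet(false, true)) {
            throw new DOMException(DOMException.INVALID_STATE_ERR, "parser is busy");
        }
        Thread worker = new Thread(() -> {
            Document doc;
            try {
                LSParser parser = ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
                doc = parser.parse(input);
            } finally {
                busy.set(false); // cleared before listeners run, as in the text above
            }
            onLoad.accept(doc);
        }, "ls-async-parse");
        worker.start();
        return null;
    }

    /** Usage helper: starts a parse and waits until the callback delivers the document. */
    Document parseAndWait(String xml) {
        CountDownLatch done = new CountDownLatch(1);
        AtomicReference<Document> result = new AtomicReference<>();
        parse(inputFor(xml), doc -> {
            result.set(doc);
            done.countDown();
        });
        try {
            done.await(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result.get();
    }
}
```

In the real implementation the callback would be replaced by dispatchEvent with an LSLoadEvent, and a second parse() call while busy is true would be rejected in the same way.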
3.function parseWithContext(LSInput input, Node contextArg, short action)

    This function parses an XML fragment from a resource identified by an LSInput and inserts the content into an existing document at the position specified with the context and action arguments. The XML fragment is a special data structure, so I need to construct a new class named XMLFragment to store it. Then I should do the following:

    1). Parse the XML fragment into an XMLFragment object and mark whether it is a complete XML document; if any error happens, throw an exception. These classes can help me:
    a . org.apache.xerces.impl.XMLDocumentFragmentScannerImpl
    b . org.apache.xerces.impl.XMLDocumentScannerImpl.FragmentContentDispatcher
    c . org.apache.xerces.impl.XMLEntityScanner
    I can use some functions in these classes and start the parsing job by following the function startEntity in class org.apache.xerces.impl.XMLDocumentScannerImpl and the function scanDocument in class org.apache.xerces.impl.XMLDocumentFragmentScannerImpl. Here is the basic idea of the implementation (in fact, this is a recursive process):
    I. Create an XMLFragment instance. It has a very important attribute, which we name fCurrentNode.
    II. Start reading characters from the LSInput stream. If a node start character such as "<" is read, go to step III; if a node end character such as "/" is read, go to step IV; if the end of the input is reached, go to step V.
    III. A begin event happens: usually, instantiate a Node object, append it to fCurrentNode's child node list, make it the new fCurrentNode, then go to step II.
    IV. An end event happens: usually, change fCurrentNode to its parent node, then go to step II.
    V. The parsing job ends.
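
The fCurrentNode bookkeeping in steps I to V can be illustrated without touching the Xerces scanners. The sketch below swaps in a standard SAX parser as the source of begin and end events and collects the result in a DocumentFragment instead of the proposed XMLFragment class. FragmentBuilderSketch is a hypothetical name, and wrapping the input in a synthetic element is a shortcut of the sketch: it would not work for input that starts with a text declaration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical illustration of the fCurrentNode bookkeeping; not Xerces code. */
class FragmentBuilderSketch extends DefaultHandler {
    private final Document owner;
    private final DocumentFragment fragment;
    private Node fCurrentNode; // step I: starts at the fragment itself
    private int depth;         // depth 1 is the synthetic wrapper element

    FragmentBuilderSketch(Document owner) {
        this.owner = owner;
        this.fragment = owner.createDocumentFragment();
        this.fCurrentNode = fragment;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth++ == 0) {
            return; // skip the wrapper that makes the fragment a document
        }
        Element element = owner.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            element.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        fCurrentNode.appendChild(element);
        fCurrentNode = element; // step III: descend into the new node
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (--depth == 0) {
            return; // closing the wrapper
        }
        fCurrentNode = fCurrentNode.getParentNode(); // step IV: climb back up
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        fCurrentNode.appendChild(owner.createTextNode(new String(ch, start, length)));
    }

    /** Parses fragment text with several top-level nodes into a DocumentFragment. */
    static DocumentFragment build(String fragmentText) {
        try {
            Document owner = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            FragmentBuilderSketch handler = new FragmentBuilderSketch(owner);
            String wrapped = "<wrapper>" + fragmentText + "</wrapper>";
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(wrapped)), handler);
            return handler.fragment;
        } catch (Exception e) {
            throw new IllegalArgumentException("fragment is not well-formed", e);
        }
    }
}
```

The sketch ignores namespaces, comments and processing instructions; the real implementation would have to carry all of them over.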

    2). Add the XMLFragment at the place indicated by the parameter action. In this phase there is a lot of validation to do, covering four aspects: Basic Validation, Namespace Validation, DTD Validation and Schema Validation.

    I. Basic Validation:
    a. I should validate whether the merge is legal. For example, if the context node is the document root element and the parameter action is not ACTION_REPLACE_CHILDREN, an error should be raised.
    b. I should confirm that the merged XML document is well-formed. For example, the DOM should have only one root element, and entity declarations must be at the beginning of the document.

    II. Namespace Validation
    I should validate both the element namespaces and the attribute namespaces of the merged XML document.

    III. DTD Validation
    a. Validate whether the merged XML document conforms to the element type declarations
    b. Validate whether the merged XML document conforms to the entity declarations
    c. Validate whether the merged XML document conforms to the attribute declarations. For example, if the DTD includes default attribute declarations, I should add the default attributes to the elements that are roots in the LSInput fragment.

    IV. Schema Validation
    This section includes the validation requirements listed under DTD Validation, plus some additional ones:
    a. Validate the data types of elements and attributes
    b. Three kinds of annotation declaration validation

    If everything is OK, return the result Node. Otherwise, if an error occurs, the caller is notified through the ErrorHandler instance associated with the "error-handler" parameter of the DOMConfiguration. As the new data is inserted into the document, at least one mutation event is fired per new immediate child or sibling of the context node.
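
To show the result expected from ACTION_INSERT_AFTER, the sketch below emulates it with calls that already work: it parses the fragment inside a synthetic wrapper element with a synchronous LSParser, imports the children into the target document, and inserts them after the context node. InsertAfterSketch and its methods are hypothetical, this is not the scanner-based design proposed above, and it performs none of the namespace, DTD or schema validation. Rejecting a context node whose parent is the Document is a simplification of the sketch.

```java
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;

/** Hypothetical emulation of ACTION_INSERT_AFTER; not Xerces code. */
class InsertAfterSketch {

    static DOMImplementationLS loadSave() {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            return (DOMImplementationLS) registry.getDOMImplementation("LS");
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("DOM registry could not be created", e);
        }
    }

    /** Parses a complete document from a string with a synchronous LSParser. */
    static Document parse(DOMImplementationLS ls, String xml) {
        LSInput input = ls.createLSInput();
        input.setStringData(xml);
        return ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null).parse(input);
    }

    /** Inserts the fragment's top-level nodes after context; returns the first one, or null. */
    static Node insertAfter(DOMImplementationLS ls, String fragmentText, Node context) {
        Node parent = context.getParentNode();
        if (parent == null || parent.getNodeType() == Node.DOCUMENT_NODE) {
            // Simplification: siblings of the root element are refused outright.
            throw new DOMException(DOMException.HIERARCHY_REQUEST_ERR,
                    "context node needs an Element or DocumentFragment parent");
        }
        Document target = context.getOwnerDocument();
        Node wrapper = parse(ls, "<wrapper>" + fragmentText + "</wrapper>").getDocumentElement();
        Node anchor = context.getNextSibling(); // null means "append at the end"
        Node first = null;
        for (Node n = wrapper.getFirstChild(); n != null; n = n.getNextSibling()) {
            Node copy = target.importNode(n, true); // the parsed source tree is left untouched
            parent.insertBefore(copy, anchor);
            if (first == null) {
                first = copy;
            }
        }
        return first;
    }
}
```

With a document such as <root><contextnode>content here</contextnode></root> and the two element nodes from the example discussed on the list, the root ends up with three children in the order contextnode, element 1, element 2, and the method returns element 1.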

    Things I have done so far

    I have been studying the W3C Recommendation covering LSParser and parseWithContext and the related specifications, ran into some questions, and discussed them with the Xerces developers on the Xerces development mailing list. To finish the project I need to understand LSParser in every detail. In addition, I started to read the literature, especially other related W3C standards and various tutorials, that would be helpful for this project. At the same time, I checked out and built the Xerces trunk, tried out some samples and tests, and started to study the code. If I want my code to become part of Xerces-J, I must follow the coding standards, style and package structure already in use. I want to look over every package in the Xerces source and get to know each class's function; this is a long-term study process and I have not finished it yet. I have also gone through existing issues on the Xerces developer and user mailing lists and searched JIRA for issues related to LSParser.

    Development Schedule

    March 18 - March 29

    Discussing the project idea with the community to get suggestions, feedback, etc.

    March 29 - April 9

    Submitting the project proposal

    April 26 - May 24

    Accepted student proposals announced on the Google Summer of Code 2010 site.

    Community Bonding Period (April 26 - May 24): Get to know mentors, read documentation, and prepare development environment.

    1st week in 1st month
    (May 24 - Jun 1)

    Read the Xerces-J source code and get familiar with its architecture, so that my work complies with its design philosophy

    2nd week in 1st month
    (Jun 1 - Jun 8)

    Modify DOMImplementationLS and DOMParserImpl so that DOMImplementationLS can create an asynchronous LSParser, and add some basic attributes to DOMParserImpl such as the asynchronous flag

    3rd week in 1st month
    (Jun 9 - Jun 16)

    Restructure DOMParserImpl to implement interface EventTarget; implement addEventListener, dispatchEvent and removeEventListener

    4th week in 1st month
    (Jun 17 - Jun 24)

    Implement LSParser parse() and parseURI() functions with asynchronous support
    implement LSParser function abort()
    implement LSParser function getAsync()
    implement LSParser function getBusy()

    1st week in 2nd month
    (Jun 25 - Jul 2)

    Implement interfaces LSLoadEvent and LSProgressEvent, finish the whole asynchronous parse cycle and write some unit tests

    2nd week in 2nd month
    (Jul 3 - Jul 10)

    finish the first sub-task of function parseWithContext(): parse the LSInput into an XMLFragment instance

    3rd week in 2nd month
    (Jul 11 - Jul 18)

    start merging the context Node and the XML fragment, finish Basic Validation and Namespace Validation

    4th week in 2nd month
    (Jul 19 - Jul 26)

    finish merging the context DOM tree and the XMLFragment, finish DTD Validation and Schema Validation

    1st week in 3rd month
    (Jul 27 - Aug 3)

    last week of development. Refine code, write tests, etc.

    last 2 weeks in 3rd month
    (Aug 3 - Aug 20)

    improve the documentation, submit all my work

    Deliverables

    • Source code
    • Required patches if any
    • A collection of tests that can be used to verify the functionality of asynchronous LSParser and function parseWithContext
    • Documentation

    Community Interaction

    I have subscribed to both the Xerces users list and the development list, and I posted a couple of times when I ran into difficulties installing and using Xerces and reading its source code. I also used the development list to introduce my interest in implementing the asynchronous LSParser and parseWithContext as a project, and to discuss some confusing parts of the W3C Recommendation. Even before that, I tried to communicate with last year's GSoC students and mentors, who gave me good advice about how to prepare for a GSoC open source project. Apart from that, I used the mailing list whenever possible to clear up doubts by asking the experts. In particular, some open source pioneers and mentors helped me a great deal (sincerely, thank you so much). The feedback I received from the mailing list on my draft proposal was very useful in creating this final proposal. In the future I also expect to use the mailing lists to clarify issues I find, to receive suggestions and feedback on my work from the expert developers, and to involve them in the design decisions of the project. I also expect to maintain excellent communication with my mentor via email and IM.

    About me

    Hi, my name is Yin Lei. I am a postgraduate student at the University of Science and Technology Beijing, China. My major is computer science and technology. During my six years of Java development experience, Apache has helped me a great deal; many projects such as Struts, Tomcat, Xerces, Xalan, HttpClient, Commons FileUpload, JavaMail and POI have played an important part in my research projects. So I am eager to participate in the open source community and become a long-term committer on this project, and I hope GSoC can help introduce me to open source projects. With this project, I hope to gain a better understanding of the Xerces architecture, to improve my knowledge of it by experimenting with its code base and, above everything, to implement the missing features for it that has just reached its W3C candidate recommendation. At the same time, I hope to improve my programming and communication skills and to learn more about XML and various technologies.

    My work experience and related awards:
    2007.7 - 2008.5 : worked at Sun Microsystems, Inc. as an intern
    2008.7 - 2009.12 : worked at IBM China Development Laboratory as an intern
    2008.9 : won excellent team member of the 2008 IBM blue pathway program
    2009.11: won the Lotus Innovation Award of IBM Asia Pacific

    My experience in open source development:
    My first experience in open source development was building an Eclipse plugin for the Apache SCXML engine, a visualization tool for navigating and editing complex SCXML state description XML. I also attempted to add a new feature to the SCXML engine to make it support multi-threaded operation. However, I have not submitted my code to Commons SCXML, though I would like to. I can code in C++, Java and some scripting languages such as JavaScript and ActionScript. In addition, I am familiar with XML, DOM, SAX, JDOM and Dom4j, and I want to improve existing XML parsing tools through my work. I always use free and open source software in my academic and development work, and I encourage my colleagues to use free software alternatives whenever they can.

    References and Resources

    [1]W3C recommendation specification about DOM Level 3 Load and Save LSParser: http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/load-save.html#LS-LSParser

    [2]W3C recommendation specification about LSParser function parseWithContext: http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/load-save.html#LS-LSParser-parseWithContext

    [3]Apache Xerces2-j Home page: http://xerces.apache.org/xerces2-j

    [4]DOM Level 3 Load and Save limitations about LSParser and parseWithContext:http://xerces.apache.org/xerces2-j/dom3.html

    [5]Document Object Model Level 3 Core recommendation: http://www.w3.org/TR/DOM-Level-3-Core/

    [6]Document Object Model Level 3 Load and Save Recommendation: http://www.w3.org/TR/DOM-Level-3-LS/

     

