Hi Michael,

Thank you so much. I think I have found a way to implement the asynchronous LSParser and parseWithContext for Xerces, and I am sure I can finish it well. Most importantly, I am interested in XML parsing work, and that motivates me. I am really looking forward to becoming one of the Xerces committers :-)

In addition, here is my submitted proposal. If you have time, I would welcome any suggestions from you.

-------------------------------------------------------------------------

Project Title: Implement Xerces' Asynchronous LSParser and parseWithContext
Student Name: Yin Lei
Student Email: [email protected]
Organization/Project: Apache Foundation/Xerces
Assigned Mentor: Michael Glavassevich

Proposal Abstract:

Apache Xerces2 is a powerful XML parser. At present it implements a collection of standard APIs for XML processing. Although Xerces has a functional DOM Level 3 LSParser, a couple of parts of the spec still need to be implemented. This project will provide an asynchronous version of LSParser, which returns from the parse method immediately and builds the DOM tree on another thread, and will implement the function parseWithContext, which allows a document fragment to be parsed and attached to an existing DOM.

Detailed Description:

Apache Xerces-J is a high-performance, standards-compliant processor written in Java for parsing, validating, serializing and manipulating XML documents. It provides a complete implementation of the Document Object Model Level 3 Core and Document Object Model Level 3 Load and Save Recommendations, but Xerces' implementation of LSParser has two limitations (http://xerces.apache.org/xerces2-j/dom3.html):

1. It does not support an asynchronous LSParser, which returns from the parse method immediately and builds the DOM tree on another thread.
2. It does not support the function parseWithContext of the interface LSParser, which parses an XML fragment from a resource identified by an LSInput and inserts the content into an existing document at the position specified with the context and action arguments.

To address these two limitations, I have been studying the W3C Recommendation for LSParser. In the meantime, I downloaded the Xerces2-J source code, imported it into my Eclipse workspace, read it over and over, and considered how to implement these two parts of the specification. I also discussed the subject with the Xerces developers (you helped me a lot, thank you, especially dear Michael Glavassevich). I now have some ideas for a solution and have run some experiments to check them. This is only an overall solution, and I have left out some details.

Interface DOMImplementationLS: the class org.apache.xerces.dom.CoreDOMImplementationImpl implements this interface. As described in the W3C Recommendation, an implementation of DOMImplementationLS should supply a function createLSParser that can create a synchronous LSParser as well as an asynchronous LSParser, but at present we can only get the former from CoreDOMImplementationImpl's createLSParser. So I should fix this problem.

Interface LSParser: the class org.apache.xerces.parsers.DOMParserImpl implements this interface, but it supports the synchronous mode only; its getAsync function simply returns false. Here is my solution for providing an asynchronous version of LSParser.

Step one: DOMParserImpl implements the interface EventTarget as well as the interface LSParser. It uses a Vector object (we name it repository) to store all the event listeners registered on the current LSParser object. Each listener is made up of three parts: type, useCapture and the event handler function. There are only two types of event, load and progress. My next task is to implement the functions addEventListener, dispatchEvent and removeEventListener.

addEventListener: just add an event listener object to repository. We should note that a listener with the same parameters can only be added once.

dispatchEvent: traverse each item of repository; if an item has the same type value as the event and its useCapture value is true, call its handleEvent function.

removeEventListener: traverse each item of repository; if an item is the same as the object passed in the parameter, remove it from repository.

Step two: implement the interface LSLoadEvent. In an asynchronous LSParser, LSLoadEvent is used to signal that the parse function has finished its parsing job. We can achieve this by calling LSParser's dispatchEvent function with the LSLoadEvent as a parameter.

Step three: implement the interface LSProgressEvent. In an asynchronous LSParser, the parse thread will trigger an LSProgressEvent when it finishes parsing an entity node; the triggered LSProgressEvent reports the current parse position to the LSParser. If the parser sees more external resource references, it may also change the totalSize value.

Step four: implement the asynchronous mechanism. DOMParserImpl has an attribute that marks its mode, synchronous or asynchronous. We can get its parse mode from the function getAsync. If the parser is asynchronous, when the LSParser instance's parse() function is called, it sets the busy value to true, starts a Thread to parse the XML document in the LSInput, and then returns null. When the parse thread finishes its job, it sets the busy value to false, creates an LSLoadEvent instance with type load, and calls dispatchEvent(Event evt). If the user has registered any event listener for the load event, dispatchEvent will run the work defined in the listener's handleEvent function.

Function parseWithContext(LSInput input, Node contextArg, short action): this function parses an XML fragment from a resource identified by an LSInput and inserts the content into an existing document at the position specified with the context and action arguments. This XML fragment is a special data structure, so I need to construct a new class named XMLFragment to store it.

Then I should do the following jobs:

Parse the XML fragment into an XMLFragment object and mark whether it is a complete XML document; if any error happens, throw an exception. These classes can help me:

a. org.apache.xerces.impl.XMLDocumentFragmentScannerImpl
b. org.apache.xerces.impl.XMLDocumentScannerImpl.FragmentContentDispatcher
c. org.apache.xerces.impl.XMLEntityScanner

I can use some functions in these classes and start the parsing job by consulting the function startEntity in class org.apache.xerces.impl.XMLDocumentScannerImpl and the function scanDocument in class org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.

Here is the basic implementation idea (in fact, this is a recursive process):

1. Create an XMLFragment instance. It has a very important attribute, which we name fCurrentNode.
2. Start reading characters from the LSInput stream. On a node start character such as "<", go to step 3; on a node end character such as "/", go to step 4; on the end-of-file character, go to step 5.
3. A begin EVENT happens. Usually, instantiate a Node object, append it to fCurrentNode's child node list, change fCurrentNode to this Node, then go to step 2.
4. An end EVENT happens. Usually, we should change fCurrentNode to its parent node, then go to step 2.
5. The parsing job ends.

Start adding the XMLFragment at the place indicated by the parameter action. In this phase there are many validation jobs to do, covering four aspects: Basic Validation, Namespace Validation, DTD Validation and Schema Validation.

Basic Validation: I should validate whether this merge is legal. For example, if the context Node is the document root element and the parameter action is not ACTION_REPLACE_CHILDREN, an error should be thrown. I should also confirm that the merged XML document is well-formed; for example, the DOM should have only one root element, entity declarations must be at the beginning of the document, etc.

Namespace Validation: I should validate both the element namespaces and the attribute namespaces of the merged XML document.

DTD Validation:

a. Validate whether the merged XML document conforms to the element type declarations.
b. Validate whether the merged XML document conforms to the entity declarations.
c. Validate whether the merged XML document conforms to the attribute declarations. For example, if the DTD includes default attribute declarations, I should add the default attributes to the elements that are roots in the LSInput fragment.

Schema Validation: this section includes the validation demands of DTD Validation, and it has some further validation requirements:

- Validate the data types of elements and attributes
- Validate the three kinds of annotation declarations

If everything is OK, return the result Node; otherwise, if an error occurs, the caller is notified through the ErrorHandler instance associated with the "error-handler" parameter of the DOMConfiguration. As the new data is inserted into the document, at least one mutation event is fired per new immediate child or sibling of the context node.

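To make Step four of the asynchronous design a little more concrete, here is a minimal Java sketch. It is only an illustration under my own assumptions: the class AsyncParseSketch, the LoadListener interface and the parseAsync method are hypothetical names that do not exist in Xerces, and the sketch uses the JDK's DocumentBuilder and CompletableFuture in place of DOMParserImpl, LSLoadEvent and a hand-started Thread.

```java
import java.io.StringReader;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch; none of these names exist in Xerces.
class AsyncParseSketch {

    /** Hypothetical stand-in for an EventListener registered for "load". */
    interface LoadListener {
        void handleLoad(Document document);
    }

    // The "repository" of listeners; thread-safe because dispatch runs on the parse thread.
    private final CopyOnWriteArrayList<LoadListener> repository = new CopyOnWriteArrayList<>();
    private volatile boolean busy;

    /** Registers a listener; the same listener can only be added once. */
    void addLoadListener(LoadListener listener) {
        repository.addIfAbsent(listener);
    }

    void removeLoadListener(LoadListener listener) {
        repository.remove(listener);
    }

    boolean getBusy() {
        return busy;
    }

    /** Returns immediately; the DOM tree is built on another thread. */
    CompletableFuture<Document> parseAsync(String xml) {
        busy = true;
        return CompletableFuture.supplyAsync(() -> {
            Document document;
            try {
                document = parseXml(xml);
            } finally {
                busy = false; // cleared before the load notification, as in Step four
            }
            for (LoadListener listener : repository) {
                listener.handleLoad(document);
            }
            return document;
        });
    }

    // Synchronous parse with the JDK parser, wrapping checked exceptions.
    private static Document parseXml(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        AsyncParseSketch parser = new AsyncParseSketch();
        parser.addLoadListener(doc ->
                System.out.println("loaded <" + doc.getDocumentElement().getTagName() + ">"));
        parser.parseAsync("<root><child/></root>").join(); // wait only for the demo
        System.out.println("busy after load: " + parser.getBusy());
    }
}
```

In the real LSParser the parse() call would return null and the application would learn about completion only through the load event; the returned future here is just a convenience for waiting in the demo.
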
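The fCurrentNode idea from the five parsing steps can be sketched in a similar way. This is only a rough illustration and not the Xerces scanner code: FragmentBuilderSketch and its methods are hypothetical names, a DOM DocumentFragment stands in for the planned XMLFragment class, the JDK's SAX parser supplies the begin and end events instead of reading "<" and "/" by hand, the fragment is wrapped in a made-up <wrapper> element so that several top-level nodes are accepted, and namespaces are ignored.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; a DocumentFragment stands in for the planned XMLFragment class.
class FragmentBuilderSketch extends DefaultHandler {
    private final Document owner;
    private final DocumentFragment fragment;
    private Node fCurrentNode; // node that receives the next child
    private int depth;         // 0 means outside the artificial wrapper element

    FragmentBuilderSketch(Document owner) {
        this.owner = owner;
        this.fragment = owner.createDocumentFragment();
        this.fCurrentNode = fragment;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth++ == 0) {
            return; // skip the wrapper itself
        }
        // Begin event: create a node, append it, and descend into it.
        Element element = owner.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            element.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        fCurrentNode.appendChild(element);
        fCurrentNode = element;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // End event: go back to the parent, except when closing the wrapper.
        if (--depth > 0) {
            fCurrentNode = fCurrentNode.getParentNode();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        fCurrentNode.appendChild(owner.createTextNode(new String(ch, start, length)));
    }

    /** Parses fragmentXml into a DocumentFragment owned by the given document. */
    static DocumentFragment build(String fragmentXml, Document owner) {
        FragmentBuilderSketch handler = new FragmentBuilderSketch(owner);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader("<wrapper>" + fragmentXml + "</wrapper>")),
                    handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("fragment is not well-formed", e);
        }
        return handler.fragment;
    }

    /** Creates an empty document to own the fragment nodes. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DocumentFragment built = build("<a x=\"1\"><b>hi</b></a><c/>", newDocument());
        System.out.println("top-level nodes: " + built.getChildNodes().getLength());
    }
}
```
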
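Finally, the merge step of parseWithContext can be illustrated with plain DOM calls. Again this is only a sketch under my own assumptions and not the planned Xerces implementation: ParseWithContextSketch, insert and parseDocument are hypothetical names, the fragment is parsed by wrapping it in a made-up <wrapper> element, no DTD or schema validation is done, a Document context is not handled, and returning the first inserted node is simply this sketch's choice. Only the ACTION_* constants come from org.w3c.dom.ls.LSParser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.ls.LSParser;
import org.xml.sax.InputSource;

// Hypothetical sketch of the merge step; not the Xerces implementation.
class ParseWithContextSketch {

    /** Parses a whole document with the JDK parser, wrapping checked exceptions. */
    static Document parseDocument(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed", e);
        }
    }

    /** Inserts the parsed fragment relative to context and returns the first new node. */
    static Node insert(String fragmentXml, Node context, short action) {
        Document target = context.getOwnerDocument();
        if (target == null) {
            throw new IllegalArgumentException("this sketch does not handle a Document context");
        }
        // Parse the fragment inside an artificial wrapper, then copy its children over.
        Document parsed = parseDocument("<wrapper>" + fragmentXml + "</wrapper>");
        DocumentFragment fragment = target.createDocumentFragment();
        for (Node n = parsed.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            fragment.appendChild(target.importNode(n, true));
        }
        Node first = fragment.getFirstChild();
        Node parent = context.getParentNode();
        boolean siblingAction = action == LSParser.ACTION_INSERT_BEFORE
                || action == LSParser.ACTION_INSERT_AFTER
                || action == LSParser.ACTION_REPLACE;
        if (siblingAction && parent == null) {
            throw new IllegalArgumentException("context has no parent for this action");
        }
        switch (action) {
            case LSParser.ACTION_APPEND_AS_CHILDREN:
                context.appendChild(fragment);
                break;
            case LSParser.ACTION_REPLACE_CHILDREN:
                while (context.getFirstChild() != null) {
                    context.removeChild(context.getFirstChild());
                }
                context.appendChild(fragment);
                break;
            case LSParser.ACTION_INSERT_BEFORE:
                parent.insertBefore(fragment, context);
                break;
            case LSParser.ACTION_INSERT_AFTER:
                // A null reference node makes insertBefore append at the end.
                parent.insertBefore(fragment, context.getNextSibling());
                break;
            case LSParser.ACTION_REPLACE:
                parent.replaceChild(fragment, context);
                break;
            default:
                throw new IllegalArgumentException("unknown action " + action);
        }
        return first;
    }

    public static void main(String[] args) {
        Document doc = parseDocument("<root><a/><c/></root>");
        Node a = doc.getDocumentElement().getFirstChild();
        Node first = insert("<b/>", a, LSParser.ACTION_INSERT_AFTER);
        System.out.println("inserted <" + first.getNodeName() + "> after <a>");
    }
}
```

The DOM itself rejects illegal merges here, for example a second element child of the Document node, by throwing a DOMException; the Basic, Namespace, DTD and Schema validation described above would go beyond that.
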
Additional Information:

My development plan:

1st week of 1st month (May 24 - Jun 1): Read the Xerces-J source code and get familiar with its architecture, so that what I do will comply with its philosophy.
2nd week of 1st month (Jun 1 - Jun 8): Make some changes to DOMImplementationLS and DOMParserImpl, so that DOMImplementationLS can create an asynchronous LSParser, and add some basic attributes to DOMParserImpl such as the asynchronous flag.
3rd week of 1st month (Jun 9 - Jun 16): Restructure DOMParserImpl to implement the interface EventTarget; implement addEventListener, dispatchEvent and removeEventListener.
4th week of 1st month (Jun 17 - Jun 24): Implement LSParser's parse() and parseURI() functions with asynchronous support; implement the LSParser functions abort(), getAsync() and getBusy().
1st week of 2nd month (Jun 25 - Jul 2): Implement the interfaces LSLoadEvent and LSProgressEvent; finish the whole asynchronous parse cycle and some unit tests.
2nd week of 2nd month (Jul 3 - Jul 10): Finish the first sub-task of parseWithContext(): parse the LSInput into an XMLFragment instance.
3rd week of 2nd month (Jul 11 - Jul 18): Start merging the context Node and the XML fragment; finish Basic Validation and Namespace Validation.
4th week of 2nd month (Jul 19 - Jul 26): Finish merging the context DOM tree and the XMLFragment; finish DTD Validation and Schema Validation.
1st week of 3rd month (Jul 27 - Aug 3): Test my asynchronous LSParser and the function parseWithContext.
Last 2 weeks of 3rd month (Aug 3 - Aug 20): Submit all code and documents.

Who am I?

Hi everyone, my name is Yin Lei. I am a postgraduate student at the University of Science and Technology Beijing, China. My major is computer science and technology. During my six years of Java development experience, Apache has helped me a great deal; many projects such as Struts, Tomcat, Xerces, Xalan, HttpClient, Commons FileUpload, JavaMail and POI have played an important part in my research projects.
So I am eager to participate in the open source community and become a long-term committer of the project. In my daily work I use Xerces as my XML parser, so I have found what it lacks and want to improve it to make it perfect :-)

My work experience and related awards:

2007.7 - 2008.5: worked at Sun Microsystems, Inc. as an intern
2008.7 - 2009.12: worked at IBM China Development Laboratory as an intern
2008.9: won excellent team member of the 2008 IBM blue pathway program
2009.11: won the Lotus Innovation Award of IBM Asia Pacific

I have also done some open source work before. My first experience in open source development was building an Eclipse plugin for the Apache SCXML engine, and I also attempted to add a new feature to the SCXML engine to make it support multi-threaded operation. I can code in C++, Java and some scripting languages such as JavaScript and ActionScript. In addition, I am familiar with XML, DOM, SAX, JDOM and Dom4j, and I want to improve existing XML parsing tools through my work.

2010-03-30

xiaohei.leiyin

From: Michael Glavassevich
Sent: 2010-03-30 20:10:21
To: xiaohei.leiyin
Cc:
Subject: Re: GSoC proposal about "Asynchronous LSParser and parseWithContext "

Hi Yin,

Yes, that's fine. If your proposal is accepted for GSoC I would mentor you and I think that's what they're looking for there on the GSoC site.

There are usually several hundred proposals submitted to Apache every year for the various projects across the organization. It can be very competitive depending on the number of spots that Google actually awards to Apache and the number of good proposals submitted by students. I wish you good luck in the selection process.

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [email protected]
E-mail: [email protected]

"xiaohei.leiyin" <[email protected]> wrote on 03/30/2010 01:02:31 AM:

> Dear Michael,
>
> I have modified my proposal following your advice and submitted it on the
> GSoC web site. I noticed that there is an item "Organization/
> Project:Assigned Mentor:" in the content section of the proposal
> submission page. So, can I fill it in as "Organization/Project:Apache
> Foundation/Xerces Assigned Mentor:Michael Glavassevich"? Is that OK?
> I mean, can I name you as my assigned mentor? If you think it
> is OK, I will keep it; if you do not like it for some
> reason, please let me know and I will change it (that is fine, I must
> respect you; in Chinese culture you are already my teacher, and respecting
> teachers is a long-standing and well-established Chinese tradition. We
> call it 尊师重教 [zun shi zhong jiao]).
>
> During these days, while I was studying the Xerces architecture and
> working out how to implement the asynchronous LSParser and
> parseWithContext for Xerces, I found that I learned a lot and made
> real progress. When I came across difficulties, you
> helped me a lot; in fact, you are my mentor in my heart. Thank you
> so so so so much! I think I have gained knowledge whether or not GSoC
> accepts my proposal. I will finish this project; once I have begun
> it, I want to finish it, for you, for open source.
>
> Your student: Yin Lei from China
>
> 2010-03-30
>
> xiaohei.leiyin
