XML has become so widespread that its memory binding (programming model) is hard to ignore. Current memory bindings such as JavaBean, Service Data Objects and Eclipse Modeling Framework have room to improve efficiency by streaming data.

Index

1. Many people know DOM is much less efficient than SAX
2. Current memory bindings are as inefficient as DOM even if SAXed
3. While SAX/DOM pushes, StAX pulls, which offers an opportunity to load on demand
4. Loading on demand improves efficiency, a lot
   - *ZERO* cost scenario
     - Execution Path
     - Update only
     - Insert only
   - Lower cost scenario
5. Streaming Object
6. Streaming List
7. Modeling Frameworks
   - Modeling-neutral StreamReader
   - JavaBean
   - Service Data Objects
   - Eclipse Modeling Framework
8. Implementation
   - StreamObject & StreamList injection
   - Code Generation (static object)
   - Dynamic object
   - Concurrent access

*Many people know DOM is much less efficient than SAX*

A Document Object Model (org.w3c.dom) tree is fully populated before it becomes available, which costs both time and space (memory). From that perspective it is much less efficient than the Simple API for XML (org.xml.sax).
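
To make the DOM versus SAX contrast concrete, here is a minimal sketch that counts the Property2 elements of a small document both ways. The class name DomVersusSax, the helper names and the sample XML are assumptions made for illustration only; only the standard JAXP APIs are used.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any framework discussed in this post.
final class DomVersusSax {
    static final String XML =
        "<Product><Property1>1</Property1>"
        + "<Property2>2.1</Property2><Property2>2.2</Property2></Product>";

    // DOM: the whole tree is built in memory before the first node can be inspected.
    static int countWithDom(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName(name).getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // SAX: the parser pushes events to the handler; only the counter is retained.
    static int countWithSax(String xml, String name) {
        final int[] count = {0};
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if (name.equals(qName)) {
                            count[0]++;
                        }
                    }
                });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println(countWithDom(XML, "Property2"));
        System.out.println(countWithSax(XML, "Property2"));
    }
}
```

Both methods return the same count, but the DOM version holds every node of the document in memory while the SAX version holds only one integer.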

*Current memory bindings are as inefficient as DOM even if SAXed*

Current memory bindings such as JavaBean, SDO and EMF are also fully populated before they become available, even if SAX or even StAX is used to populate data from XML into the memory data structure. From that perspective they are as inefficient as DOM, because of the cost in both time and space (memory).

*While SAX/DOM PUSHes, StAX PULLs which offers an opportunity to load on demand*

The Streaming API for XML (javax.xml.stream) works in the opposite direction from SAX/DOM in terms of what drives the processing. While a SAX/DOM parser drives the processing and *PUSH*es data from XML to handlers or directly into the memory data structure, StAX processing is driven by demand, and demand *PULL*s data out of XML. This offers an opportunity to load on demand if the memory bindings themselves drive the StAX processing.

*Loading on demand improves efficiency, a lot*

- *ZERO* cost scenario

  - Execution Path

        execute (Order order, Product fromUpStream) {
            if( order.paid() ) {
                fromUpStream.get...
                fromUpStream.set...
                toDownStream( fromUpStream);
            } else
                toDownStream( fromUpStream);
                /* "fromUpStream" does NOT need to be read and parsed at all,
                   the data can be DIRECTLY PIPED to down stream,
                   NEITHER time NOR space(memory) cost at all */
        }

  - Update only

        <complexType name="Product">
          <sequence>
            <element name="Property1" type="int"/>
            <element name="Property2" type="float" maxOccurs="unbounded"/>
            ...
            <element name="Property100" type="date"/>
          </sequence>
        </complexType>

    Given that definition and this instance:

        <Product>
          <Property1>1</Property1>
          <Property2>2.1</Property2>
          ...
          <Property2>2.2000000</Property2>
        </Product>

    and this code:

        execute (Product fromUpStream) {
            fromUpStream.setProperty100( "2006-06-25");
            toDownStream( fromUpStream);
        }

    *"fromUpStream" does NOT need to be read and parsed at all, the data can be PIPED to down stream with "<Property100>2006-06-25</Property100>" inserted, at NEITHER time NOR space(memory) cost.*

    A more interesting scenario is, given the same instance and this code:

        execute (Product fromUpStream) {
            fromUpStream.setProperty1( "3");
            toDownStream( fromUpStream);
        }

    *"fromUpStream" does NOT need to be read and parsed at all, the data can be PIPED to down stream with "1" ignored and replaced with "3", at NEITHER time NOR space(memory) cost.*

  - (Collection) Append only

    Given the above definition and this instance:

        <Product>
          <Property2>2.1</Property2>
          ...
          <Property2>2.2000000</Property2>
        </Product>

    and this code:

        execute (Product fromUpStream) {
            fromUpStream.getProperty2().add( 2.2000001);
            toDownStream( fromUpStream);
        }

    *"fromUpStream" does NOT need to be read and parsed at all, the data can be PIPED to down stream with "<Property2>2.2000001</Property2>" inserted, at NEITHER time NOR space(memory) cost.*

- Lower cost scenario

  Many people know XML is string based (human readable), while a memory binding is binary. The binding has TWO stages:

  1. READ the literal string out of XML
  2. PARSE the literal string to binary

  The parsing always costs some time, and sometimes space (memory), depending on complexity and algorithm.

  Given the above definition and this instance:

      <Product>
        <Property1>3</Property1>
        <Property2>2.0</Property2>
        <Property2>2.1</Property2>
        ...
        <Property2>2.2000000</Property2>
        <Property100>2006-06-25</Property100>
      </Product>

  and this code:

      execute (Product fromUpStream) {
          fromUpStream.getProperty2().get( 1);
          fromUpStream.getProperty1();
          toDownStream( fromUpStream);
      }

  Since Property2[1] is demanded, the XML instance can be read through "<Property2>2.1</Property2>" and the literal string ("2.1") can be parsed into memory before returning the binary (float). The literal string ("2.1") itself can also be weakly cached to speed up XML exporting if Property2[1] is not changed again.

  - *The rest of "fromUpStream" does NOT need to be read and parsed at all, it can be PIPED to down stream; both time and space(memory) are spared, sometimes a lot.*
  - Since the XML processing is streaming instead of random access, the data ahead of Property2[1] is read and the literal strings can be stored, however *parsing is NOT required right away; parsing time, and parsing space(memory) if any, can be spared if the value is NEVER demanded*. Later on, whenever Property1 or Property2[0] is demanded, the stored literal string can then be parsed into memory before returning the binary. The literal string storage can then become a weak cache to speed up XML exporting if the property is not changed again. Any further change to the property can invalidate the weak cache to release space (memory) proactively.

  *The cached literal strings can spare some time in XML exporting without sacrificing space(memory), since the references are weak (Java). The stored literal strings (of properties whose values are never demanded) can also spare some time in XML exporting.* As for space (memory) gain or loss, it is case by case, since some binaries are smaller than their literal representation while others are larger.

  Property accesses include "isSet" and "unset", besides "get" and "set". While "get" demands reading and parsing, "isSet" only needs reading and can defer parsing, which may never be demanded.
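
The lower cost scenario can be sketched with StAX roughly as follows. This is only a minimal illustration under stated assumptions, not the prototype: the class LazyProduct, its pullNext method and its elementsRead counter are hypothetical names, the Product layout is hard-coded, and neither the weak cache nor the piping of the unread remainder to down stream is shown.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical on-demand binding for the Product type; names are illustrative only.
final class LazyProduct {
    static final String SAMPLE =
        "<Product><Property1>3</Property1><Property2>2.0</Property2>"
        + "<Property2>2.1</Property2><Property2>2.2</Property2>"
        + "<Property100>2006-06-25</Property100></Product>";

    private final XMLStreamReader reader;
    private String property1Literal;                 // read, but not parsed yet
    private Integer property1;                       // parsed only when demanded
    private final List<String> property2Literals = new ArrayList<>();
    int elementsRead;                                // instrumentation for the demo

    LazyProduct(String xml) {
        try {
            reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Pulls ONE more property element out of the stream and stores its literal unparsed.
    private boolean pullNext() {
        try {
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = reader.getLocalName();
                if (name.equals("Product")) {
                    continue;                        // the container element itself
                }
                String literal = reader.getElementText();
                elementsRead++;
                if (name.equals("Property1")) {
                    property1Literal = literal;
                } else if (name.equals("Property2")) {
                    property2Literals.add(literal);
                }
                return true;
            }
            return false;                            // stream exhausted
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // The demand drives the reading: pull only until the requested index is reached.
    float getProperty2(int index) {
        while (property2Literals.size() <= index && pullNext()) { }
        return Float.parseFloat(property2Literals.get(index));
    }

    // The literal may already be stored; parsing is deferred until this first call.
    int getProperty1() {
        while (property1Literal == null && pullNext()) { }
        if (property1 == null) {
            property1 = Integer.parseInt(property1Literal);
        }
        return property1;
    }

    public static void main(String[] args) {
        LazyProduct product = new LazyProduct(SAMPLE);
        System.out.println(product.getProperty2(1));
        System.out.println(product.getProperty1());
        System.out.println("elements read: " + product.elementsRead);
    }
}
```

With the sample instance, demanding Property2[1] pulls only the first three property elements; the later getProperty1() call reads nothing more and merely parses the literal that was stored on the way.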

*Streaming Object*

Loading on demand is driven by the memory binding, however streaming reading may reach other data before the demanded one, so the streaming reader (StreamReader) needs to report the literal strings it has reached which are not demanded yet. Here's the protocol which can be used to communicate:

    interface StreamObject<Type,Property,C>
    {
        Object get (int propertyID); // StreamList
        Type getType();
        List<Property> getInstanceProperties();
        C getContainer(); // StreamObject<Type,Property,?>

        void set (StreamReader<Type,Property> reader);
        StreamObject<Type,Property,?> createUnlessRead (int propertyID,QName typeXSI,Type type);
        void setUnlessRead (int propertyID,String stringPropertyValue);
        void setLiteralValue (int propertyID,QName typeXSI,String value);
        Object parseLiteralValue (int propertyID,QName typeXSI,String value,Type type);
    }

*Streaming List*

Loading on demand is driven by StreamObject, however StreamReader may reach maxOccurs>1 property value(s) before the demanded one, so the StreamReader needs to report the literal strings it has reached which are not demanded yet.
Here's the protocol which can be used to communicate:

    interface StreamList<Type>
    {
        void addStreamValue (Object value);
        void addLiteralValue (QName typeXSI,String value);
        Object parseLiteralValue (QName typeXSI,String value,Type type);
    }

*Modeling Frameworks*

- Modeling-neutral StreamReader

  There are many Modeling Frameworks. In order for StreamReader to support as many of them as possible, here's a Modeling Framework adapter protocol:

      interface ModelingFramework<Type,Property>
      {
          Type type (Property property);
          boolean many (Property property);
          Collection getAliasNames (Property property);
          Class getInstanceClass (Type type);
          List<Property> properties (Type type);

          Property element (String space,String name);
          Object getNameSpace (Property property);
          Object getLocalName (Property property);

          enum PropertyKind { ELEMENT, ATTRIBUTE, OTHER }
          PropertyKind kind (Property property);

          int property (Type type,List<Property> properties,String space,String name,boolean element);

          StreamObject<Type,Property,?> create (String space,String name);
          StreamObject<Type,Property,?> create (Type type);
      }

- JavaBean

      class JavaBeans implements ModelingFramework<Class,PropertyDescriptor>
      {
          public final /*many*/ Class type (PropertyDescriptor property) {
              return property.getPropertyType();
          }
          public boolean many (PropertyDescriptor property) {
              return List.class.isAssignableFrom( type( property));
          }
          public Collection getAliasNames (PropertyDescriptor property) { //TODO cache
              return Collections.singleton( property.getName());
          }
          public Class getInstanceClass (Class type) {
              return type;
          }
          public List<PropertyDescriptor> properties (Class type) { //TODO cache
              try {
                  return Arrays.asList( Introspector.getBeanInfo( type).getPropertyDescriptors());
              } catch(IntrospectionException e) {}
              return Collections.emptyList();
          }

          public StreamObject<Class,PropertyDescriptor,?> create (Class type) {
              try {
                  return (StreamObject<Class,PropertyDescriptor,?>)type.newInstance();
              } catch(Exception e) {}
              return null;
          }
      }

- Service Data Objects

      class SDO implements ModelingFramework<Type,Property>
      {
          public Type type (Property property) {
              return property.getType();
          }
          public boolean many (Property property) {
              return property.isMany();
          }
          public Collection getAliasNames (Property property) {
              return property.getAliasNames();
          }
          public Class getInstanceClass (Type type) {
              return type.getInstanceClass();
          }
          public List<Property> properties (Type type) {
              return type.getProperties();
          }

          public Property element (String space,String name) {
              return XSDHelper.INSTANCE.getGlobalProperty( space, name, true);
          }
          public final Object getNameSpace (Property property) {
              return XSDHelper.INSTANCE.getNamespaceURI( property);
          }
          public final Object getLocalName (Property property) {
              return XSDHelper.INSTANCE.getLocalName( property);
          }

          public PropertyKind kind (Property property) {
              return XSDHelper.INSTANCE.isElement( property) ? PropertyKind.ELEMENT
                   : XSDHelper.INSTANCE.isAttribute( property) ? PropertyKind.ATTRIBUTE
                   : PropertyKind.OTHER;
          }

          public StreamObject<Type,Property,?> create (String space,String name) {
              return (StreamObject<Type,Property,?>)DataFactory.INSTANCE.create( space, name);
          }
          public StreamObject<Type,Property,?> create (Type type) {
              return (StreamObject<Type,Property,?>)DataFactory.INSTANCE.create( type);
          }
      }

- Eclipse Modeling Framework

      class EMF implements ModelingFramework<EClassifier,EStructuralFeature>
      {
          public EClassifier type (EStructuralFeature property) {
              return property.getEType();
          }
          public boolean many (EStructuralFeature property) {
              return property.isMany();
          }
          public Collection getAliasNames (EStructuralFeature property) { //TODO cache
              return Collections.singleton( property.getName());
          }
          public Class getInstanceClass (EClassifier type) {
              return type.getInstanceClass();
          }
          public List<EStructuralFeature> properties (EClassifier type) {
              return ((EClass)type).getEAllStructuralFeatures();
          }

          public EStructuralFeature element (String space,String name) {
              return ExtendedMetaData.INSTANCE.getElement( space, name);
          }
          public final Object getNameSpace (EStructuralFeature property) {
              return ExtendedMetaData.INSTANCE.getNamespace( property);
          }
          public final Object getLocalName (EStructuralFeature property) {
              return ExtendedMetaData.INSTANCE.getName( property);
          }

          public PropertyKind kind (EStructuralFeature property) {
              switch( ExtendedMetaData.INSTANCE.getFeatureKind( property) ) {
                  case ExtendedMetaData.ELEMENT_FEATURE:
                      return PropertyKind.ELEMENT;
                  case ExtendedMetaData.ATTRIBUTE_FEATURE:
                      return PropertyKind.ATTRIBUTE;
              }
              return PropertyKind.OTHER;
          }

          public int property (EClassifier type,List<EStructuralFeature> properties,String space,String name,boolean element) {
              final EStructuralFeature property = element
                  ? ExtendedMetaData.INSTANCE.getElement( (EClass)type, space, name)
                  : ExtendedMetaData.INSTANCE.getAttribute( (EClass)type, space, name);
              return null == property ? -1 : property.getFeatureID();
          }

          public StreamObject<EClassifier,EStructuralFeature,?> create (String space,String name) {
              return (StreamObject<EClassifier,EStructuralFeature,?>)PackageFactory.create( space, name);
          }
          public StreamObject<EClassifier,EStructuralFeature,?> create (EClassifier type) {
              return (StreamObject<EClassifier,EStructuralFeature,?>)EcoreUtil.create( (EClass)type);
          }
      }

*Implementation*

- StreamObject & StreamList injection

  For existing code, if changing it to support StreamObject & StreamList isn't desired, injection may be utilized.

- Code Generation (static object)

  Code can be regenerated, or new code can be generated, to support StreamObject & StreamList.

- Dynamic object

  Memory bindings such as Service Data Objects and Eclipse Modeling Framework enable dynamic objects besides the static ones (CodeGen). Their implementation can be extended to support StreamObject & StreamList.

- Concurrent access

  Since StreamObject & StreamList load on demand, synchronization may be necessary for concurrent accesses. There may also be multiple objects loading from one stream, so the synchronization may need to take the shared stream into account.

Comments are very welcome. If you happen to find this interesting, I can also post/wiki the prototype. Help will be very much appreciated, especially in areas such as code injection, a JavaBean ModelingFramework implementation conforming to JAXB, and test cases demonstrating the performance gain from loading on demand.

-- Yang ZHONG