[jira] [Commented] (XERCESJ-1205) Entity resolution does not work with DTD grammar caching resolved

Radu Coravu (Jira) Sun, 26 May 2024 08:32:08 -0700


    [ 
https://issues.apache.org/jira/browse/XERCESJ-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849557#comment-17849557
 ]


Radu Coravu commented on XERCESJ-1205:
--------------------------------------

14 years after my last comment on this issue... :)

[~mmazari]

It's hard to contribute fixes to the Xerces project in general; it's not like 
you can open a pull request with a sign-off as in most open source projects.
https://xerces.apache.org/charter.html#CONTRIBUTORS

My advice to you is to create your own fork of the Xerces code, add patches and 
build your own JAR library. I work for Oxygen XML Editor, and Apache Xerces is 
one of the most important libraries (maybe even *the* most important) in our 
application. We ship inside Oxygen a customized Xerces library containing 
various fixes (like this one, or security-related ones). We also have automatic 
tests on our side to at least attempt to cover various problems which might 
appear when the caching is enabled and consecutive XML documents are being 
parsed. 

Lately [~mukulg] has been about the only active contributor to the Apache 
Xerces project, but there is only so much someone can do on an open source 
project, especially if they invest their own personal time in it.

Some more remarks to your questions below:

{quote}The provided solution loads all entities into the entity manager each 
time. In many applications, thousands of entities are present, but only a few 
are utilized. Therefore, I propose loading entities only when necessary.{quote}

How do you know which entities are necessary and which are not? The entity 
references are only found while the file is parsed. So loading them all is the 
only solution I can think of right now.
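
To illustrate why the set of needed entities is not known up front, here is a 
minimal sketch, assuming the JDK's built-in JAXP SAX parser (not a patched 
Xerces build). The class name ReferencedEntities and the sample document are 
made up for illustration. It records entity names as the parser reports them 
through the standard SAX LexicalHandler callback:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper, not part of Xerces.
class ReferencedEntities {

    // Sample document: three entities are declared, only one is referenced.
    static final String SAMPLE =
        "<!DOCTYPE doc ["
      + "<!ENTITY a \"A\"><!ENTITY b \"B\"><!ENTITY c \"C\">"
      + "]><doc>only &b; is used</doc>";

    // Returns the general entities referenced in element content,
    // in the order the parser encounters them.
    static List<String> referenced(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            DefaultHandler2 handler = new DefaultHandler2() {
                @Override
                public void startEntity(String name) {
                    // "[dtd]" denotes the external DTD subset, not a reference.
                    if (!"[dtd]".equals(name)) {
                        names.add(name);
                    }
                }
            };
            reader.setContentHandler(handler);
            reader.setProperty(
                "http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(referenced(SAMPLE));
    }
}
```

The list is only complete once parsing has finished. Note also that SAX does 
not report references that occur inside attribute values, so this is not a 
reliable way to pre-compute the "necessary" entities.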

{quote}I agree with Radu Coravu that the publicId should be used to identify 
the external entity. However, XMLDTDScannerImpl#scanEntityDecl is actually 
using only the publicId without the systemId.{quote}
I'm not sure I said this anywhere on this issue. My suggestion was to cache 
external entities, whether they have public IDs or system IDs specified. 

{quote}XMLDTDScannerImpl#scanEntityDecl is adding unparsed entities that is 
being identified by the notation. However, the provided solutions do not 
consider them.{quote}
So you are saying that the method 
"org.apache.xerces.impl.XMLEntityManager.initFromDTD(DTDGrammar)" which copies 
entities from the grammar should also copy notations, right? Interesting, maybe 
indeed it should.
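
As background on why notations matter here, this is a small sketch, assuming 
the JDK's built-in DOM parser. The class name UnparsedEntityNotation and the 
sample declarations are invented for illustration. It shows that an unparsed 
entity carries only the name of its notation, so the notation declaration has 
to be available separately:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Entity;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Xerces.
class UnparsedEntityNotation {

    // Internal subset with one notation and one unparsed entity using it.
    static final String XML =
        "<!DOCTYPE doc [\n"
      + "  <!ELEMENT doc EMPTY>\n"
      + "  <!NOTATION gif SYSTEM \"image/gif\">\n"
      + "  <!ENTITY logo SYSTEM \"logo.gif\" NDATA gif>\n"
      + "]>\n"
      + "<doc/>";

    // Returns { notation name of entity "logo", whether that notation is declared }.
    static String[] describe() {
        try {
            DocumentType doctype = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)))
                .getDoctype();
            Entity logo = (Entity) doctype.getEntities().getNamedItem("logo");
            // Non-null only for unparsed (NDATA) entities.
            String notation = logo.getNotationName();
            boolean declared =
                doctype.getNotations().getNamedItem(notation) != null;
            return new String[] { notation, String.valueOf(declared) };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] r = describe();
        System.out.println("notation=" + r[0] + " declared=" + r[1]);
    }
}
```

If a cache restored the entity declarations but not the notation declarations, 
the second lookup is the kind of check that would be expected to fail.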

{quote} I'm wondering if we need to reuse the DTD grammar when parsing XML 
files with different internal subsets only. Basically, when caching is turned 
off and parsing begins, if the entity is already loaded, the convention in the 
entity manager is to report a warning in case of a duplicate entity 
duplication.{quote}
In general, such cache changes need to be backed up by automatic tests. So one 
would have to write enough automatic tests to cover a wide range of cases and 
then try to make it all work, because no one wants to enable caching and have 
the parsing break in one of the parsed XML files due to the cache from a 
previous parse operation being used. The Xerces code right now is at least 
safer by not properly implementing the cache: it parses the XML files properly 
(although more slowly).

{quote}The code above uses addExternalEntity function, which will be affected 
by the current entity fCurrentEntity. However, I believe we should consider the 
cached and stored baseSystemID after caching without being affected by the 
current entity in the entity manager when parsing documents.{quote}
I'm not sure. I would probably need to spend an hour or two to understand the 
caching code again (which I'm not really willing to do). Ideally, as I said, we 
would have test cases for each situation: test cases covering the various ways 
in which XML documents refer to DTDs (with public IDs, system IDs, internal DTD 
declarations). And for system IDs, some may be resolved through an XML catalog 
to an absolute DTD location, while others may be relative and resolve relative 
to the XML document (hopefully in rare cases).
So an offered patch (like the ones on this issue) without a robust set of 
automatic tests is not something I would rely on.
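
A rough sketch of what such a test could look like, assuming only the standard 
JAXP API and whichever parser is on the classpath. The class name 
DoctypeVariantsCheck, the DTD content and the example.invalid system ID are all 
made up for illustration. It does not enable grammar caching, which would be 
configured through Xerces-specific settings not shown here. It parses several 
consecutive documents with one reused parser and different DOCTYPE forms:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical test helper, not part of Xerces.
class DoctypeVariantsCheck {

    // Shared DTD, served from memory by the entity resolver below.
    static final String DTD =
        "<!ELEMENT doc (#PCDATA)>\n<!ENTITY company \"Example Corp\">";

    static final String SYSTEM_ID = "http://example.invalid/doc.dtd";

    // Documents are parsed in this order with the same parser instance.
    static final String[] DOCTYPES = {
        // system ID only
        "<!DOCTYPE doc SYSTEM \"" + SYSTEM_ID + "\">",
        // public ID plus system ID
        "<!DOCTYPE doc PUBLIC \"-//EXAMPLE//DTD Doc//EN\" \"" + SYSTEM_ID + "\">",
        // internal subset overriding an entity from the external DTD
        "<!DOCTYPE doc SYSTEM \"" + SYSTEM_ID + "\" "
            + "[<!ENTITY company \"Override Ltd\">]>",
        // system ID only again: the override must not leak into this parse
        "<!DOCTYPE doc SYSTEM \"" + SYSTEM_ID + "\">",
    };

    // Returns the expanded text of <doc>&company;</doc> for each DOCTYPE.
    static List<String> parseAllWithOneParser() {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader(DTD)));
            List<String> results = new ArrayList<>();
            for (String doctype : DOCTYPES) {
                String xml = doctype + "<doc>&company;</doc>";
                results.add(builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent());
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseAllWithOneParser());
    }
}
```

Per the XML specification the first declaration of an entity wins and the 
internal subset is read first, so the third document should expand to 
"Override Ltd" and the other three to "Example Corp". A caching patch would 
need to keep that result with the cache enabled, plus similar cases for 
catalog-resolved and relative system IDs.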

> Entity resolution does not work with DTD grammar caching resolved
> -----------------------------------------------------------------
>
>                 Key: XERCESJ-1205
>                 URL: https://issues.apache.org/jira/browse/XERCESJ-1205
>             Project: Xerces2-J
>          Issue Type: Bug
>          Components: DTD
>    Affects Versions: 2.8.1
>         Environment: JDK1.5. The issue appears on various machines, Windows, 
> Linux, Mac OSX. I don't believe it is platform specific.
>            Reporter: Tin Pavlinic
>            Assignee: Michael Glavassevich
>            Priority: Major
>         Attachments: XERCESJ-1205.patch, XERCESJ-1465.patch, bug.zip, 
> entitypatch-r1813171.patch
>
>
> We have a DTD which defines some entities. We are parsing multiple documents 
> against this DTD. If grammar caching is enabled, the entities are unresolved 
> when the grammar is loaded from the cache, instead of the DTD. 
> It seems that they are cleared every time a document is parsed and are only 
> loaded when a DTD is loaded and not from the cache.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

