[
https://issues.apache.org/jira/browse/XERCESJ-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849557#comment-17849557
]
Radu Coravu commented on XERCESJ-1205:
--------------------------------------
14 years after my last comment on this issue... :)
[~mmazari]
It's hard to contribute fixes to the Xerces project in general; it's not like
you can open a pull request with a signoff as in most open source projects.
https://xerces.apache.org/charter.html#CONTRIBUTORS
My advice to you is to create your own fork of the Xerces code, add patches and
build your own JAR library. I work for Oxygen XML Editor. Apache Xerces is one
of the most important libraries (perhaps *the* most important) in our
application, and we ship inside Oxygen a customized Xerces library containing
various fixes (like this one, or security-related ones). We also have automatic
tests on our side to at least attempt to cover various problems which might
appear when caching is enabled and consecutive XML documents are parsed.
Lately [~mukulg] has been about the only active contributor to the Apache
Xerces project, but there is only so much one person can do on an open source
project, especially when investing their own personal time in it.
Some more remarks to your questions below:
{quote}The provided solution loads all entities into the entity manager each
time. In many applications, thousands of entities are present, but only a few
are utilized. Therefore, I propose loading entities only when necessary.{quote}
How do you know which entities are necessary? The entity references are found
only while the file is parsed, so loading them all is the only solution I can
think of right now.
{quote}I agree with Radu Coravu that the publicId should be used to identify
the external entity. However, XMLDTDScannerImpl#scanEntityDecl is actually
using only the publicId without the systemId.{quote}
I'm not sure I said this anywhere on this issue. My suggestion was to cache
external entities, whether they have public or system IDs specified.
{quote}XMLDTDScannerImpl#scanEntityDecl is adding unparsed entities that is
being identified by the notation. However, the provided solutions do not
consider them.{quote}
So you are saying that the method
"org.apache.xerces.impl.XMLEntityManager.initFromDTD(DTDGrammar)" which copies
entities from the grammar should also copy notations, right? Interesting, maybe
indeed it should.
{quote} I'm wondering if we need to reuse the DTD grammar when parsing XML
files with different internal subsets only. Basically, when caching is turned
off and parsing begins, if the entity is already loaded, the convention in the
entity manager is to report a warning in case of a duplicate entity
duplication.{quote}
In general, such cache changes need to be backed up by automatic tests. So one
would have to write enough automatic tests to cover a wide range of cases and
then try to make it all work, because no one wants to enable caching and have
the parsing break in one of the parsed XML files due to the cache from a
previous parse operation being used. The Xerces code right now is at least
safer for not properly implementing the cache: it parses the XML files
correctly (although more slowly).
{quote}The code above uses addExternalEntity function, which will be affected
by the current entity fCurrentEntity. However, I believe we should consider the
cached and stored baseSystemID after caching without being affected by the
current entity in the entity manager when parsing documents.{quote}
I'm not sure. I would probably need to spend an hour or two to understand the
caching code again (which I'm not really willing to do). Ideally, as I said, we
would have test cases for each situation: test cases covering the various ways
in which XML documents refer to DTDs (with public IDs, system IDs, internal DTD
declarations). For system IDs, again, some may be resolved through an XML
catalog to an absolute DTD location, while some may be relative and resolve
relative to the XML document (hopefully in rare cases).
So an offered patch (like the ones on this issue) without a robust set of
automatic tests is not something I would consider reliable.
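A minimal sketch of what one such test could look like, assuming only the JAXP API bundled with the JDK. It reuses a single DocumentBuilder for consecutive documents that reach the same DTD through a public ID, a system ID, and an internal subset, and collects the expanded text of each. The class name, the DTD content, the public ID and the `doc.dtd` system ID are all hypothetical, and the sketch does not enable Xerces grammar caching; a real regression test would run the same documents against Apache Xerces with the grammar pool configured.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class ConsecutiveParseCheck {
    // Hypothetical shared DTD declaring a single entity.
    static final String DTD =
        "<!ELEMENT doc (#PCDATA)>\n<!ENTITY company \"Example Corp\">";

    // Document referring to the DTD with a public ID and a system ID.
    static final String PUBLIC_DOC =
        "<!DOCTYPE doc PUBLIC \"-//EXAMPLE//DTD Doc//EN\" \"doc.dtd\">"
        + "<doc>&company;</doc>";
    // Document referring to the DTD with a relative system ID only.
    static final String SYSTEM_DOC =
        "<!DOCTYPE doc SYSTEM \"doc.dtd\"><doc>&company; again</doc>";
    // Document declaring the entity in its internal subset, with another value.
    static final String INTERNAL_DOC =
        "<!DOCTYPE doc [<!ELEMENT doc (#PCDATA)>"
        + "<!ENTITY company \"Internal Inc\">]><doc>&company;</doc>";

    // Parses the documents one after another with the SAME builder and
    // returns the text content of each root element.
    static List<String> parseAll(String... docs) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Serve the DTD from memory so that no file or network access is needed.
            builder.setEntityResolver((publicId, systemId) ->
                systemId != null && systemId.endsWith("doc.dtd")
                    ? new InputSource(new StringReader(DTD))
                    : null);
            List<String> texts = new ArrayList<>();
            for (String xml : docs) {
                texts.add(builder.parse(new InputSource(new StringReader(xml)))
                        .getDocumentElement().getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            parseAll(PUBLIC_DOC, SYSTEM_DOC, INTERNAL_DOC, PUBLIC_DOC));
    }
}
```

Parsing PUBLIC_DOC again after INTERNAL_DOC is deliberate: it checks that an entity value from a previous parse operation does not leak into the next one, which is the kind of breakage a cache change could introduce.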
> Entity resolution does not work with DTD grammar caching resolved
> -----------------------------------------------------------------
>
> Key: XERCESJ-1205
> URL: https://issues.apache.org/jira/browse/XERCESJ-1205
> Project: Xerces2-J
> Issue Type: Bug
> Components: DTD
> Affects Versions: 2.8.1
> Environment: JDK1.5. The issue appears on various machines, Windows,
> Linux, Mac OSX. I don't believe it is platform specific.
> Reporter: Tin Pavlinic
> Assignee: Michael Glavassevich
> Priority: Major
> Attachments: XERCESJ-1205.patch, XERCESJ-1465.patch, bug.zip,
> entitypatch-r1813171.patch
>
>
> We have a DTD which defines some entities. We are parsing multiple documents
> against this DTD. If grammar caching is enabled, the entities are unresolved
> when the grammar is loaded from the cache, instead of the DTD.
> It seems that they are cleared every time a document is parsed and are only
> loaded when a DTD is loaded and not from the cache.