[
https://issues.apache.org/jira/browse/DAFFODIL-2918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Lawrence updated DAFFODIL-2918:
-------------------------------------
Description:
The SchemaFileLocation class contains diagnostic information, including
line/column number, a file path used for diagnostics, a URI, etc.
DAFFODIL-2195 made changes so that the file path used for diagnostics is
depersonalized and should be reproducible. However, the uriString member in
the SchemaFileLocation is an absolute URI that is not depersonalized. Although
this URI is only used for resolving imports, it is still serialized in saved
parsers and so can make saved parsers non-reproducible if they are built from
different directories.
To reproduce, create a saved parser for a schema, then move that schema to a
different directory and create the saved parser again. The saved parsers will
have different hashes. Here's a command that can find the paths in a saved
parser:
cat saved-parser.bin | tail -z -n +2 | gunzip -c | strings | grep dfdl.xsd
The tail removes the header in the saved parser file, then we uncompress the
remaining serialized parser, pull out all the strings, and display any that
contain a DFDL schema extension. There should be a bunch of absolute URIs that
contain non-depersonalized paths that can cause reproducibility issues.
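The reproduction steps above could be wrapped in a small script, sketched
here under a few assumptions: the function names are hypothetical and not part
of Daffodil, the header is assumed to end at the first NUL byte (as the tail
command implies), and a tr fallback is used when strings is not installed.

```shell
# Sketch only: function names are hypothetical, not part of Daffodil.

# Print printable strings from stdin, one per line. Uses strings(1) when it
# is installed and otherwise falls back to tr(1).
printable_strings() {
  if command -v strings >/dev/null 2>&1; then
    strings
  else
    tr -c '[:print:]' '\n'
  fi
}

# List every string in a saved parser that mentions a DFDL schema file.
saved_parser_paths() {
  # Skip the NUL-terminated header, decompress the rest, keep matching strings.
  tail -z -n +2 "$1" | gunzip -c | printable_strings | grep 'dfdl\.xsd'
}

# Report whether two saved parsers are byte-identical. If they are not, show
# the schema paths embedded in each so the differing directories are visible.
compare_saved_parsers() {
  if cmp -s "$1" "$2"; then
    echo "identical"
    return 0
  fi
  echo "different"
  for f in "$1" "$2"; do
    echo "== $f"
    saved_parser_paths "$f" || true
  done
  return 1
}
```

For example, compare_saved_parsers dir1/saved-parser.bin dir2/saved-parser.bin
(paths are placeholders) should print "different" followed by the absolute
URIs from each build if the issue is present.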
To fix this, we should try to remove uriString from SchemaFileLocation by just
passing it around to various functions, maybe storing it somewhere that isn't
serialized. If that isn't possible, an alternative could be to make uriString
transient--it should only ever be used to resolve imports, so it should never
be needed again once a saved parser has been created and reloaded.
Note that it is possible that some tools (e.g. VS Daffodil Extension) might
need the absolute URI from diagnostics. If uriString is removed, they will no
longer have access to it, so we may need to add a toggle that allows
diagnosticFile to keep the full absolute path, essentially making
depersonalization an optional feature. Keep in mind that diagnosticFile is a
File, which can't represent jar URIs, so it may need to be changed to a String.
was:
The SchemaFileLocation class contains diagnostic information, including
line/column number, a file path used for diagnostics, a URI, etc.
DAFFODIL-2195 made changes so that the file path used for diagnostics is
depersonalized and should be reproducible. However, the uriString member in
the SchemaFileLocation is an absolute URI that is not depersonalized. Although
this URI is only used for resolving imports, it is still serialized in saved
parsers and so can make saved parsers non-reproducible if they are built from
different directories.
To reproduce, create a saved parser for a schema, then move that schema to a
different directory and create the saved parser again. The saved parsers will
have different hashes. Here's a command that can find the paths in a saved
parser:
cat saved-parser.bin | sed '1 s/^[^\x0]\+\x0//' | gunzip -c | strings | grep dfdl.xsd
The sed removes the header in the saved parser file, then we uncompress the
serialized parser, pull out all the strings, and display any that contain a
DFDL schema extension. There should be a bunch of absolute URIs that contain
non-depersonalized paths that can cause reproducibility issues.
To fix this, we should try to remove uriString from SchemaFileLocation by just
passing it around to various functions, maybe storing it somewhere that isn't
serialized. If that isn't possible, an alternative could be to make uriString
transient--it should only ever be used to resolve imports, so once a saved
parser is created it should never be needed again once reloaded.
Note that it is possible that some tools (e.g. VS Daffodil Extension) might
need the absolute URI from diagnostics. By removing uriString, they no longer
have access to that, so we may need to add a toggle that allows diagnosticFile
to keep the full absolute path, essentially making depersonalization an
optional feature. Keep in mind that diagnosticFile is a File which can't
represent jar URIs, so it may need to be changed to a String.
> SchemaFileLocation uriString leads to non-reproducible saved parsers
> --------------------------------------------------------------------
>
> Key: DAFFODIL-2918
> URL: https://issues.apache.org/jira/browse/DAFFODIL-2918
> Project: Daffodil
> Issue Type: Bug
> Components: Front End
> Affects Versions: 3.8.0
> Reporter: Steve Lawrence
> Priority: Minor
>
> The SchemaFileLocation class contains diagnostic information, including
> line/column number, a file path used for diagnostics, a URI, etc.
> DAFFODIL-2195 made changes so that the file path used for diagnostics is
> depersonalized and should be reproducible. However, the uriString member
> in the SchemaFileLocation is an absolute URI that is not depersonalized.
> Although this URI is only used for resolving imports, it is still serialized
> in saved parsers and so can make saved parsers non-reproducible if they are
> built from different directories.
> To reproduce, create a saved parser for a schema, then move that schema to a
> different directory and create the saved parser again. The saved parsers will
> have different hashes. Here's a command that can find the paths in a saved
> parser:
> cat saved-parser.bin | tail -z -n +2 | gunzip -c | strings | grep dfdl.xsd
> The tail removes the header in the saved parser file, then we uncompress the
> remaining serialized parser, pull out all the strings, and display any that
> contain a DFDL schema extension. There should be a bunch of absolute URIs
> that contain non-depersonalized paths that can cause reproducibility issues.
> To fix this, we should try to remove uriString from SchemaFileLocation by
> just passing it around to various functions, maybe storing it somewhere that
> isn't serialized. If that isn't possible, an alternative could be to make
> uriString transient--it should only ever be used to resolve imports, so it
> should never be needed again once a saved parser has been created and
> reloaded.
> Note that it is possible that some tools (e.g. VS Daffodil Extension) might
> need the absolute URI from diagnostics. If uriString is removed, they will no
> longer have access to it, so we may need to add a toggle that allows
> diagnosticFile to keep the full absolute path, essentially making
> depersonalization an optional feature. Keep in mind that diagnosticFile is a
> File, which can't represent jar URIs, so it may need to be changed to a
> String.