[ 
https://issues.apache.org/jira/browse/DAFFODIL-2918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Lawrence updated DAFFODIL-2918:
-------------------------------------
    Description: 
The SchemaFileLocation class contains diagnostic information, including 
line/column number, a file path used for diagnostics, a URI, etc.

DAFFODIL-2195 made changes so that the file path used for diagnostics is 
depersonalized and should be reproducible. However, the uriString member in 
the SchemaFileLocation is an absolute URI that is not depersonalized. Although 
this URI is only used for resolving imports, it is still serialized in saved 
parsers and so can make saved parsers non-reproducible if they are built from 
different directories.

To reproduce, create a saved parser for a schema, then move that schema to a 
different directory and create the saved parser again. The saved parsers will 
have different hashes. Here's a command that can find the paths in a saved 
parser:

cat saved-parser.bin | tail -z -n +2 | gunzip -c | strings | grep dfdl.xsd

The tail removes the header in the saved parser file, then we uncompress the 
remaining serialized parser, pull out all the strings, and display any that 
contain a DFDL schema extension. There should be a bunch of absolute URIs that 
contain non-depersonalized paths that can cause reproducibility issues. 

To fix this, we should try to remove uriString from SchemaFileLocation by just 
passing it around to various functions, maybe storing it somewhere that isn't 
serialized. If that isn't possible, an alternative could be to make uriString 
transient--it should only ever be used to resolve imports, so it should never 
be needed again after a saved parser is created and reloaded.
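
As a rough illustration of the transient alternative, here is a minimal Java 
sketch. The Location class is a hypothetical stand-in, not the real 
SchemaFileLocation, and its field layout is assumed. Daffodil itself is written 
in Scala, where the @transient annotation plays the same role as the Java 
keyword used here. The sketch shows that two objects differing only in a 
transient uriString serialize to identical bytes, and that the field comes back 
as null after reloading.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

public class TransientUriSketch {
    // Hypothetical stand-in for SchemaFileLocation (assumed fields only).
    static class Location implements Serializable {
        private static final long serialVersionUID = 1L;
        final String diagnosticFile;       // depersonalized path, serialized
        final transient String uriString;  // absolute URI, skipped by serialization

        Location(String diagnosticFile, String uriString) {
            this.diagnosticFile = diagnosticFile;
            this.uriString = uriString;
        }
    }

    // Serialize an object to a byte array using standard Java serialization.
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    // Reload a Location from its serialized bytes.
    static Location deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Location) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Same schema "built" from two different directories (made-up paths).
        Location a = new Location("com/example/schema.dfdl.xsd",
                "file:/home/alice/src/com/example/schema.dfdl.xsd");
        Location b = new Location("com/example/schema.dfdl.xsd",
                "file:/tmp/build/com/example/schema.dfdl.xsd");

        // The absolute URIs differ, but the serialized bytes do not.
        System.out.println(Arrays.equals(serialize(a), serialize(b))); // true

        // After reloading, the transient field is null.
        System.out.println(deserialize(serialize(a)).uriString); // null
    }
}
```

The trade-off is that any code touching uriString after a reload has to 
tolerate null, which is only safe if the field really is used for import 
resolution alone.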

Note that it is possible that some tools (e.g. VS Daffodil Extension) might 
need the absolute URI from diagnostics. By removing uriString, they no longer 
have access to that, so we may need to add a toggle that allows diagnosticFile 
to keep the full absolute path, essentially making depersonalization an 
optional feature. Keep in mind that diagnosticFile is a File which can't 
represent jar URIs, so it may need to be changed to a String.
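
The File limitation can be seen with a small Java sketch. The helper name and 
the example URIs are made up for illustration. It relies only on the documented 
behavior of the java.io.File(URI) constructor, which throws 
IllegalArgumentException for URIs that are not hierarchical file: URIs.

```java
import java.io.File;
import java.net.URI;

public class JarUriSketch {
    // Hypothetical helper: true if the URI can be converted to a java.io.File.
    static boolean representableAsFile(String uri) {
        try {
            new File(URI.create(uri));
            return true;
        } catch (IllegalArgumentException e) {
            // File(URI) rejects opaque URIs and schemes other than "file".
            return false;
        }
    }

    public static void main(String[] args) {
        String fileUri = "file:/home/alice/schema.dfdl.xsd";
        String jarUri = "jar:file:/home/alice/schemas.jar!/com/example/schema.dfdl.xsd";

        System.out.println(representableAsFile(fileUri)); // true
        System.out.println(representableAsFile(jarUri));  // false

        // A String field could hold either form unchanged.
        String diagnostic = jarUri;
        System.out.println(diagnostic);
    }
}
```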

  was:
The SchemaFileLocation class contains diagnostic information, including 
line/column number, a file path used for diagnostics, a URI, etc.

DAFFODIL-2195 made changes so that the file path used for diagnostics is 
depersonalized and should be reproducible. However, the uriString member in 
the SchemaFileLocation is an absolute URI that is not depersonalized. Although 
this URI is only used for resolving imports, it is still serialized in saved 
parsers and so can make saved parsers non-reproducible if they are built from 
different directories.

To reproduce, create a saved parser for a schema, then move that schema to a 
different directory and create the saved parser again. The saved parsers will 
have different hashes. Here's a command that can find the paths in a saved 
parser:

cat saved-parser.bin | sed '1 s/^[^\x0]\+\x0//' | gunzip -c | strings | grep dfdl.xsd

The sed removes the header in the saved parser file, then we uncompress the 
serialized parser, pull out all the strings, and display any that contain a 
DFDL schema extension. There should be a bunch of absolute URIs that contain 
non-depersonalized paths that can cause reproducibility issues. 

To fix this, we should try to remove uriString from SchemaFileLocation by just 
passing it around to various functions, maybe storing it somewhere that isn't 
serialized. If that isn't possible, an alternative could be to make uriString 
transient--it should only ever be used to resolve imports, so it should never 
be needed again after a saved parser is created and reloaded.

Note that it is possible that some tools (e.g. VS Daffodil Extension) might 
need the absolute URI from diagnostics. By removing uriString, they no longer 
have access to that, so we may need to add a toggle that allows diagnosticFile 
to keep the full absolute path, essentially making depersonalization an 
optional feature. Keep in mind that diagnosticFile is a File which can't 
represent jar URIs, so it may need to be changed to a String.


> SchemaFileLocation uriString leads to non-reproducible saved parsers
> --------------------------------------------------------------------
>
>                 Key: DAFFODIL-2918
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-2918
>             Project: Daffodil
>          Issue Type: Bug
>          Components: Front End
>    Affects Versions: 3.8.0
>            Reporter: Steve Lawrence
>            Priority: Minor
>
> The SchemaFileLocation class contains diagnostic information, including 
> line/column number, a file path used for diagnostics, a URI, etc.
> DAFFODIL-2195 made changes so that the file path used for diagnostics is 
> depersonalized and should be reproducible. However, the uriString member 
> in the SchemaFileLocation is an absolute URI that is not depersonalized. 
> Although this URI is only used for resolving imports, it is still serialized 
> in saved parsers and so can make saved parsers non-reproducible if they are 
> built from different directories.
> To reproduce, create a saved parser for a schema, then move that schema to a 
> different directory and create the saved parser again. The saved parsers will 
> have different hashes. Here's a command that can find the paths in a saved 
> parser:
> cat saved-parser.bin | tail -z -n +2 | gunzip -c | strings | grep dfdl.xsd
> The tail removes the header in the saved parser file, then we uncompress the 
> remaining serialized parser, pull out all the strings, and display any that 
> contain a DFDL schema extension. There should be a bunch of absolute URIs 
> that contain non-depersonalized paths that can cause reproducibility issues. 
> To fix this, we should try to remove uriString from SchemaFileLocation by 
> just passing it around to various functions, maybe storing it somewhere that 
> isn't serialized. If that isn't possible, an alternative could be to make 
> uriString transient--it should only ever be used to resolve imports, so it 
> should never be needed again after a saved parser is created and reloaded.
> Note that it is possible that some tools (e.g. VS Daffodil Extension) might 
> need the absolute URI from diagnostics. By removing uriString, they no longer 
> have access to that, so we may need to add a toggle that allows 
> diagnosticFile to keep the full absolute path, essentially making 
> depersonalization an optional feature. Keep in mind that diagnosticFile is a 
> File which can't represent jar URIs, so it may need to be changed to a 
> String.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
