paul-rogers opened a new issue, #13359:
URL: https://github.com/apache/druid/issues/13359

   Serialization of the `LocalInputSourc` object converts relative file paths 
to absolute paths, changing the meaning of the MSQ ingest query.
   
   ### Affected Version
   
   Latest 25.0.0-SNAPSHOT
   
   ### Description
   
   MSQ introduces the `extern` function to specify an external data source. One 
of the possible input source is `LocalInputSource`, one of the properties 
defined as:
   
   ```java
     @JsonProperty
     @JsonInclude(JsonInclude.Include.NON_EMPTY)
     public List<File> getFiles()
     {
       return files;
     }
   ```
   
   Unfortunately, it seems that Jackson does not take the actual value of the 
`File` object, but rather the absolute path. Doing so completely changes the 
meaning of the query.
   
   This was discovered in writing a unit test for the catalog extensions to 
MSQ. Here is the definition used:
   
   ```java
     protected final ExternalDataSource localDataSource = new 
ExternalDataSource(
         new LocalInputSource(
             new File("/tmp"),
             "*.csv",
             Arrays.asList(new File("foo.csv"), new File("bar.csv"))
         ),
         ...
   ```
   
   This is then run though the existing `CalciteIngestionDmlTest.externSql()` 
method which returns:
   
   ```text
   
TABLE(extern(U&'\007B\0022type\0022\003A\0022local\0022\002C\0022baseDir\0022\003A\0022\002Ftmp\0022\002C\0022filter\0022\003A\0022\002A\002Ecsv\0022\002C\0022files\0022\003A\005B\0022\002FUsers\002Fpaul\002Fgit\002Fdruid\002Fsql\002Ffoo\002Ecsv\0022\002C\0022\002FUsers\002Fpaul\002Fgit\002Fdruid\002Fsql\002Fbar\002Ecsv\0022\005D\007D',
 
U&'\007B\0022type\0022\003A\0022csv\0022\002C\0022columns\0022\003A\005B\0022x\0022\002C\0022y\0022\002C\0022z\0022\005D\007D',
 
U&'\005B\007B\0022name\0022\003A\0022x\0022\002C\0022type\0022\003A\0022STRING\0022\007D\002C\007B\0022name\0022\003A\0022y\0022\002C\0022type\0022\003A\0022STRING\0022\007D\002C\007B\0022name\0022\003A\0022z\0022\002C\0022type\0022\003A\0022LONG\0022\007D\005D'))
   ```
   
   Unfortunately, the above is hard to read. But, notice this excerpt:
   
   ```text
   
files\0022\003A\005B\0022\002FUsers\002Fpaul\002Fgit\002Fdruid\002Fsql\002Ffoo\002Ecsv
   ```
   
   That indicates that Jackson has included the absolute path (relative to my 
Druid build directory) in the list of files.
   
   As a result, query validation failse:
   
   ```text
   expected:
   
   <ScanQuery{dataSource='ExternalDataSource{inputSource=LocalInputSource{
       baseDir="/tmp",filter=*.csv,files=[foo.csv, bar.csv]}, ...
   
   but was:
   
   <ScanQuery{dataSource='ExternalDataSource{inputSource=LocalInputSource{
       baseDir="/tmp",filter=*.csv,files=[/Users/paul/git/druid/sql/foo.csv, 
/Users/paul/git/druid/sql/bar.csv]}, ...
   ```
   
   ### Impact
   
   Though the above was found during unit testing of a MSQ table function, the 
issue very likely affects all users who use the local input source. Anytime we 
serialize that object (such as creating the MSQ controller task), the meaning 
of the `files` argument will change.
   
   ### Suggested Fix
   
   The correct fix is probably to change the `baseDir` and `files` parameters 
to work with `String`s rather than with `File` objects. This would be a 
breaking change for anyone using the code API, but should be transparent to 
anyone using the JSON form. One mediation is to provide a second constructor 
that is backward compatible, while changing the JSON versions to work with 
`String` objects.
   
   ### Workaround
   
   The workaround is to not use the `baseDir` property and always include the 
absolute path in the `files` argument. This workaround was used to allow the 
unit test in question to pass, but will be surprising to users who read the 
documentation.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to