paul-rogers opened a new issue, #13359:
URL: https://github.com/apache/druid/issues/13359
Serialization of the `LocalInputSourc` object converts relative file paths
to absolute paths, changing the meaning of the MSQ ingest query.
### Affected Version
Latest 25.0.0-SNAPSHOT
### Description
MSQ introduces the `extern` function to specify an external data source. One
of the possible input source is `LocalInputSource`, one of the properties
defined as:
```java
@JsonProperty
@JsonInclude(JsonInclude.Include.NON_EMPTY)
public List<File> getFiles()
{
return files;
}
```
Unfortunately, it seems that Jackson does not take the actual value of the
`File` object, but rather the absolute path. Doing so completely changes the
meaning of the query.
This was discovered in writing a unit test for the catalog extensions to
MSQ. Here is the definition used:
```java
protected final ExternalDataSource localDataSource = new
ExternalDataSource(
new LocalInputSource(
new File("/tmp"),
"*.csv",
Arrays.asList(new File("foo.csv"), new File("bar.csv"))
),
...
```
This is then run though the existing `CalciteIngestionDmlTest.externSql()`
method which returns:
```text
TABLE(extern(U&'\007B\0022type\0022\003A\0022local\0022\002C\0022baseDir\0022\003A\0022\002Ftmp\0022\002C\0022filter\0022\003A\0022\002A\002Ecsv\0022\002C\0022files\0022\003A\005B\0022\002FUsers\002Fpaul\002Fgit\002Fdruid\002Fsql\002Ffoo\002Ecsv\0022\002C\0022\002FUsers\002Fpaul\002Fgit\002Fdruid\002Fsql\002Fbar\002Ecsv\0022\005D\007D',
U&'\007B\0022type\0022\003A\0022csv\0022\002C\0022columns\0022\003A\005B\0022x\0022\002C\0022y\0022\002C\0022z\0022\005D\007D',
U&'\005B\007B\0022name\0022\003A\0022x\0022\002C\0022type\0022\003A\0022STRING\0022\007D\002C\007B\0022name\0022\003A\0022y\0022\002C\0022type\0022\003A\0022STRING\0022\007D\002C\007B\0022name\0022\003A\0022z\0022\002C\0022type\0022\003A\0022LONG\0022\007D\005D'))
```
Unfortunately, the above is hard to read. But, notice this excerpt:
```text
files\0022\003A\005B\0022\002FUsers\002Fpaul\002Fgit\002Fdruid\002Fsql\002Ffoo\002Ecsv
```
That indicates that Jackson has included the absolute path (relative to my
Druid build directory) in the list of files.
As a result, query validation failse:
```text
expected:
<ScanQuery{dataSource='ExternalDataSource{inputSource=LocalInputSource{
baseDir="/tmp",filter=*.csv,files=[foo.csv, bar.csv]}, ...
but was:
<ScanQuery{dataSource='ExternalDataSource{inputSource=LocalInputSource{
baseDir="/tmp",filter=*.csv,files=[/Users/paul/git/druid/sql/foo.csv,
/Users/paul/git/druid/sql/bar.csv]}, ...
```
### Impact
Though the above was found during unit testing of a MSQ table function, the
issue very likely affects all users who use the local input source. Anytime we
serialize that object (such as creating the MSQ controller task), the meaning
of the `files` argument will change.
### Suggested Fix
The correct fix is probably to change the `baseDir` and `files` parameters
to work with `String`s rather than with `File` objects. This would be a
breaking change for anyone using the code API, but should be transparent to
anyone using the JSON form. One mediation is to provide a second constructor
that is backward compatible, while changing the JSON versions to work with
`String` objects.
### Workaround
The workaround is to not use the `baseDir` property and always include the
absolute path in the `files` argument. This workaround was used to allow the
unit test in question to pass, but will be surprising to users who read the
documentation.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]