gengliangwang opened a new pull request #23774: [SPARK-26871][SQL] File Source V2: avoid creating unnecessary FileIndex in the write path
URL: https://github.com/apache/spark/pull/23774
 
 
   ## What changes were proposed in this pull request?
   
   In https://github.com/apache/spark/pull/23383, the file source V2 framework 
was implemented. In that PR, `FileIndex` is created as a member of `FileTable` 
so that partition pruning, as in 
https://github.com/apache/spark/commit/0f9fcabb4ac2e8afec14d010e86467372a85d334, 
can be implemented in the future. (Because the data source V2 catalog is still 
under development, partition pruning was removed from that PR.)
   
   However, after the write path of file source V2 was implemented, I found 
that a simple write creates an unnecessary `FileIndex`, because `FileTable` 
requires one. This is a regression of sorts, and it also produces a warning 
message when writing to ORC files:
   ``` 
   WARN InMemoryFileIndex: The directory file:/tmp/foo was not found. Was it deleted very recently?
   ```
   This PR makes `FileIndex` a lazy value in `FileTable`, so that no 
unnecessary `FileIndex` is created in the write path.
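
The effect of the change can be illustrated with a minimal, self-contained sketch. This is not the actual Spark code: `SketchFileIndex`, `SketchFileTable`, `IndexCounter`, `newScan` and `newWrite` are hypothetical stand-ins, assumed here only to show how a `lazy val` defers construction until the read path first touches it.

```scala
// Hypothetical stand-ins for illustration only; these are NOT Spark's classes.

// Counts how many index instances have been constructed.
object IndexCounter {
  var created: Int = 0
}

// Stand-in for a file index whose construction is expensive
// (a real one would list files under rootPath).
final class SketchFileIndex(val rootPath: String) {
  IndexCounter.created += 1
}

// Stand-in for a table that needs the index only when reading.
final class SketchFileTable(path: String) {
  // lazy val: the index is built on first access, not when the table is built.
  lazy val fileIndex: SketchFileIndex = new SketchFileIndex(path)

  // Read path: touches the index, which forces its construction.
  def newScan(): Seq[String] = Seq(fileIndex.rootPath)

  // Write path: never touches the index, so none is created.
  def newWrite(): String = s"writing to $path"
}
```

With a plain `val`, the index would be constructed as soon as the table is instantiated, even for a write to a directory that does not exist yet. With `lazy val`, calling only `newWrite()` on the sketch leaves `IndexCounter.created` at zero.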
   
   ## How was this patch tested?
   
   Existing unit tests.
   

