[
https://issues.apache.org/jira/browse/CARBONDATA-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kunal Kapoor updated CARBONDATA-4050:
-------------------------------------
Fix Version/s: (was: 2.1.0)
2.1.1
> TPC-DS queries performance degraded when compared to older versions due to
> redundant getFileStatus() invocations
> ----------------------------------------------------------------------------------------------------------------
>
> Key: CARBONDATA-4050
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4050
> Project: CarbonData
> Issue Type: Improvement
> Components: core
> Affects Versions: 2.0.0
> Reporter: Venugopal Reddy K
> Priority: Major
> Fix For: 2.1.1
>
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> *Issue:*
> In the createCarbonDataFileBlockMetaInfoMapping method, we get the list of
> carbondata files in the segment, loop through all the carbon files, and build a
> map fileNameToMetaInfoMapping<path-string, BlockMetaInfo>.
> Inside that loop, if the file is of AbstractDFSCarbonFile type, we fetch the
> org.apache.hadoop.fs.FileStatus three times for each file. Each fetch is an RPC
> call (fileSystem.getFileStatus(path)) that takes ~2ms in the cluster, so every
> file incurs an overhead of ~6ms. With many carbon files, the overall driver-side
> query processing time increases significantly, which caused the TPC-DS query
> performance degradation.
> The methods/calls that fetch the file status for each carbon file in the loop
> are shown below:
> {code:java}
> public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
>     String segmentFilePath, Configuration configuration) throws IOException {
>   Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
>   CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, configuration);
>   if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof S3CarbonFile)) {
>     PathFilter pathFilter = new PathFilter() {
>       @Override
>       public boolean accept(Path path) {
>         return CarbonTablePath.isCarbonDataFile(path.getName());
>       }
>     };
>     CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
>     for (CarbonFile file : carbonFiles) {
>       String[] location = file.getLocations();                   // RPC call - 1
>       long len = file.getSize();                                 // RPC call - 2
>       BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
>       // RPC call - 3 inside file.getPath()
>       fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo);
>     }
>   }
>   return fileNameToMetaInfoMapping;
> }
> {code}
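>
> A simplified illustration of why each of those accessors costs a separate round
> trip (this is not the actual AbstractDFSCarbonFile code, just the pattern it
> follows): every call re-fetches the FileStatus through
> fileSystem.getFileStatus(path) instead of reusing an already-fetched status;
> getLocations() and getPath() behave analogously.
> {code:java}
> // Illustrative sketch only; the real AbstractDFSCarbonFile differs in detail.
> public long getSize() throws IOException {
>   // Each invocation triggers a fresh NameNode RPC instead of reusing a cached status.
>   return fileSystem.getFileStatus(path).getLen();
> }
> {code}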
>
> *Suggestion:*
> I think we currently make an RPC call to get the file status on each invocation
> because the file status may change over time, so in general we shouldn't cache
> it in AbstractDFSCarbonFile.
> In this case, however, just before the loop over the carbon files we already get
> the file status of all the carbon files in the segment with the RPC call shown
> below. LocatedFileStatus is a subclass of FileStatus that carries the
> BlockLocation along with the file status.
> {code:java}
> RemoteIterator<LocatedFileStatus> iter = fileSystem.listLocatedStatus(path);
> {code}
> The intention of fetching all the file statuses here is to create the
> BlockMetaInfo instances and populate the fileNameToMetaInfoMapping map.
> So it is safe to avoid the redundant RPC calls that fetch the file status again
> in the getLocations(), getSize() and getPath() methods; a sketch of the reworked
> loop follows.
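> A minimal sketch of how the loop could reuse the already-fetched
> LocatedFileStatus, assuming BlockMetaInfo keeps its current
> (String[] locations, long size) constructor; the surrounding
> AbstractDFSCarbonFile/S3CarbonFile type checks are omitted here:
> {code:java}
> Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
> Path segmentPath = new Path(segmentFilePath);
> FileSystem fileSystem = segmentPath.getFileSystem(configuration);
> RemoteIterator<LocatedFileStatus> iter = fileSystem.listLocatedStatus(segmentPath);
> while (iter.hasNext()) {
>   LocatedFileStatus fileStatus = iter.next();
>   Path filePath = fileStatus.getPath();
>   if (CarbonTablePath.isCarbonDataFile(filePath.getName())) {
>     // Hosts and length come from the listing itself - no further getFileStatus() RPCs.
>     Set<String> hosts = new LinkedHashSet<>();
>     for (BlockLocation blockLocation : fileStatus.getBlockLocations()) {
>       Collections.addAll(hosts, blockLocation.getHosts());
>     }
>     BlockMetaInfo blockMetaInfo =
>         new BlockMetaInfo(hosts.toArray(new String[0]), fileStatus.getLen());
>     fileNameToMetaInfoMapping.put(filePath.toString(), blockMetaInfo);
>   }
> }
> {code}
> This keeps the general rule of not caching file status in AbstractDFSCarbonFile
> intact, while reusing the status only where it has already been fetched once.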
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)