[ 
https://issues.apache.org/jira/browse/CARBONDATA-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-4050:
------------------------------------------
    Description: 
*Issue:*

In the {{createCarbonDataFileBlockMetaInfoMapping}} method, we get the list of 
carbondata files in the segment, loop through all the carbon files, and build a 
map of {{fileNameToMetaInfoMapping<path-string, BlockMetaInfo>}}.

      In that loop over the carbon files, if the file is of 
{{AbstractDFSCarbonFile}} type, we fetch the 
{{org.apache.hadoop.fs.FileStatus}} three times for each file. Each fetch is an 
RPC call ({{fileSystem.getFileStatus(path)}}) that takes ~2 ms in the cluster, 
so it incurs an overhead of ~6 ms per file. As a result, overall driver-side 
query processing time has increased significantly.

The file status is fetched inside the loop by the {{getLocations()}}, 
{{getSize()}} and {{getPath()}} calls below:

 
{code:java}
public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
    String segmentFilePath, Configuration configuration) throws IOException {
  Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
  CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, configuration);
  if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof S3CarbonFile)) {
    PathFilter pathFilter = new PathFilter() {
      @Override
      public boolean accept(Path path) {
        return CarbonTablePath.isCarbonDataFile(path.getName());
      }
    };
    CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
    for (CarbonFile file : carbonFiles) {
      String[] location = file.getLocations(); // RPC: fileSystem.getFileStatus(path)
      long len = file.getSize();               // RPC: fileSystem.getFileStatus(path)
      BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
      fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo); // RPC: getFileStatus
    }
  }
  return fileNameToMetaInfoMapping;
}
{code}
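To put that overhead in perspective, a rough back-of-the-envelope estimate. Only the ~2 ms per-call latency comes from the measurement above; the segment size of 10,000 files is an assumed example value, not a figure from this issue:

{code:java}
public class RpcOverheadEstimate {
  public static void main(String[] args) {
    int filesInSegment = 10_000;    // assumed example segment size
    int statusCallsPerFile = 3;     // getLocations(), getSize(), getPath()
    double millisPerRpc = 2.0;      // measured average in the cluster
    double totalSeconds = filesInSegment * statusCallsPerFile * millisPerRpc / 1000.0;
    System.out.println(totalSeconds + " seconds of driver time");
  }
}
{code}
For a segment of that size, the redundant calls alone would add roughly a minute of driver-side time to query processing.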
 

 

*Suggestion:*

I think we currently make an RPC call to fetch the file status on every 
invocation because the file status may change over time, so in general we 
shouldn't cache the file status in {{AbstractDFSCarbonFile}}.

     In the current case, however, just before the loop over the carbon files, 
we already fetch the file status of all the carbon files in the segment with 
the single RPC call shown below. {{LocatedFileStatus}} is a subclass of 
{{FileStatus}} that carries the {{BlockLocation}}s along with the file status.

 
{code:java}
RemoteIterator<LocatedFileStatus> iter = fileSystem.listLocatedStatus(path);
{code}
The intention of fetching all the file statuses here is to create the 
{{BlockMetaInfo}} instances and populate the {{fileNameToMetaInfoMapping}} map. 
So it is safe to avoid the unnecessary RPC calls that fetch the file status 
again in the {{getLocations()}}, {{getSize()}} and {{getPath()}} methods.
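As a minimal, self-contained sketch of the idea: the {{FileMeta}} class below is a hypothetical stand-in for Hadoop's {{LocatedFileStatus}}, and {{buildMapping}} shows the map being built purely from the statuses returned by the one bulk listing call. The real change would hand the already-fetched status to the {{AbstractDFSCarbonFile}} instances so that {{getLocations()}}, {{getSize()}} and {{getPath()}} answer from it instead of issuing new {{getFileStatus}} RPCs:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CachedStatusSketch {

  /** Hypothetical stand-in for org.apache.hadoop.fs.LocatedFileStatus:
   *  path, length and block locations captured once by the bulk listing RPC. */
  static final class FileMeta {
    final String path;
    final long length;
    final String[] locations;
    FileMeta(String path, long length, String[] locations) {
      this.path = path;
      this.length = length;
      this.locations = locations;
    }
  }

  /** Builds path -> metadata purely from the already-listed statuses;
   *  no further per-file RPC is needed for path, size or locations. */
  static Map<String, FileMeta> buildMapping(List<FileMeta> listedStatuses) {
    Map<String, FileMeta> mapping = new TreeMap<>();
    for (FileMeta meta : listedStatuses) {
      if (meta.path.endsWith(".carbondata")) { // CarbonTablePath.isCarbonDataFile analogue
        mapping.put(meta.path, meta);
      }
    }
    return mapping;
  }

  public static void main(String[] args) {
    List<FileMeta> listed = new ArrayList<>();
    listed.add(new FileMeta("/seg0/part-0-0.carbondata", 1024L, new String[] {"host1"}));
    listed.add(new FileMeta("/seg0/part-0-0.carbonindex", 128L, new String[] {"host1"}));
    // Only the .carbondata file survives the filter; no per-file status RPCs occur.
    System.out.println(buildMapping(listed).size());
  }
}
{code}
All names here are illustrative; the actual fix would live in {{createCarbonDataFileBlockMetaInfoMapping}} and {{AbstractDFSCarbonFile}}.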

 


> TPC-DS queries performance degraded when compared to older versions due to 
> redundant getFileStatus() invocations
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: CARBONDATA-4050
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-4050
>             Project: CarbonData
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 2.0.0
>            Reporter: Venugopal Reddy K
>            Priority: Major
>             Fix For: 2.1.0
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
