[jira] [Comment Edited] (HBASE-12223) MultiTableInputFormatBase.getSplits is too slow

Ted Yu (JIRA) Thu, 09 Oct 2014 19:37:45 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-12223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166178#comment-14166178
 ]


Ted Yu edited comment on HBASE-12223 at 10/10/14 2:37 AM:
----------------------------------------------------------

Using AdmasterMultiTableInputFormatBase significantly reduces the time spent 
in getSplits when there is one table with multiple scans, multiple tables with 
one scan each, or multiple tables with multiple scans. The call to 
table.getStartEndKeys() in getSplits is very time-consuming, so we made the 
following modification: for each table, table.getStartEndKeys() is executed 
only once, no matter how many scans target that table. This greatly shortens 
the time.
{code}
@Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
      if (scans.isEmpty()) {
          throw new IOException("No scans were provided.");
      }

      // Group the scans by table name so that each table is opened, and its
      // region boundaries fetched, only once.
      Map<String, List<Scan>> tableMaps = new HashMap<String, List<Scan>>();
      for (Scan scan : scans) {
          byte[] tableName = scan.getAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME);
          // Check for a missing table name here, before any HTable is created.
          if (tableName == null) {
              throw new IOException("A scan object did not have a table name");
          }
          String tableNameStr = Bytes.toString(tableName);
          List<Scan> scanList = tableMaps.get(tableNameStr);
          if (scanList == null) {
              scanList = new ArrayList<Scan>();
              tableMaps.put(tableNameStr, scanList);
          }
          scanList.add(scan);
      }

      List<InputSplit> splits = new ArrayList<InputSplit>();
      for (Map.Entry<String, List<Scan>> entry : tableMaps.entrySet()) {
          String tableNameStr = entry.getKey();
          HTable table = new HTable(context.getConfiguration(), tableNameStr);
          try {
              // The expensive call: executed once per table, not once per scan.
              Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
              if (keys == null || keys.getFirst() == null ||
                      keys.getFirst().length == 0) {
                  throw new IOException("Expecting at least one region for table : "
                          + tableNameStr);
              }
              byte[][] startKeys = keys.getFirst();
              byte[][] endKeys = keys.getSecond();
              for (Scan scan : entry.getValue()) {
                  int count = 0;
                  byte[] startRow = scan.getStartRow();
                  byte[] stopRow = scan.getStopRow();
                  for (int i = 0; i < startKeys.length; i++) {
                      if (!includeRegionInSplit(startKeys[i], endKeys[i])) {
                          continue;
                      }

                      // determine if the given start and stop keys fall into the range
                      if ((startRow.length == 0 || endKeys[i].length == 0
                              || Bytes.compareTo(startRow, endKeys[i]) < 0)
                              && (stopRow.length == 0
                              || Bytes.compareTo(stopRow, startKeys[i]) > 0)) {
                          // Clip the split to the overlap of the scan and the region.
                          byte[] splitStart = startRow.length == 0
                              || Bytes.compareTo(startKeys[i], startRow) >= 0
                              ? startKeys[i] : startRow;
                          byte[] splitStop = (stopRow.length == 0
                              || Bytes.compareTo(endKeys[i], stopRow) <= 0)
                              && endKeys[i].length > 0 ? endKeys[i] : stopRow;
                          String regionLocation =
                              table.getRegionLocation(startKeys[i], false).getHostname();
                          InputSplit split = new TableSplit(Bytes.toBytes(tableNameStr),
                              scan, splitStart, splitStop, regionLocation);
                          splits.add(split);
                          if (LOG.isDebugEnabled()) {
                              LOG.debug("getSplits: split -> " + (count++) + " -> " + split);
                          }
                      }
                  }
              }
          } finally {
              // Release the table even if a region lookup throws.
              table.close();
          }
      }

      return splits;
  }
{code}
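
To make the saving concrete, here is a minimal, HBase-free sketch of the same idea: group the scans by table, then run the expensive lookup once per table instead of once per scan. Everything in it is hypothetical and for illustration only. ScanStub stands in for Scan, expensiveRegionLookup stands in for table.getStartEndKeys(), and the fixed count of 4 regions per table is an assumption, not a measurement.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a Scan: only the table name matters here. */
final class ScanStub {
    final String tableName;

    ScanStub(String tableName) {
        this.tableName = tableName;
    }
}

final class GroupedLookup {
    /** Counts how often the simulated expensive lookup ran. */
    static int lookupCalls = 0;

    /** Simulates table.getStartEndKeys(); assumes 4 regions for every table. */
    static int expensiveRegionLookup(String tableName) {
        lookupCalls++;
        return 4;
    }

    /** Groups scans by table name, preserving first-seen table order. */
    static Map<String, List<ScanStub>> groupByTable(List<ScanStub> scans) {
        Map<String, List<ScanStub>> byTable = new LinkedHashMap<>();
        for (ScanStub scan : scans) {
            byTable.computeIfAbsent(scan.tableName, k -> new ArrayList<>()).add(scan);
        }
        return byTable;
    }

    /** Counts (scan, region) pairs while doing one lookup per table. */
    static int countSplits(List<ScanStub> scans) {
        int splits = 0;
        for (Map.Entry<String, List<ScanStub>> entry : groupByTable(scans).entrySet()) {
            int regions = expensiveRegionLookup(entry.getKey()); // once per table
            splits += regions * entry.getValue().size();
        }
        return splits;
    }

    public static void main(String[] args) {
        lookupCalls = 0;
        List<ScanStub> scans = new ArrayList<>();
        for (int i = 0; i < 800; i++) {
            scans.add(new ScanStub("table" + (i % 2))); // 800 scans over 2 tables
        }
        int splits = countSplits(scans);
        System.out.println(splits + " splits from " + lookupCalls + " lookups");
    }
}
```

With 800 scans spread over 2 tables, the sketch performs 2 lookups where a per-scan approach would perform 800. The real gain depends on how many scans share a table.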


was (Author: pengyuanbo):
The same comment text and getSplits code as above, posted without the {code} 
markers around the snippet.

> MultiTableInputFormatBase.getSplits is too slow
> -----------------------------------------------
>
>                 Key: HBASE-12223
>                 URL: https://issues.apache.org/jira/browse/HBASE-12223
>             Project: HBase
>          Issue Type: Improvement
>          Components: Client
>    Affects Versions: 0.94.15
>            Reporter: shanwen
>            Priority: Minor
>
> When using multiple scans, getSplits is too slow: 800 scans take five minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
