bbeaudreault commented on code in PR #4770:
URL: https://github.com/apache/hbase/pull/4770#discussion_r1083067144


##########
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/WALPlayer.java:
##########
@@ -88,19 +104,23 @@ protected WALPlayer(final Configuration c) {
    */
   @Deprecated
   static class WALKeyValueMapper extends Mapper<WALKey, WALEdit, ImmutableBytesWritable, KeyValue> {
-    private byte[] table;
+    private Set<String> tableSet = new HashSet<String>();
+    private boolean multiTableSupport = false;
 
     @Override
     public void map(WALKey key, WALEdit value, Context context) throws IOException {
       try {
-        // skip all other tables
-        if (Bytes.equals(table, key.getTableName().getName())) {
+        TableName table = key.getTableName();
+        if (tableSet.contains(table.getNameAsString())) {
           for (Cell cell : value.getCells()) {
-            KeyValue kv = KeyValueUtil.ensureKeyValue(cell);
-            if (WALEdit.isMetaEditFamily(kv)) {
+            if (WALEdit.isMetaEditFamily(cell)) {
               continue;
             }
-            context.write(new ImmutableBytesWritable(CellUtil.cloneRow(kv)), kv);
+            byte[] outKey = multiTableSupport
+              ? Bytes.add(table.getName(), Bytes.toBytes(tableSeparator),
+                CellUtil.cloneRow(KeyValueUtil.ensureKeyValue(cell)))
+              : CellUtil.cloneRow(KeyValueUtil.ensureKeyValue(cell));
+            context.write(new ImmutableBytesWritable(outKey), KeyValueUtil.ensureKeyValue(cell));

Review Comment:
   I fear this block can cause a performance regression. Looking at `KeyValueUtil.ensureKeyValue`, depending on the concrete type of `cell` it can allocate a new KeyValue object, and this block now calls it twice per cell. Unnecessary object creation in Hadoop jobs can be expensive over a long job.
   
   Can you please update this to call `KeyValueUtil.ensureKeyValue` only once for each `cell`?
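   For illustration, a rough sketch of what I mean (untested, assuming the fields and variables already present in this diff):
   
   ```
   for (Cell cell : value.getCells()) {
     if (WALEdit.isMetaEditFamily(cell)) {
       continue;
     }
     // Convert once, then reuse the result for both the output key and the value.
     KeyValue kv = KeyValueUtil.ensureKeyValue(cell);
     byte[] outKey = multiTableSupport
       ? Bytes.add(table.getName(), Bytes.toBytes(tableSeparator), CellUtil.cloneRow(kv))
       : CellUtil.cloneRow(kv);
     context.write(new ImmutableBytesWritable(outKey), kv);
   }
   ```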



##########
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/WALPlayer.java:
##########
@@ -71,6 +76,17 @@ public class WALPlayer extends Configured implements Tool {
   public final static String TABLE_MAP_KEY = "wal.input.tablesmap";
   public final static String INPUT_FILES_SEPARATOR_KEY = "wal.input.separator";
  public final static String IGNORE_MISSING_FILES = "wal.input.ignore.missing.files";
+  public final static String MULTI_TABLES_SUPPORT = "wal.multi.tables.support";
+
+  protected static final String tableSeparator = ";";
+
+  // This relies on Hadoop Configuration to handle warning about deprecated configs and
+  // to set the correct non-deprecated configs when an old one shows up.
+  static {
+    Configuration.addDeprecation("hlog.bulk.output", BULK_OUTPUT_CONF_KEY);
+    Configuration.addDeprecation("hlog.input.tables", TABLES_KEY);
+    Configuration.addDeprecation("hlog.input.tablesmap", TABLE_MAP_KEY);
+  }

Review Comment:
   Was there a reason to add these? They aren't an issue per se, but I noticed these deprecations are already registered in the HBaseConfiguration class.



##########
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/WALPlayer.java:
##########
@@ -340,19 +357,21 @@ public Job createSubmittableJob(String[] args) throws IOException {
       LOG.debug("add incremental job :" + hfileOutPath + " from " + inputDirs);
 
       // the bulk HFile case
-      if (tables.length != 1) {
-        throw new IOException("Exactly one table must be specified for the bulk export option");
-      }
-      TableName tableName = TableName.valueOf(tables[0]);
+      List<TableName> tableNames = getTableNameList(tables);
+
       job.setMapperClass(WALCellMapper.class);
       job.setReducerClass(CellSortReducer.class);
       Path outputDir = new Path(hfileOutPath);
       FileOutputFormat.setOutputPath(job, outputDir);
       job.setMapOutputValueClass(MapReduceExtendedCell.class);
-      try (Connection conn = ConnectionFactory.createConnection(conf);
-        Table table = conn.getTable(tableName);
-        RegionLocator regionLocator = conn.getRegionLocator(tableName)) {
-        HFileOutputFormat2.configureIncrementalLoad(job, table.getDescriptor(), regionLocator);
+      try (Connection conn = ConnectionFactory.createConnection(conf);) {
+        List<TableInfo> tableInfoList = new ArrayList<>();
+        for (TableName tableName : tableNames) {
+          Table table = conn.getTable(tableName);
+          RegionLocator regionLocator = conn.getRegionLocator(tableName);
+          tableInfoList.add(new TableInfo(table.getDescriptor(), regionLocator));
+        }
+        MultiTableHFileOutputFormat.configureIncrementalLoad(job, tableInfoList);

Review Comment:
   Ok so I think we might want to make one more compatibility fix here:
   
   Previously, if a user called this method with more than 1 table it would throw an exception. Now, those tables will be collected and passed into MultiTableHFileOutputFormat.
   
   That is fine, I think. However, if you call this method with just 1 table (as most users do today, since > 1 throws an error), you will now also go through MultiTableHFileOutputFormat.
   
   I think on line 374 here we should just be defensive and do this:
   
   ```
   if (tableInfoList.size() > 1) {
     MultiTableHFileOutputFormat.configureIncrementalLoad(job, tableInfoList);
   } else {
     TableInfo tableInfo = tableInfoList.get(0);
     HFileOutputFormat2.configureIncrementalLoad(job, tableInfo.getTableDescriptor(), tableInfo.getRegionLocator());
   }
   ```
   
   This is important because if you look at HFileOutputFormat2.configureIncrementalLoad there is this:
   
   ```
   boolean writeMultipleTables = false;
   if (MultiTableHFileOutputFormat.class.equals(cls)) {
     writeMultipleTables = true;
     conf.setBoolean(MULTI_TABLE_HFILEOUTPUTFORMAT_CONF_KEY, true);
   }
   ```
   
   So this change would cause all users of WALPlayer to automatically pick up the `writeMultipleTables` behavior, even if they only use 1 table. That behavior involves a number of differences in the partitioner, splits, output directory, etc.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
