bbeaudreault commented on code in PR #4770:
URL: https://github.com/apache/hbase/pull/4770#discussion_r1083067144
##########
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/WALPlayer.java:
##########
@@ -88,19 +104,23 @@ protected WALPlayer(final Configuration c) {
*/
@Deprecated
  static class WALKeyValueMapper extends Mapper<WALKey, WALEdit, ImmutableBytesWritable, KeyValue> {
-    private byte[] table;
+    private Set<String> tableSet = new HashSet<String>();
+    private boolean multiTableSupport = false;
    @Override
    public void map(WALKey key, WALEdit value, Context context) throws IOException {
      try {
-        // skip all other tables
-        if (Bytes.equals(table, key.getTableName().getName())) {
+        TableName table = key.getTableName();
+        if (tableSet.contains(table.getNameAsString())) {
          for (Cell cell : value.getCells()) {
-            KeyValue kv = KeyValueUtil.ensureKeyValue(cell);
-            if (WALEdit.isMetaEditFamily(kv)) {
+            if (WALEdit.isMetaEditFamily(cell)) {
              continue;
            }
-            context.write(new ImmutableBytesWritable(CellUtil.cloneRow(kv)), kv);
+            byte[] outKey = multiTableSupport
+              ? Bytes.add(table.getName(), Bytes.toBytes(tableSeparator),
+                CellUtil.cloneRow(KeyValueUtil.ensureKeyValue(cell)))
+              : CellUtil.cloneRow(KeyValueUtil.ensureKeyValue(cell));
+            context.write(new ImmutableBytesWritable(outKey), KeyValueUtil.ensureKeyValue(cell));
Review Comment:
I fear this block can cause a performance regression. Looking at `KeyValueUtil.ensureKeyValue`, depending on the type of `cell` it can result in the creation of a new KeyValue object. Unnecessary object creation in Hadoop jobs can be expensive over a long job.

Can you please update this to only call `KeyValueUtil.ensureKeyValue` once for each `cell`?
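Roughly something like this (untested sketch, just hoisting the conversion so it happens once per cell):
```
for (Cell cell : value.getCells()) {
  if (WALEdit.isMetaEditFamily(cell)) {
    continue;
  }
  // Convert once, then reuse the result for both the output key and the value.
  KeyValue kv = KeyValueUtil.ensureKeyValue(cell);
  byte[] outKey = multiTableSupport
    ? Bytes.add(table.getName(), Bytes.toBytes(tableSeparator), CellUtil.cloneRow(kv))
    : CellUtil.cloneRow(kv);
  context.write(new ImmutableBytesWritable(outKey), kv);
}
```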
##########
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/WALPlayer.java:
##########
@@ -71,6 +76,17 @@ public class WALPlayer extends Configured implements Tool {
public final static String TABLE_MAP_KEY = "wal.input.tablesmap";
public final static String INPUT_FILES_SEPARATOR_KEY = "wal.input.separator";
  public final static String IGNORE_MISSING_FILES = "wal.input.ignore.missing.files";
+ public final static String MULTI_TABLES_SUPPORT = "wal.multi.tables.support";
+
+ protected static final String tableSeparator = ";";
+
+  // This relies on Hadoop Configuration to handle warning about deprecated configs and
+ // to set the correct non-deprecated configs when an old one shows up.
+ static {
+ Configuration.addDeprecation("hlog.bulk.output", BULK_OUTPUT_CONF_KEY);
+ Configuration.addDeprecation("hlog.input.tables", TABLES_KEY);
+ Configuration.addDeprecation("hlog.input.tablesmap", TABLE_MAP_KEY);
+ }
Review Comment:
Was there a reason to add these? They aren't an issue per se, but I noticed these deprecations are already registered in the HBaseConfiguration class.
##########
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/WALPlayer.java:
##########
@@ -340,19 +357,21 @@ public Job createSubmittableJob(String[] args) throws IOException {
LOG.debug("add incremental job :" + hfileOutPath + " from " + inputDirs);
// the bulk HFile case
-      if (tables.length != 1) {
-        throw new IOException("Exactly one table must be specified for the bulk export option");
-      }
-      TableName tableName = TableName.valueOf(tables[0]);
+      List<TableName> tableNames = getTableNameList(tables);
+
job.setMapperClass(WALCellMapper.class);
job.setReducerClass(CellSortReducer.class);
Path outputDir = new Path(hfileOutPath);
FileOutputFormat.setOutputPath(job, outputDir);
job.setMapOutputValueClass(MapReduceExtendedCell.class);
-      try (Connection conn = ConnectionFactory.createConnection(conf);
-        Table table = conn.getTable(tableName);
-        RegionLocator regionLocator = conn.getRegionLocator(tableName)) {
-        HFileOutputFormat2.configureIncrementalLoad(job, table.getDescriptor(), regionLocator);
+      try (Connection conn = ConnectionFactory.createConnection(conf);) {
+        List<TableInfo> tableInfoList = new ArrayList<>();
+        for (TableName tableName : tableNames) {
+          Table table = conn.getTable(tableName);
+          RegionLocator regionLocator = conn.getRegionLocator(tableName);
+          tableInfoList.add(new TableInfo(table.getDescriptor(), regionLocator));
+        }
+        MultiTableHFileOutputFormat.configureIncrementalLoad(job, tableInfoList);
Review Comment:
Ok, so I think we might want to make one more compatibility fix here. Previously, if a user called this method with more than one table, it would throw an exception. Now those tables are collected and passed into MultiTableHFileOutputFormat. That is fine, I think.

However, if you call this method with just one table (as most users would today, since more than one throws an error), you will also go through MultiTableHFileOutputFormat. I think on line 374 here we should just be defensive and do this:
```
if (tableInfoList.size() > 1) {
  MultiTableHFileOutputFormat.configureIncrementalLoad(job, tableInfoList);
} else {
  TableInfo tableInfo = tableInfoList.get(0);
  HFileOutputFormat2.configureIncrementalLoad(job, tableInfo.getTableDescriptor(),
    tableInfo.getRegionLocator());
}
```
This is important because if you look at
HFileOutputFormat2.configureIncrementalLoad there is this:
```
boolean writeMultipleTables = false;
if (MultiTableHFileOutputFormat.class.equals(cls)) {
writeMultipleTables = true;
conf.setBoolean(MULTI_TABLE_HFILEOUTPUTFORMAT_CONF_KEY, true);
}
```
So this change would cause all users of WALPlayer to automatically pick up the `writeMultipleTables` behavior, even if they use just one table. The `writeMultipleTables` behavior involves a bunch of differences in the partitioner, splits, output directory, etc.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]