Copilot commented on code in PR #6347:
URL: https://github.com/apache/hive/pull/6347#discussion_r2887932956
##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##########
@@ -1025,12 +1024,12 @@ private List<Partition> getPartitionsByNames(String catName, String dbName,
         + " left outer join " + SERDES + " on " + SDS + ".\"SERDE_ID\" = " + SERDES + ".\"SERDE_ID\" "
         + " inner join " + TBLS + " on " + TBLS + ".\"TBL_ID\" = " + PARTITIONS + ".\"TBL_ID\" "
         + " inner join " + DBS + " on " + DBS + ".\"DB_ID\" = " + TBLS + ".\"DB_ID\" "
-        + " where \"PART_NAME\" in (" + quotedPartNames + ") "
+        + " where \"PART_NAME\" in (" + makeParams(partNameList.size()) + ") "
         + " and " + TBLS + ".\"TBL_NAME\" = ? and " + DBS + ".\"NAME\" = ? and " + DBS + ".\"CTLG_NAME\" = ? order by \"PART_NAME\" asc";
-
-    Object[] params = new Object[]{tblName, dbName, catName};
-    return getPartitionsByQuery(catName, dbName, tblName, queryText, params, isAcidTable, args);
+    List params = new ArrayList<>(partNameList);
Review Comment:
`List params = new ArrayList<>(partNameList);` uses a raw type even though the list is intended to hold parameters of a known type (all are `String` here). Use a typed collection (e.g., `List<Object>` or `List<String>`) to avoid rawtype warnings and keep the parameter construction type-safe.
```suggestion
List<String> params = new ArrayList<>(partNameList);
```
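To make the suggestion concrete, here is a minimal standalone sketch of the typed construction (the `buildParams` helper and its framing as a separate class are hypothetical, not code from the PR; the names `partNameList`, `tblName`, `dbName`, and `catName` follow the surrounding snippet). `List<Object>` is used rather than `List<String>` so the same list could later carry non-String JDBC parameters without another rawtype regression:

```java
import java.util.ArrayList;
import java.util.List;

public class ParamListSketch {
  // Hypothetical helper: assembles the positional JDBC parameters for the
  // PART_NAME IN (...) query, in the order the '?' placeholders appear:
  // all partition names first, then TBL_NAME, NAME (db), and CTLG_NAME.
  public static List<Object> buildParams(List<String> partNameList,
      String tblName, String dbName, String catName) {
    // Typed construction: no rawtype warning, and the compiler checks
    // everything added to the list.
    List<Object> params = new ArrayList<>(partNameList);
    params.add(tblName);
    params.add(dbName);
    params.add(catName);
    return params;
  }
}
```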
##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/DirectSqlAggrStats.java:
##########
@@ -184,34 +185,19 @@ public List<ColumnStatisticsObj> columnStatisticsObjForPartitions(
       final double ndvTuner, final boolean enableBitVector, boolean enableKll) throws MetaException {
     final boolean areAllPartsFound = (partsFound == partNames.size());
-    return Batchable.runBatched(batchSize, colNames, new Batchable<String, ColumnStatisticsObj>() {
+    int batch = dbType.getMaxBatch(batchSize, colNames.size() + partNames.size() + 4);
+    return Batchable.runBatched(batch, colNames, new Batchable<String, ColumnStatisticsObj>() {
       @Override
       public List<ColumnStatisticsObj> run(final List<String> inputColNames) throws MetaException {
-        return columnStatisticsObjForPartitionsBatch(catName, dbName, tableName,
-            partNames, inputColNames, engine, areAllPartsFound,
-            useDensityFunctionForNDVEstimation, ndvTuner, enableBitVector, enableKll);
-      }
-    });
-  }
-
-  /**
-   * Should be called with the list short enough to not trip up Oracle/etc.
-   */
-  private List<ColumnStatisticsObj> columnStatisticsObjForPartitionsBatch(
-      String catName,
-      String dbName, String tableName,
-      List<String> partNames, List<String> colNames, String engine,
-      boolean areAllPartsFound, boolean useDensityFunctionForNDVEstimation,
-      double ndvTuner, boolean enableBitVector,
-      boolean enableKll) throws MetaException {
-    if (enableBitVector || enableKll) {
-      return aggrStatsUseJava(catName, dbName, tableName, partNames,
-          colNames, engine, areAllPartsFound, useDensityFunctionForNDVEstimation,
-          ndvTuner, enableBitVector, enableKll);
-    } else {
-      return aggrStatsUseDB(catName, dbName, tableName, partNames, colNames, engine,
+        if (enableBitVector || enableKll) {
+          return aggrStatsUseJava(catName, dbName, tableName, partNames,
+              colNames, engine, areAllPartsFound, useDensityFunctionForNDVEstimation,
+              ndvTuner, enableBitVector, enableKll);
+        } else {
+          return aggrStatsUseDB(catName, dbName, tableName, partNames, colNames, engine,
Review Comment:
The batching callback ignores the per-batch `inputColNames` and instead passes the full `colNames` list into `aggrStatsUseJava`/`aggrStatsUseDB`. This defeats batching (it can still exceed the DB parameter limit) and will duplicate results, because each batch run processes all columns again. Use `inputColNames` in the per-batch query/aggregation paths (and only compute the batch size based on the parameters for that single batch).
```suggestion
              inputColNames, engine, areAllPartsFound, useDensityFunctionForNDVEstimation,
              ndvTuner, enableBitVector, enableKll);
        } else {
          return aggrStatsUseDB(catName, dbName, tableName, partNames, inputColNames, engine,
```
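To illustrate the failure mode in isolation, here is a minimal generic sketch (a hypothetical stand-in, not the real Hive `Batchable` API): a correct runner hands each chunk to the callback, so a callback that ignores its chunk argument and reprocesses the full input emits one full copy of the results per batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class BatchingSketch {
  // Minimal stand-in for Batchable.runBatched: splits the input into chunks
  // of at most batchSize items, hands each chunk (not the full list) to the
  // callback, and concatenates the per-chunk results.
  public static <I, R> List<R> runBatched(int batchSize, List<I> input,
      Function<List<I>, List<R>> callback) {
    List<R> result = new ArrayList<>();
    for (int i = 0; i < input.size(); i += batchSize) {
      List<I> chunk = input.subList(i, Math.min(i + batchSize, input.size()));
      result.addAll(callback.apply(chunk));
    }
    return result;
  }
}
```

A callback written as `chunk -> process(chunk)` yields exactly one result per input item; one written as `chunk -> process(fullList)` yields `numBatches * fullList.size()` results, which is the duplication the review describes.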
##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/DatabaseProduct.java:
##########
@@ -852,10 +852,23 @@ public int getMaxRows(int batch, int paramSize) {
       // SQL Server supports a maximum of 2100 parameters in a request. Adjust the maxRowsInBatch accordingly
       int maxAllowedRows = (2100 - paramSize) / paramSize;
       return Math.min(batch, maxAllowedRows);
+    } else if (isPOSTGRES()) {
+      int maxAllowedRows = (32767 - paramSize) / paramSize;
+      return Math.min(batch, maxAllowedRows);
     }
     return batch;
   }
+  public int getMaxBatch(int batch, int totalParameters) {
+    int minBatch = batch;
+    if (isSQLSERVER()) {
+      minBatch = (totalParameters + 2100) / 2100;
+    } else if (isPOSTGRES()) {
+      minBatch = (totalParameters + 32767) / 32767;
+    }
+    return batch <= 0 ? minBatch : Math.max(batch, minBatch);
Review Comment:
`getMaxBatch` currently computes a value that looks like a *number of batches* (ceil(totalParameters/limit)) and is then used as the `Batchable.runBatched` *batch size* throughout the PR. For `batchSize == NO_BATCHING (-1)` this will often return `1`, turning what used to be a single query into N single-item queries (a major performance regression), and it still does not enforce the per-statement parameter limit. Rework this helper to return a safe *batch size* (max items per batch) based on the DB parameter limit and the fixed/variable parameter counts, and ensure it never returns 0.
```suggestion
    // totalParameters is treated as the per-item parameter count. Compute the maximum
    // number of items we can safely include in a single batch without exceeding the
    // database's per-statement parameter limit.
    int maxItemsPerBatch;
    if (totalParameters <= 0) {
      // If we cannot determine per-item parameter usage, fall back to the configured
      // batch size without applying DB-specific limits.
      return batch;
    }
    if (isSQLSERVER()) {
      // SQL Server supports a maximum of 2100 parameters in a request.
      maxItemsPerBatch = 2100 / totalParameters;
    } else if (isPOSTGRES()) {
      // PostgreSQL supports a maximum of 32767 parameters in a request.
      maxItemsPerBatch = 32767 / totalParameters;
    } else {
      // No known hard parameter limit; use the configured batch size as-is.
      return batch;
    }
    // Ensure we never return 0; at least one item per batch is always allowed.
    if (maxItemsPerBatch <= 0) {
      maxItemsPerBatch = 1;
    }
    // If batch <= 0 (e.g., NO_BATCHING), choose the largest safe batch size that
    // respects the DB parameter limit. Otherwise, cap the requested batch size at
    // the safe maximum.
    if (batch <= 0) {
      return maxItemsPerBatch;
    }
    return Math.min(batch, maxItemsPerBatch);
```
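As a compact, testable sketch of the suggested rework (the names and the exact parameter accounting are assumptions for illustration, not the PR's code): compute the maximum items per batch from the DB's per-statement parameter limit (e.g., 2100 for SQL Server, 32767 for PostgreSQL), reserve room for fixed parameters outside the IN list, clamp to at least 1, and treat a non-positive requested batch (`NO_BATCHING`) as "as large as safely possible".

```java
public class BatchSizeSketch {
  // Hypothetical helper: returns a safe *batch size* (max items per batch),
  // never a number of batches and never 0.
  //   requestedBatch : configured batch size; <= 0 means NO_BATCHING
  //   paramLimit     : DB per-statement parameter limit (2100, 32767, ...)
  //   fixedParams    : parameters outside the per-item IN list (tbl/db/cat, ...)
  //   paramsPerItem  : parameters consumed by each batched item
  public static int safeBatchSize(int requestedBatch, int paramLimit,
      int fixedParams, int paramsPerItem) {
    if (paramsPerItem <= 0) {
      // Cannot reason about per-item usage; leave the configured size alone.
      return requestedBatch;
    }
    // Items that fit after reserving room for the fixed parameters,
    // clamped so we always allow at least one item per batch.
    int maxItems = Math.max(1, (paramLimit - fixedParams) / paramsPerItem);
    // NO_BATCHING means "as large as safely possible"; otherwise cap the
    // requested size at the safe maximum.
    return requestedBatch <= 0 ? maxItems : Math.min(requestedBatch, maxItems);
  }
}
```

With this shape, `safeBatchSize(-1, 2100, 3, 1)` yields 2097 (one large batch under SQL Server's limit) instead of collapsing `NO_BATCHING` to single-item queries.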
##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/tools/MetaToolObjectStore.java:
##########
@@ -842,20 +845,21 @@ private void collectTabFormatSummary(List<Long>
transactionalTables, Long tableI
private void collectBasicStats(Map<Long, MetadataTableSummary> summaries,
Set<Long> nonPartedTabs,
Set<Long> partedTabs) throws MetaException {
+ int batch = dbType.getMaxBatch(batchSize, nonPartedTabs.size());
String queryText0 = "select \"TBL_ID\", \"PARAM_KEY\", CAST(" +
dbType.toVarChar("\"PARAM_VALUE\"") + " AS decimal(21,0)) from \"TABLE_PARAMS\"
where \"PARAM_KEY\" " +
"in ('" + StatsSetupConst.TOTAL_SIZE + "', '" +
StatsSetupConst.NUM_FILES + "', '" + StatsSetupConst.ROW_COUNT + "') and
\"TBL_ID\" in (";
- runBatched(batchSize, new ArrayList<>(nonPartedTabs), new Batchable<Long,
Void>() {
+ runBatched(batch, new ArrayList<>(nonPartedTabs), new Batchable<Long,
Void>() {
@Override
public List<Void> run(List<Long> input) throws Exception {
collectBasicStats(queryText0, input, summaries, "");
return Collections.emptyList();
}
});
-
+ batch = dbType.getMaxBatch(batchSize, partedTabs.size());
String queryText1 = "select \"TBL_ID\", \"PARAM_KEY\", sum(CAST(" +
dbType.toVarChar("\"PARAM_VALUE\"") + " AS decimal(21,0))) from \"PARTITIONS\"
t " +
"join \"PARTITION_PARAMS\" p on p.\"PART_ID\" = t.\"PART_ID\" where
\"PARAM_KEY\" " +
"in ('" + StatsSetupConst.TOTAL_SIZE + "', '" +
StatsSetupConst.NUM_FILES + "', '" + StatsSetupConst.ROW_COUNT + "') and
t.\"TBL_ID\" in (";
- runBatched(batchSize, new ArrayList<>(partedTabs), new Batchable<Long,
Void>() {
+ runBatched(batch, new ArrayList<>(partedTabs), new Batchable<Long, Void>() {
@Override
Review Comment:
Indentation is inconsistent for the second half of the method (`batch = ...`, `String queryText1 = ...`, and the following `runBatched` call). If this module is checked by checkstyle/spotless, this will likely fail formatting checks. Align these lines with the surrounding 4-space indentation used in this class.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]