Re: [PR] docs: Migration guide for MVDs to arrays (druid)

2024-06-06 Thread via GitHub


vtlim commented on code in PR #16516:
URL: https://github.com/apache/druid/pull/16516#discussion_r1630030691


##
docs/release-info/migr-mvd-array.md:
##
@@ -0,0 +1,135 @@
+---
+id: migr-mvd-array
+title: "Migration guide: MVDs to arrays"
+sidebar_label: MVDs to arrays
+---
+
+
+
+
+Druid now supports SQL-compliant [arrays](../querying/arrays.md), and we recommend using arrays over [multi-value dimensions](../querying/multi-value-dimensions.md) (MVDs) whenever possible.
+Use arrays for new projects and for complex use cases involving multiple data types. Reserve MVDs for specific use cases, such as operating directly on individual values as if they were regular strings. If your operations involve complete arrays of values, including the ordering of values within a row, use arrays rather than MVDs.
+
+## Comparison between arrays and MVDs
+
+The following table compares the general behavior between arrays and MVDs.
+For specific query differences between arrays and MVDs, see [Query differences 
between arrays and MVDs](#query-differences-between-arrays-and-mvds).
+
+|  | Arrays | Multi-value dimensions (MVDs) |
+|---|---|---|
+| Data types | Supports VARCHAR, BIGINT, and DOUBLE element types (ARRAY<STRING>, ARRAY<LONG>, ARRAY<DOUBLE>) | Only supports arrays of strings (VARCHAR) |
+| SQL compliance | Behaves like standard SQL arrays, with SQL-compliant behavior | Does not behave like standard SQL arrays; requires special SQL functions |
+| Ingestion | JSON arrays are ingested as Druid arrays.<br/>Managed through the query context parameter `arrayIngestMode` in SQL-based ingestion (supported options: `array`, `mvd`, `none`). Note that if you set this mode to `none`, Druid raises an exception if you try to store any type of array. | JSON arrays are ingested as multi-value dimensions.<br/>Managed using functions like [ARRAY_TO_MV](../querying/sql-functions.md#array_to_mv) in SQL-based ingestion |
+| Filtering and grouping | Filters and groupings match the entire array value.<br/>Can be used as GROUP BY keys, grouping based on the entire array value | Filters match any value within the array.<br/>Grouping generates a group for each individual value, similar to an implicit UNNEST |
+| Conversion | Convert an MVD to an array using [MV_TO_ARRAY](../querying/sql-functions.md#mv_to_array) | Convert an array to an MVD using [ARRAY_TO_MV](../querying/sql-functions.md#array_to_mv) |
+
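+For example, here is a minimal sketch of converting between the two representations in a query, using the `tags` columns from the example tables later in this guide:
+
+```sql
+-- Read an MVD column as a SQL array
+SELECT MV_TO_ARRAY("tags") FROM "mvd_example"
+```
+
+```sql
+-- Read an array column as an MVD
+SELECT ARRAY_TO_MV("tags") FROM "array_example"
+```
+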
+## Query differences between arrays and MVDs
+
+In SQL queries, Druid operates on arrays differently than MVDs.
+A value in an array column is treated as a single array entity (SQL ARRAY), 
whereas a value in an MVD column is treated as individual strings (SQL VARCHAR).
+
+For example, consider the same value, `['a', 'b', 'c']` ingested into an array 
column and an MVD column.
+In your query, you want to filter results by comparing some value with `['a', 
'b', 'c']`.
+
+* For array columns, Druid only returns the row when an equality filter 
matches the entire array.  
+For example: `WHERE "array_column" = ARRAY['a', 'b', 'c']`.
+
+* For MVD columns, Druid returns the row when an equality filter matches any 
value of the MVD.  
+For example, any of the following filters returns the row for the query:  
+`WHERE "mvd_column" = 'a'`  
+`WHERE "mvd_column" = 'b'`  
+`WHERE "mvd_column" = 'c'`
+
+Note this difference between arrays and MVDs when you write queries that 
involve filtering or grouping.
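+
+For instance, here is a minimal sketch of the grouping difference, assuming a row whose `tags` value is `['t1', 't2']` in the example tables used below:
+
+```sql
+-- Array column: the entire array is the grouping key,
+-- so this row falls into a single group, ['t1', 't2'].
+SELECT "tags", COUNT(*) AS "cnt"
+FROM "array_example"
+GROUP BY 1
+```
+
+```sql
+-- MVD column: each individual value forms a group (similar to an implicit UNNEST),
+-- so this row is counted in both the 't1' group and the 't2' group.
+SELECT "tags", COUNT(*) AS "cnt"
+FROM "mvd_example"
+GROUP BY 1
+```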
+
+The following examples highlight a few analogous queries between arrays and 
MVDs.
+For more information and examples, see [Querying 
arrays](../querying/arrays.md#querying-arrays) and [Querying multi-value 
dimensions](../querying/multi-value-dimensions.md#querying-multi-value-dimensions).
+
+### Example: an element in an array
+
+Filter rows that have a certain value in the array or MVD.
+
+#### Array
+
+```sql
+SELECT *
+FROM "array_example"
+WHERE ARRAY_CONTAINS(tags, 't3')
+```
+
+#### MVD
+
+```sql
+SELECT *
+FROM "mvd_example"
+WHERE tags = 't3'
+```
+
+### Example: overlap of two arrays
+
+Filter rows for which the array or MVD overlaps a reference array.
+
+#### Array
+
+```sql
+SELECT *
+FROM "array_example"
+WHERE ARRAY_OVERLAP(tags, ARRAY['t1', 't7'])
+```
+
+#### MVD
+
+```sql
+SELECT *
+FROM "mvd_example"
+WHERE MV_OVERLAP(tags, ARRAY['t1', 't7'])
+```

Review Comment:
   @2bethere you mentioned adding an example for ARRAY_OVERLAP but these 
examples aren't so different between MVDs and arrays. Did you have another use 
case in mind?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[PR] Update site for Druid 30 [druid-website]

2024-06-06 Thread via GitHub


adarshsanjeev opened a new pull request, #277:
URL: https://github.com/apache/druid-website/pull/277

   Update 30.0.0 staging site
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[PR] docs: remove outdated druidversion var from a page (druid)

2024-06-06 Thread via GitHub


317brian opened a new pull request, #16570:
URL: https://github.com/apache/druid/pull/16570

   The `{{DRUIDVERSION}}` bits on this page were added a while ago:
   
   Line 368: last year (26.0.0)
   Line 434: 4 years ago
   
   I think they should have been hardcoded at the time since the statements 
mark a change in behavior. But since they were both so long ago, it seems like 
we can just remove them instead of trying to track down the version they 
correspond to.
   
   At a minimum, I think we should remove the 4-year-old reference. If we wanted to, we could update line 368 to the actual version.
   
   This PR has:
   
   - [x] been self-reviewed.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Support Dynamic Peon Pod Template Selection in K8s extension (druid)

2024-06-06 Thread via GitHub


suneet-s commented on code in PR #16510:
URL: https://github.com/apache/druid/pull/16510#discussion_r1630003561


##
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/execution/ExecutionConfig.java:
##
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.k8s.overlord.execution;
+
+import com.fasterxml.jackson.annotation.JsonSubTypes;
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+
+/**
+ * Represents the configuration for task execution within a Kubernetes 
environment.
+ * This interface allows for dynamic configuration of task execution 
strategies based
+ * on specified behavior strategies.
+ */
+@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl = 
DefaultExecutionConfig.class)
+@JsonSubTypes(value = {
+@JsonSubTypes.Type(name = "default", value = DefaultExecutionConfig.class)
+})
+public interface ExecutionConfig
+{
+  String CONFIG_KEY = "k8s.taskrunner.config";

Review Comment:
   YongGang and I discussed this offline, and I better understand the intent of 
this config object. It seems like this is trying to provide similar 
functionality as `KubernetesTaskRunnerConfig` but via the dynamic config. It 
makes sense to have an encompassing dynamic config object for this extension. 
   
   Some suggested names that better indicate its purpose: `KubernetesTaskRunnerDynamicConfig`, `KubernetesPeonDynamicConfig`, `KubernetesTaskExecutionConfig`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[PR] Update site for Druid 30 [druid-website-src]

2024-06-06 Thread via GitHub


adarshsanjeev opened a new pull request, #479:
URL: https://github.com/apache/druid-website-src/pull/479

   Update 30.0.0 staging site
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Add site for Druid 30 [druid-website-src]

2024-06-06 Thread via GitHub


adarshsanjeev closed pull request #478: Add site for Druid 30
URL: https://github.com/apache/druid-website-src/pull/478


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [I] Bring Azure functionality to core functionality (druid)

2024-06-06 Thread via GitHub


zachjsh closed issue #9240: Bring Azure functionality to core functionality
URL: https://github.com/apache/druid/issues/9240


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [I] Bring Azure functionality to core functionality (druid)

2024-06-06 Thread via GitHub


zachjsh commented on issue #9240:
URL: https://github.com/apache/druid/issues/9240#issuecomment-2153081065

   This work was done in https://github.com/apache/druid/pull/9394 and in previous commits that improved the quality of this extension. Closing this out.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Move Azure extension into Core (druid)

2024-06-06 Thread via GitHub


amaechler commented on PR #9394:
URL: https://github.com/apache/druid/pull/9394#issuecomment-2152892043

   Closes #9240.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)

2024-06-06 Thread via GitHub


kfaraz commented on PR #16481:
URL: https://github.com/apache/druid/pull/16481#issuecomment-2152891872

   Happy to help, @Akshat-Jain !  
   Thank you for the changes and for the discussion!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] [Backport] Fix backwards compatibility with centralized schema config in partial_index_merge tasks (druid)

2024-06-06 Thread via GitHub


cryptoe merged PR #16566:
URL: https://github.com/apache/druid/pull/16566


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Pin Curator dependencies to 5.3.0 until CURATOR-696 has been resolved (druid)

2024-06-06 Thread via GitHub


cryptoe closed pull request #16444: Pin Curator dependencies to 5.3.0 until 
CURATOR-696 has been resolved
URL: https://github.com/apache/druid/pull/16444


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Pin Curator dependencies to 5.3.0 until CURATOR-696 has been resolved (druid)

2024-06-06 Thread via GitHub


cryptoe commented on PR #16444:
URL: https://github.com/apache/druid/pull/16444#issuecomment-2152787612

   Since we are already on the path for https://github.com/apache/druid/pull/16528, closing this ticket.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)

2024-06-06 Thread via GitHub


Akshat-Jain commented on code in PR #16481:
URL: https://github.com/apache/druid/pull/16481#discussion_r1629670678


##
extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/S3UploadManager.java:
##
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.storage.s3.output;
+
+import com.amazonaws.services.s3.model.UploadPartRequest;
+import com.amazonaws.services.s3.model.UploadPartResult;
+import com.google.common.annotations.VisibleForTesting;
+import com.google.inject.Inject;
+import org.apache.druid.guice.ManageLifecycle;
+import org.apache.druid.java.util.common.RetryUtils;
+import org.apache.druid.java.util.common.concurrent.Execs;
+import org.apache.druid.java.util.common.lifecycle.LifecycleStart;
+import org.apache.druid.java.util.common.lifecycle.LifecycleStop;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.druid.storage.s3.S3Utils;
+import org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3;
+import org.apache.druid.utils.RuntimeInfo;
+
+import java.io.File;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Future;
+
+/**
+ * This class manages uploading files to S3 in chunks, while ensuring that the
+ * number of chunks currently present on local disk does not exceed a specific 
limit.
+ */
+@ManageLifecycle
+public class S3UploadManager
+{
+  private final ExecutorService uploadExecutor;
+
+  private static final Logger log = new Logger(S3UploadManager.class);
+
+  @Inject
+  public S3UploadManager(S3OutputConfig s3OutputConfig, S3ExportConfig 
s3ExportConfig, RuntimeInfo runtimeInfo)
+  {
+int poolSize = Math.max(4, runtimeInfo.getAvailableProcessors());
+int maxNumConcurrentChunks = computeMaxNumConcurrentChunks(s3OutputConfig, 
s3ExportConfig);
+this.uploadExecutor = createExecutorService(poolSize, 
maxNumConcurrentChunks);
+log.info("Initialized executor service for S3 multipart upload with pool 
size [%d] and work queue capacity [%d]",
+ poolSize, maxNumConcurrentChunks);
+  }
+
+  /**
+   * Computes the maximum number of concurrent chunks for an S3 multipart 
upload.
+   * We want to determine the maximum number of concurrent chunks on disk 
based on the maximum value of chunkSize
+   * between the 2 configs: S3OutputConfig and S3ExportConfig.
+   *
+   * @param s3OutputConfig  The S3 output configuration, which may specify a 
custom chunk size.
+   * @param s3ExportConfig  The S3 export configuration, which may also 
specify a custom chunk size.
+   * @return The maximum number of concurrent chunks.
+   */
+  @VisibleForTesting
+  int computeMaxNumConcurrentChunks(S3OutputConfig s3OutputConfig, 
S3ExportConfig s3ExportConfig)

Review Comment:
   Resolving this comment as per the discussion at 
https://github.com/apache/druid/pull/16481#discussion_r1629278464



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)

2024-06-06 Thread via GitHub


Akshat-Jain commented on PR #16481:
URL: https://github.com/apache/druid/pull/16481#issuecomment-2152681440

   @kfaraz Thank you for the extremely exhaustive review comments, got to learn 
a LOT from them throughout this PR!  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [I] [DRAFT] 30.0.0 release notes (druid)

2024-06-06 Thread via GitHub


adarshsanjeev commented on issue #16505:
URL: https://github.com/apache/druid/issues/16505#issuecomment-2152675032

   @frankgrimes97 Currently, Druid 30 is at a stage where only regression fixes can be backported.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Remove incorrect utf8 conversion of ResultCache keys (druid)

2024-06-06 Thread via GitHub


github-advanced-security[bot] commented on code in PR #16569:
URL: https://github.com/apache/druid/pull/16569#discussion_r1629654479


##
server/src/test/java/org/apache/druid/client/cache/CaffeineCacheTest.java:
##
@@ -65,6 +69,16 @@
 cache = CaffeineCache.create(cacheConfig);
   }
 
+  @Test
+  public void testKeyContainingNegativeBytest()

Review Comment:
   ## Missing Override annotation
   
   This method overrides 
[CacheTestBase.testKeyContainingNegativeBytest](1); it is 
advisable to add an Override annotation.
   
   [Show more 
details](https://github.com/apache/druid/security/code-scanning/7440)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [I] /v2 and /v2/candidates endpoints not respecting broker partition pruning - range partition (druid)

2024-06-06 Thread via GitHub


ColeAtCharter commented on issue #16222:
URL: https://github.com/apache/druid/issues/16222#issuecomment-2152602070

   @gianm -- thanks for looking at this.  I do think "Unexpected result example 
1: /druid/v2 endpoint" should never happen if broker partition pruning is 
working.  Druid returned a segmentId that should have been pruned at the 
broker.  That should not happen.  The one hypothetical "exception" is if the 
broker is combining the data query results with its own metadata about segments 
(before pruning) and including a dummy data result.  However, this is a bit of 
a stretch.
   
   Our team can rerun the test and see if there are more metrics to elucidate the extent to which the data queries are fanned out even with partition pruning enabled.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[PR] Remove incorrect utf8 conversion of ResultCache keys (druid)

2024-06-06 Thread via GitHub


kgyrtkirk opened a new pull request, #16569:
URL: https://github.com/apache/druid/pull/16569

   * remove the incorrect call to read the byte[] as utf8 (see the sketch below for why that conversion is lossy)
   * not sure about its origin - but this will push the contract for `Cache` implementations to handle the `byte[]` in `NamedKey` - it's already a `byte[]` - so they shouldn't expect any better than that...
   * use `Cache.NamedKey` instead of passing `byte[]` and creating the key at multiple places
   * added some tests to ensure that cache implementations are able to accept such keys
   
   Fixes #16552
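   
   For illustration only (not Druid code), here is a minimal sketch of why round-tripping arbitrary key bytes through UTF-8 is lossy:
   
   ```java
   import java.nio.charset.StandardCharsets;
   import java.util.Arrays;
   
   public class Utf8RoundTripDemo
   {
     public static void main(String[] args)
     {
       // Arbitrary cache key bytes that are not valid UTF-8.
       byte[] key = new byte[]{(byte) 0xFF, (byte) 0xFE, 0x01};
   
       // Decoding as UTF-8 replaces the invalid sequences with U+FFFD,
       // so re-encoding does not reproduce the original key.
       byte[] roundTripped = new String(key, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
   
       System.out.println(Arrays.equals(key, roundTripped)); // prints false
     }
   }
   ```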
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)

2024-06-06 Thread via GitHub


kfaraz commented on code in PR #16481:
URL: https://github.com/apache/druid/pull/16481#discussion_r1629424508


##
extensions-core/s3-extensions/src/test/java/org/apache/druid/storage/s3/output/S3UploadManagerTest.java:
##
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.storage.s3.output;
+
+import com.amazonaws.services.s3.model.UploadPartRequest;
+import com.amazonaws.services.s3.model.UploadPartResult;
+import org.apache.druid.java.util.common.HumanReadableBytes;
+import org.apache.druid.java.util.common.concurrent.Execs;
+import org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3;
+import org.apache.druid.utils.RuntimeInfo;
+import org.easymock.Capture;
+import org.easymock.EasyMock;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.File;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Future;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+public class S3UploadManagerTest
+{
+
+  private S3UploadManager s3UploadManager;
+  private S3OutputConfig s3OutputConfig;
+  private S3ExportConfig s3ExportConfig;
+  private static ExecutorService uploadExecutor;

Review Comment:
   this should not be static.



##
extensions-core/s3-extensions/src/test/java/org/apache/druid/storage/s3/output/S3UploadManagerTest.java:
##
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.storage.s3.output;
+
+import com.amazonaws.services.s3.model.UploadPartRequest;
+import com.amazonaws.services.s3.model.UploadPartResult;
+import org.apache.druid.java.util.common.HumanReadableBytes;
+import org.apache.druid.java.util.common.concurrent.Execs;
+import org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3;
+import org.apache.druid.utils.RuntimeInfo;
+import org.easymock.Capture;
+import org.easymock.EasyMock;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.File;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Future;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+public class S3UploadManagerTest
+{
+
+  private S3UploadManager s3UploadManager;
+  private S3OutputConfig s3OutputConfig;
+  private S3ExportConfig s3ExportConfig;
+  private static ExecutorService uploadExecutor;
+
+  @Before
+  public void setUp()
+  {
+s3OutputConfig = EasyMock.mock(S3OutputConfig.class);
+s3ExportConfig = EasyMock.mock(S3ExportConfig.class);

Review Comment:
   Please use concrete objects instead of mocks. `S3OutputConfig` and 
`S3ExportConfig` are plain POJOs that can be easily constructed.



##
extensions-core/s3-extensions/src/test/java/org/apache/druid/storage/s3/output/S3UploadManagerTest.java:
##
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable 

Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)

2024-06-06 Thread via GitHub


kfaraz commented on code in PR #16481:
URL: https://github.com/apache/druid/pull/16481#discussion_r1629419675


##
extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/S3UploadManager.java:
##
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.storage.s3.output;
+
+import com.amazonaws.services.s3.model.UploadPartRequest;
+import com.amazonaws.services.s3.model.UploadPartResult;
+import com.google.common.annotations.VisibleForTesting;
+import com.google.inject.Inject;
+import org.apache.druid.guice.ManageLifecycle;
+import org.apache.druid.java.util.common.RetryUtils;
+import org.apache.druid.java.util.common.concurrent.Execs;
+import org.apache.druid.java.util.common.lifecycle.LifecycleStart;
+import org.apache.druid.java.util.common.lifecycle.LifecycleStop;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.druid.storage.s3.S3Utils;
+import org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3;
+import org.apache.druid.utils.RuntimeInfo;
+
+import java.io.File;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Future;
+
+/**
+ * This class manages uploading files to S3 in chunks, while ensuring that the
+ * number of chunks currently present on local disk does not exceed a specific 
limit.
+ */
+@ManageLifecycle
+public class S3UploadManager
+{
+  private final ExecutorService uploadExecutor;
+
+  private static final Logger log = new Logger(S3UploadManager.class);
+
+  @Inject
+  public S3UploadManager(S3OutputConfig s3OutputConfig, S3ExportConfig 
s3ExportConfig, RuntimeInfo runtimeInfo)
+  {
+int poolSize = Math.max(4, runtimeInfo.getAvailableProcessors());
+int maxNumConcurrentChunks = computeMaxNumConcurrentChunks(s3OutputConfig, 
s3ExportConfig);
+this.uploadExecutor = createExecutorService(poolSize, 
maxNumConcurrentChunks);
+log.info("Initialized executor service for S3 multipart upload with pool 
size [%d] and work queue capacity [%d]",
+ poolSize, maxNumConcurrentChunks);
+  }
+
+  /**
+   * Computes the maximum number of concurrent chunks for an S3 multipart 
upload.
+   * We want to determine the maximum number of concurrent chunks on disk 
based on the maximum value of chunkSize
+   * between the 2 configs: S3OutputConfig and S3ExportConfig.
+   *
+   * @param s3OutputConfig  The S3 output configuration, which may specify a 
custom chunk size.
+   * @param s3ExportConfig  The S3 export configuration, which may also 
specify a custom chunk size.
+   * @return The maximum number of concurrent chunks.
+   */
+  @VisibleForTesting
+  int computeMaxNumConcurrentChunks(S3OutputConfig s3OutputConfig, 
S3ExportConfig s3ExportConfig)
+  {
+long chunkSize = S3OutputConfig.S3_MULTIPART_UPLOAD_MIN_PART_SIZE_BYTES;
+if (s3OutputConfig != null && s3OutputConfig.getChunkSize() != null) {
+  chunkSize = Math.max(chunkSize, s3OutputConfig.getChunkSize());
+}
+if (s3ExportConfig != null && s3ExportConfig.getChunkSize() != null) {
+  chunkSize = Math.max(chunkSize, 
s3ExportConfig.getChunkSize().getBytes());
+}
+
+return (int) (S3OutputConfig.S3_MULTIPART_UPLOAD_MAX_PART_SIZE_BYTES / 
chunkSize);

Review Comment:
   Sounds good.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[PR] Update pac4j version (druid)

2024-06-06 Thread via GitHub


pagrawal10 opened a new pull request, #16567:
URL: https://github.com/apache/druid/pull/16567

   
   
   
   
   Fixes #.
   
   
   
   
   
   ### Description
   
   
   
   
   
   
   
   
   
   
   
   
   This PR has:
   
   - [ ] been self-reviewed.
  - [ ] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)

2024-06-06 Thread via GitHub


Akshat-Jain commented on code in PR #16481:
URL: https://github.com/apache/druid/pull/16481#discussion_r1629278464


##
extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/S3UploadManager.java:
##
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.storage.s3.output;
+
+import com.amazonaws.services.s3.model.UploadPartRequest;
+import com.amazonaws.services.s3.model.UploadPartResult;
+import com.google.common.annotations.VisibleForTesting;
+import com.google.inject.Inject;
+import org.apache.druid.guice.ManageLifecycle;
+import org.apache.druid.java.util.common.RetryUtils;
+import org.apache.druid.java.util.common.concurrent.Execs;
+import org.apache.druid.java.util.common.lifecycle.LifecycleStart;
+import org.apache.druid.java.util.common.lifecycle.LifecycleStop;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.druid.storage.s3.S3Utils;
+import org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3;
+import org.apache.druid.utils.RuntimeInfo;
+
+import java.io.File;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Future;
+
+/**
+ * This class manages uploading files to S3 in chunks, while ensuring that the
+ * number of chunks currently present on local disk does not exceed a specific 
limit.
+ */
+@ManageLifecycle
+public class S3UploadManager
+{
+  private final ExecutorService uploadExecutor;
+
+  private static final Logger log = new Logger(S3UploadManager.class);
+
+  @Inject
+  public S3UploadManager(S3OutputConfig s3OutputConfig, S3ExportConfig 
s3ExportConfig, RuntimeInfo runtimeInfo)
+  {
+int poolSize = Math.max(4, runtimeInfo.getAvailableProcessors());
+int maxNumConcurrentChunks = computeMaxNumConcurrentChunks(s3OutputConfig, 
s3ExportConfig);
+this.uploadExecutor = createExecutorService(poolSize, 
maxNumConcurrentChunks);
+log.info("Initialized executor service for S3 multipart upload with pool 
size [%d] and work queue capacity [%d]",
+ poolSize, maxNumConcurrentChunks);
+  }
+
+  /**
+   * Computes the maximum number of concurrent chunks for an S3 multipart 
upload.
+   * We want to determine the maximum number of concurrent chunks on disk 
based on the maximum value of chunkSize
+   * between the 2 configs: S3OutputConfig and S3ExportConfig.
+   *
+   * @param s3OutputConfig  The S3 output configuration, which may specify a 
custom chunk size.
+   * @param s3ExportConfig  The S3 export configuration, which may also 
specify a custom chunk size.
+   * @return The maximum number of concurrent chunks.
+   */
+  @VisibleForTesting
+  int computeMaxNumConcurrentChunks(S3OutputConfig s3OutputConfig, 
S3ExportConfig s3ExportConfig)
+  {
+long chunkSize = S3OutputConfig.S3_MULTIPART_UPLOAD_MIN_PART_SIZE_BYTES;
+if (s3OutputConfig != null && s3OutputConfig.getChunkSize() != null) {
+  chunkSize = Math.max(chunkSize, s3OutputConfig.getChunkSize());
+}
+if (s3ExportConfig != null && s3ExportConfig.getChunkSize() != null) {
+  chunkSize = Math.max(chunkSize, 
s3ExportConfig.getChunkSize().getBytes());
+}
+
+return (int) (S3OutputConfig.S3_MULTIPART_UPLOAD_MAX_PART_SIZE_BYTES / 
chunkSize);

Review Comment:
   @kfaraz I had a discussion with Karan about it. The conclusion was that we 
don't need to block this PR for this review comment, and a follow-up task could 
be to add a metric for gathering data on how long the processing threads are 
actually blocked on this constraint. We could then tune this accordingly based 
on the gathered data.
   Hope that works!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[PR] [Backport] Fix backwards compatibility with centralized schema config in partial_index_merge tasks (druid)

2024-06-06 Thread via GitHub


findingrish opened a new pull request, #16566:
URL: https://github.com/apache/druid/pull/16566

   Backports https://github.com/apache/druid/pull/16556 & 
https://github.com/apache/druid/pull/16565 to Druid 30.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Log segment signature in AbstractSegmentMetadataCache at debug level (druid)

2024-06-06 Thread via GitHub


cryptoe merged PR #16565:
URL: https://github.com/apache/druid/pull/16565


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[PR] Log segment signature in AbstractSegmentMetadataCache at debug level (druid)

2024-06-06 Thread via GitHub


findingrish opened a new pull request, #16565:
URL: https://github.com/apache/druid/pull/16565

   Logging the signature for each segment could result in excessive logging, hence this change lowers the log level to debug.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [I] Broker OOM Error During SQL Execution. (druid)

2024-06-06 Thread via GitHub


lijie749808 closed issue #15343: Broker OOM Error During SQL Execution.
URL: https://github.com/apache/druid/issues/15343


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Fix backwards compatibility with centralized schema config in partial_index_merge tasks (druid)

2024-06-06 Thread via GitHub


cryptoe merged PR #16556:
URL: https://github.com/apache/druid/pull/16556


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Add IncrementalIndex test (druid)

2024-06-06 Thread via GitHub


kgyrtkirk commented on code in PR #16562:
URL: https://github.com/apache/druid/pull/16562#discussion_r1628970678


##
extensions-contrib/spectator-histogram/src/test/java/org/apache/druid/spectator/histogram/SpectatorHistogramAggregatorTest.java:
##
@@ -365,6 +383,60 @@ public void testBuildingAndCountingHistograms() throws 
Exception
 Assert.assertEquals(9.0, (Double) results.get(0).get(1), 0.001);
   }
 
+  @Test
+  public void testBuildingAndCountingHistogramsIncrementalIndex() throws 
Exception
+  {
+NullHandling.initializeForTestsWithValues(true, true);
+List<String> dimensions = Collections.singletonList("d");
+int n = 10;
+DateTime startOfDay = 
DateTime.now(DateTimeZone.UTC).withTimeAtStartOfDay();
+List<InputRow> inputRows = new ArrayList<>(n);
+for (int i = 1; i <= n; i++) {
+  String val = String.valueOf(i * 1.0d);
+
+  inputRows.add(new MapBasedInputRow(
+  startOfDay.plusMinutes(i),
+  dimensions,
+  ImmutableMap.of("x", i, "d", val)
+  ));
+}
+
+IncrementalIndex index = AggregationTestHelper.createIncrementalIndex(
+inputRows.iterator(),
+new NoopInputRowParser(null),
+new AggregatorFactory[]{
+new CountAggregatorFactory("count"),
+new SpectatorHistogramAggregatorFactory("histogram", "x")
+},
+0,
+Granularities.NONE,
+100,
+false
+);
+
+ImmutableList<Segment> segments = ImmutableList.of(
+new IncrementalIndexSegment(index, SegmentId.dummy("test")),
+helper.persistIncrementalIndex(index, null)
+);
+
+GroupByQuery query = new GroupByQuery.Builder()
+.setDataSource("test")
+.setGranularity(Granularities.HOUR)
+.setInterval("1970/2050")
+.setAggregatorSpecs(
+new DoubleSumAggregatorFactory("doubleSum", "histogram")

Review Comment:
   I don't really see why this is beneficial at all - why can't the user invoke a method which will translate the histogram to a literal and then aggregate those values with sum?
   
   I think an aggregate which is represented by a sketch or something could normally not be represented by a single literal number; if we want to introduce something which translates the sketch to some other type automatically - I think that should be registered properly and not hidden in the aggregator as some bonus functionality.
   
   Am I alone with this mindset?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Attempt to coerce COMPLEX to number in numeric aggregators. (druid)

2024-06-05 Thread via GitHub


gianm commented on PR #16564:
URL: https://github.com/apache/druid/pull/16564#issuecomment-2151419801

   @bsyk @maytasm note I changed the last assert from #16562 from:
   
   ```
   Assert.assertEquals(n, (Double) results.get(0).get(1), 0.001);
   ```
   
   to:
   
   ```
   Assert.assertEquals(n * segments.size(), (Double) results.get(0).get(1), 
0.001);
   ```
   
   Please let me know if this looks right.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[PR] Attempt to coerce COMPLEX to number in numeric aggregators. (druid)

2024-06-05 Thread via GitHub


gianm opened a new pull request, #16564:
URL: https://github.com/apache/druid/pull/16564

   PR #15371 eliminated ObjectColumnSelector's built-in implementations of 
numeric methods, which had been marked deprecated.
   
   However, some complex types, like SpectatorHistogram, can be successfully 
coerced to number. The documentation for spectator histograms encourages taking 
advantage of this by aggregating complex columns with doubleSum and longSum. 
Currently, this doesn't work properly for IncrementalIndex, where the behavior 
relied on those deprecated ObjectColumnSelector methods.
   
   This patch fixes the behavior by making two changes:
   
   1) SimpleXYZAggregatorFactory (XYZ = type; base class for simple numeric 
aggregators;
  all of these extend NullableNumericAggregatorFactory) use getObject for 
STRING
  and COMPLEX. Previously, getObject was only used for STRING.
   
   2) NullableNumericAggregatorFactory (base class for simple numeric 
aggregators)
  has a new protected method "useGetObject". This allows the base class to
  correctly check for null (using getObject or isNull).
   
   The patch also adds a test for SpectatorHistogram + doubleSum + 
IncrementalIndex. Thanks @bsyk for the test, which I pulled from #16562.
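   
   As a rough illustration of the null-check idea in change 2, here is a hypothetical simplification (the interface and class below are made up for illustration and are not the actual Druid classes):
   
   ```java
   // Hypothetical selector interface standing in for Druid's column value selectors.
   interface ValueSelector
   {
     boolean isNull();     // primitive-style null check
     double getDouble();   // primitive read
     Object getObject();   // object read, e.g. a histogram that coerces to a Number
   }
   
   // Hypothetical base aggregator: when useGetObject() is true (STRING/COMPLEX input),
   // nulls are detected via getObject() and the value is coerced from the object;
   // otherwise the primitive isNull()/getDouble() path is used.
   abstract class NullableNumericAggregateSketch
   {
     protected abstract boolean useGetObject();
   
     double aggregate(ValueSelector selector, double runningSum)
     {
       final boolean isNull = useGetObject() ? selector.getObject() == null : selector.isNull();
       if (isNull) {
         return runningSum;
       }
       final double value = useGetObject()
                            ? ((Number) selector.getObject()).doubleValue()
                            : selector.getDouble();
       return runningSum + value;
     }
   }
   ```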


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[PR] Add logs and metrics for append segments upgraded by a concurrent replace task (druid)

2024-06-05 Thread via GitHub


kfaraz opened a new pull request, #16563:
URL: https://github.com/apache/druid/pull/16563

   ### Changes
   -  Add metrics `segment/upgraded/count` and `segment/upgradedRealtime/count`
   emitted with dimensions `taskId`, `dataSource`, `taskType`, etc.
   - Add new logs and clean up existing ones
   - Minor method renames


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)

2024-06-05 Thread via GitHub


Akshat-Jain commented on code in PR #16481:
URL: https://github.com/apache/druid/pull/16481#discussion_r1628749775


##
extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/RetryableS3OutputStream.java:
##
@@ -103,27 +94,28 @@ public class RetryableS3OutputStream extends OutputStream
   private boolean error;
   private boolean closed;
 
-  public RetryableS3OutputStream(
-  S3OutputConfig config,
-  ServerSideEncryptingAmazonS3 s3,
-  String s3Key
-  ) throws IOException
-  {
+  /**
+   * Helper class for calculating maximum number of simultaneous chunks 
allowed on local disk.
+   */
+  private final S3UploadManager uploadManager;
 
-this(config, s3, s3Key, true);
-  }
+  /**
+   * A list of futures to allow us to wait for completion of all uploadPart() 
calls
+   * before hitting {@link 
ServerSideEncryptingAmazonS3#completeMultipartUpload(CompleteMultipartUploadRequest)}.
+   */
+  private final List<Future<UploadPartResult>> futures = new ArrayList<>();

Review Comment:
   Marking this resolved as per the discussion at 
https://github.com/apache/druid/pull/16481#discussion_r1628218203



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Fallback vectorization for FunctionExpr and BaseMacroFunctionExpr. (druid)

2024-06-05 Thread via GitHub


gianm merged PR #16366:
URL: https://github.com/apache/druid/pull/16366





Re: [PR] Fix serde for ArrayOfDoublesSketchConstantPostAggregator. (druid)

2024-06-05 Thread via GitHub


gianm merged PR #16550:
URL: https://github.com/apache/druid/pull/16550





Re: [I] Issue with PostAggregator arrayOfDoublesSketchConstant in latest Druid 29.0.1 (druid)

2024-06-05 Thread via GitHub


gianm closed issue #16539: Issue with PostAggregator 
arrayOfDoublesSketchConstant in latest Druid 29.0.1
URL: https://github.com/apache/druid/issues/16539





Re: [PR] Simplify serialized form of JsonInputFormat. (druid)

2024-06-05 Thread via GitHub


gianm merged PR #15691:
URL: https://github.com/apache/druid/pull/15691





Re: [PR] Column name in parse exceptions (druid)

2024-06-05 Thread via GitHub


TSFenwick commented on code in PR #16529:
URL: https://github.com/apache/druid/pull/16529#discussion_r1628629431


##
processing/src/main/java/org/apache/druid/segment/DimensionHandlerUtils.java:
##
@@ -330,25 +329,67 @@ public static Long convertObjectToLong(@Nullable Object valObj, boolean reportPa
     } else if (valObj instanceof String) {
       Long ret = DimensionHandlerUtils.getExactLongFromDecimalString((String) valObj);
       if (reportParseExceptions && ret == null) {
-        throw new ParseException((String) valObj, "could not convert value [%s] to long", valObj);
+        final String message = (objectKey != null) ? StringUtils.nonStrictFormat(
+            "could not convert value [%s] to long for dimension [%s]",
+            valObj,
+            objectKey
+        ) : StringUtils.nonStrictFormat(
+            "could not convert value [%s] to long",
+            valObj
+        );
+        throw new ParseException((String) valObj, message);
       }
       return ret;
     } else if (valObj instanceof List) {
-      throw new ParseException(
-          valObj.getClass().toString(),
+      final String message = (objectKey != null) ? StringUtils.nonStrictFormat(
+          "Could not ingest value [%s] as long for dimension [%s]. A long column cannot have multiple values in the same row.",
+          valObj,
+          objectKey
+      ) : StringUtils.nonStrictFormat(
           "Could not ingest value %s as long. A long column cannot have multiple values in the same row.",

Review Comment:
   Should the list value format string have the `[]`'s around it? It would end up looking like `[[1,2,3]]`. It's confusing to me as a developer to know whether the value should have square brackets or not depending on the type...
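
   To make the concern concrete: `%s` on a `List` already renders brackets, so wrapping the placeholder in literal `[]` doubles them for multi-value inputs. A tiny standalone example (plain `String.format`, which treats `%s` the same way the `nonStrictFormat` calls above do):

   ```java
   import java.util.Arrays;
   import java.util.List;

   public class BracketExample
   {
     public static void main(String[] args)
     {
       List<Integer> values = Arrays.asList(1, 2, 3);

       // List#toString already includes brackets...
       System.out.println(String.format("could not convert value %s to long", values));
       // -> could not convert value [1, 2, 3] to long

       // ...so wrapping the placeholder in [] doubles them for list values.
       System.out.println(String.format("could not convert value [%s] to long", values));
       // -> could not convert value [[1, 2, 3]] to long
     }
   }
   ```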






Re: [PR] Extension to read and ingest iceberg data files (druid)

2024-06-05 Thread via GitHub


a2l007 commented on PR #14329:
URL: https://github.com/apache/druid/pull/14329#issuecomment-2151175460

   @maytasm Yeah, we should definitely call that out in the docs and also print some warning/error messages. I'll make sure to include these in my next PR.
   Regarding filter pushdown, are you suggesting doing a filter pushdown to Druid in all cases, or only when the iceberg filter is on a non-partitioned column?





Re: [PR] Add interface method for returning canonical lookup name (druid)

2024-06-05 Thread via GitHub


abhishekrb19 commented on code in PR #16557:
URL: https://github.com/apache/druid/pull/16557#discussion_r1628399933


##
processing/src/main/java/org/apache/druid/query/lookup/LookupExtractorFactoryContainerProvider.java:
##
@@ -40,4 +40,9 @@ public interface LookupExtractorFactoryContainerProvider
* Returns a lookup container for the provided lookupName, if it exists.
*/
  Optional<LookupExtractorFactoryContainer> get(String lookupName);
+
+  /**
+   * Returns the canonical lookup name from a lookup name.
+   */
+  String getCanonicalLookupName(String lookupName);

Review Comment:
   @LakshSingla, good points. I think we need to expand the Javadoc to clarify the intent of the new method and/or rename it accordingly. We could also call it `getLookupName(String name)`, but that's a bit ambiguous imo. Thoughts on the naming?
   
   As far as 1 is concerned, the interface isn't annotated `@PublicAPI` or 
`@ExtensionsPoint`. Also, this PR https://github.com/apache/druid/pull/9281/ 
changed the method signature and added a new interface method without default 
implementations. Following that pattern, I believe it's safe to add new methods 
without providing a default implementation. In general, my understanding is 
that for custom extensions that directly use/extend interfaces that are not 
public APIs (i.e., not annotated with `PublicAPI` or `ExtensionsPoint`), the 
onus is on the developers maintaining custom implementations to sync with 
changes coming from upstream.
   
   For 2, `LookupSchema` powers the SQL view for the lookups configured in the system. For consistency, I think it'd make sense to also call the new function there, as you pointed out. I'm okay with doing that as a follow-up too.
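
   For readers following along, a minimal sketch of what the default implementation suggested by @LakshSingla could look like (the interface is reduced to the one method under discussion; this is illustrative, not the code that was merged):

   ```java
   // Reduced sketch; not the full LookupExtractorFactoryContainerProvider interface.
   public interface LookupExtractorFactoryContainerProviderSketch
   {
     /**
      * Returns the canonical name for the given lookup name. Defaulting to the
      * name itself would keep existing custom implementations source-compatible;
      * implementations with special lookup-name syntax can override it.
      */
     default String getCanonicalLookupName(String lookupName)
     {
       return lookupName;
     }
   }
   ```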






Re: [I] Druid query to get last N transactions from the db (druid)

2024-06-05 Thread via GitHub


github-actions[bot] commented on issue #14914:
URL: https://github.com/apache/druid/issues/14914#issuecomment-2151156678

   This issue has been marked as stale due to 280 days of inactivity.
   It will be closed in 4 weeks if no further activity occurs. If this issue is 
still
   relevant, please simply write any comment. Even if closed, you can still 
revive the
   issue at any time or discuss it on the d...@druid.apache.org list.
   Thank you for your contributions.





[PR] Add IncrementalIndex test (druid)

2024-06-05 Thread via GitHub


bsyk opened a new pull request, #16562:
URL: https://github.com/apache/druid/pull/16562

   Test that should repro the IncrementalIndex + ComplexColumn + longSum 
aggregator issue





[PR] More validation for Azure account config (druid)

2024-06-05 Thread via GitHub


amaechler opened a new pull request, #16561:
URL: https://github.com/apache/druid/pull/16561

   ### Description
   
   This PR marks two additional properties in the Azure deep storage config as 
required (`@NotNull`) (also see [the 
docs](https://druid.apache.org/docs/latest/development/extensions-core/azure/)):
   
   - `druid.azure.account`
   - `druid.azure.container`
   
   Both of these properties must be set. I also cleaned up the `AzureAccountConfig` class a bit (no more equality overrides; tests now verify equality via JSON serialization instead), grouped getters and setters together, and added some Javadocs to the `getBlobStorageEndpoint` helper function.
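
   A rough sketch of what marking Jackson-bound config properties as required looks like (the class and field layout here are illustrative, not the exact `AzureAccountConfig` diff):

   ```java
   import com.fasterxml.jackson.annotation.JsonProperty;
   import javax.validation.constraints.NotNull;

   public class ExampleAzureConfigSketch
   {
     // druid.azure.account must now be provided; config validation fails otherwise.
     @JsonProperty
     @NotNull
     private String account;

     // druid.azure.container must also be provided.
     @JsonProperty
     @NotNull
     private String container;

     public String getAccount()
     {
       return account;
     }

     public String getContainer()
     {
       return container;
     }
   }
   ```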
   
   
   
   This PR has:
   
   - [x] been self-reviewed.
   - [x] been tested in a test Druid cluster.
   





Re: [PR] Router: Authorize permissionless internal requests. (druid)

2024-06-05 Thread via GitHub


gianm merged PR #16419:
URL: https://github.com/apache/druid/pull/16419





Re: [PR] Fallback vectorization for FunctionExpr and BaseMacroFunctionExpr. (druid)

2024-06-05 Thread via GitHub


gianm commented on code in PR #16366:
URL: https://github.com/apache/druid/pull/16366#discussion_r1628508359


##
processing/src/main/java/org/apache/druid/math/expr/vector/FallbackVectorProcessor.java:
##
@@ -0,0 +1,422 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.math.expr.vector;
+
+import org.apache.druid.error.DruidException;
+import org.apache.druid.math.expr.Expr;
+import org.apache.druid.math.expr.ExprEval;
+import org.apache.druid.math.expr.ExprMacroTable;
+import org.apache.druid.math.expr.ExprType;
+import org.apache.druid.math.expr.ExpressionType;
+import org.apache.druid.math.expr.Function;
+
+import javax.annotation.Nullable;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Objects;
+import java.util.function.Supplier;
+
+/**
+ * Implementation of {@link ExprVectorProcessor} that adapts non-vectorized 
{@link Expr#eval(Expr.ObjectBinding)}.
+ * This allows non-vectorized expressions to participate in vectorized queries.
+ */
+public abstract class FallbackVectorProcessor implements 
ExprVectorProcessor
+{
+  final Supplier> fn;
+  final List adaptedArgs;
+
+  private final ExpressionType outputType;
+
+  private FallbackVectorProcessor(
+  final Supplier> fn,
+  final List adaptedArgs,
+  final ExpressionType outputType
+  )
+  {
+this.fn = fn;
+this.adaptedArgs = adaptedArgs;
+this.outputType = outputType;
+  }
+
+  /**
+   * Create a processor for a non-vectorizable {@link Function}.
+   */
+  public static  FallbackVectorProcessor create(
+  final Function function,
+  final List args,
+  final Expr.VectorInputBindingInspector inspector
+  )
+  {
+final List adaptedArgs = makeAdaptedArgs(args, inspector);
+return makeFallbackProcessor(
+() -> function.apply(adaptedArgs, UnusedBinding.INSTANCE),
+adaptedArgs,
+function.getOutputType(inspector, args),
+inspector
+);
+  }
+
+  /**
+   * Create a processor for a non-vectorizable {@link 
ExprMacroTable.ExprMacro}.
+   */
+  public static  FallbackVectorProcessor create(
+  final ExprMacroTable.ExprMacro macro,
+  final List args,
+  final Expr.VectorInputBindingInspector inspector
+  )
+  {
+final List adaptedArgs = makeAdaptedArgs(args, inspector);
+final Expr adaptedExpr = macro.apply(adaptedArgs);
+return makeFallbackProcessor(
+() -> adaptedExpr.eval(UnusedBinding.INSTANCE),
+adaptedArgs,
+adaptedExpr.getOutputType(inspector),
+inspector
+);
+  }
+
+  /**
+   * Helper for the two {@link #create} methods. Makes {@link AdaptedExpr} 
that can replace the original args to
+   * the {@link Expr} that requires fallback.
+   *
+   * @param args  args to the original expr
+   * @param inspector binding inspector
+   *
+   * @return list of {@link AdaptedExpr}
+   */
+  private static List makeAdaptedArgs(
+  final List args,
+  final Expr.VectorInputBindingInspector inspector
+  )
+  {
+final List adaptedArgs = new ArrayList<>(args.size());
+
+for (final Expr arg : args) {
+  adaptedArgs.add(new AdaptedExpr(arg.asVectorProcessor(inspector), arg));
+}
+
+return adaptedArgs;
+  }
+
+  /**
+   * Helper for the two {@link #create} methods.
+   *
+   * @param fn  eval function that uses the "adaptedArgs" as inputs
+   * @param adaptedArgs adapted args from {@link #makeAdaptedArgs(List, 
Expr.VectorInputBindingInspector)}
+   * @param outputType  output type of the eval from "fn"
+   * @param inspector   binding inspector
+   */
+  @SuppressWarnings({"unchecked", "rawtypes"})
+  private static  FallbackVectorProcessor makeFallbackProcessor(
+  final Supplier> fn,
+  final List adaptedArgs,
+  final ExpressionType outputType,
+  final Expr.VectorInputBindingInspector inspector
+  )
+  {
+if (outputType == null) {
+  throw DruidException.defensive("Plan has null outputType");
+} else if (outputType.equals(ExpressionType.LONG)) {
+  return (FallbackVectorProcessor) new OfLong(fn, (List) adaptedArgs, 
outputType, inspector);
+} else if (outputType.equals(ExpressionType.DOUBLE)) {
+  return 

Re: [I] Problem with array, UNNEST and JSON_PARSE (druid)

2024-06-05 Thread via GitHub


gianm closed issue #16543: Problem with array, UNNEST and JSON_PARSE
URL: https://github.com/apache/druid/issues/16543





Re: [PR] Fix capabilities reported by UnnestStorageAdapter. (druid)

2024-06-05 Thread via GitHub


gianm merged PR #16551:
URL: https://github.com/apache/druid/pull/16551





[I] Query with JSON_QUERY/JSON_VALUE using parameters across a join fails to be planned (druid)

2024-06-05 Thread via GitHub


tls-applied opened a new issue, #16560:
URL: https://github.com/apache/druid/issues/16560

   ### Description
   Sample query:
   ```sql
   SELECT
 JSON_VALUE(a.data, b.path)
   FROM (
   VALUES
 (PARSE_JSON('{"x": 1}'))
 ) a(data)
 CROSS JOIN (
   VALUES
 ('$.x')
 ) b(path)
   ```
   Relevant errors:
   ```
   2024-06-05T21:36:40,185 WARN [sql[67983b92-e41d-4933-be2e-a172b0f43148]] 
org.apache.druid.sql.calcite.planner.QueryHandler - Query not supported. Please 
check Broker logs for additional details. SQL was: SELECT
 JSON_VALUE(a.data, b.path)
   FROM
 (
   VALUES
 (PARSE_JSON('{"x": 1}'))
 ) a(data)
 CROSS JOIN (
   VALUES
 ('$.x')
 ) b(path)
   LIMIT 11 (org.apache.calcite.plan.RelOptPlanner$CannotPlanException: There 
are not enough rules to produce a node with desired properties: 
convention=DRUID, sort=[].
   Missing conversion is LogicalSort[convention: NONE -> DRUID]
   There is 1 empty subset: rel#53574:Subset#5.DRUID.[], the relevant part of 
the original plan is as follows
   53572:LogicalSort(fetch=[11])
 53570:LogicalProject(subset=[rel#53571:Subset#4.NONE.[]], 
EXPR$0=[CAST(JSON_VALUE_VARCHAR($0, $1)):VARCHAR(2000)])
   53568:LogicalJoin(subset=[rel#53569:Subset#3.NONE.[]], condition=[true], 
joinType=[inner])
 53565:LogicalProject(subset=[rel#53566:Subset#1.NONE.[]], 
EXPR$0=[PARSE_JSON('{"x": 1}')])
   53537:LogicalValues(subset=[rel#53564:Subset#0.NONE.[0]], tuples=[[{ 
0 }]])
 53539:LogicalValues(subset=[rel#53567:Subset#2.NONE.[0]], tuples=[[{ 
'$.x' }]])
   
   Root: rel#53574:Subset#5.DRUID.[]
   Original rel:
   LogicalSort(subset=[rel#53574:Subset#5.DRUID.[]], fetch=[11]): rowcount = 
1.0, cumulative cost = {1.0 rows, 4.0 cpu, 0.0 io}, id = 53572
 LogicalProject(subset=[rel#53571:Subset#4.NONE.[]], 
EXPR$0=[CAST(JSON_VALUE_VARCHAR($0, $1)):VARCHAR(2000)]): rowcount = 1.0, 
cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 53570
   LogicalJoin(subset=[rel#53569:Subset#3.NONE.[]], condition=[true], 
joinType=[inner]): rowcount = 1.0, cumulative cost = {1.0 rows, 0.0 cpu, 0.0 
io}, id = 53568
 LogicalProject(subset=[rel#53566:Subset#1.NONE.[]], 
EXPR$0=[PARSE_JSON('{"x": 1}')]): rowcount = 1.0, cumulative cost = {1.0 rows, 
1.0 cpu, 0.0 io}, id = 53565
   LogicalValues(subset=[rel#53564:Subset#0.NONE.[0]], tuples=[[{ 0 
}]]): rowcount = 1.0, cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 53537
 LogicalValues(subset=[rel#53567:Subset#2.NONE.[0]], tuples=[[{ '$.x' 
}]]): rowcount = 1.0, cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 53539
   )
   ```
   Is this expected to fail and is there a way to work around it?
   
   ### Affected Version
   26.0.0





Re: [PR] Add interface method for returning canonical lookup name (druid)

2024-06-05 Thread via GitHub


abhishekrb19 commented on PR #16557:
URL: https://github.com/apache/druid/pull/16557#issuecomment-2150998873

   Merging this PR. The two suggestions from @LakshSingla can be addressed in a 
follow-up. Thanks for the fix, @Akshat-Jain!





Re: [PR] Add interface method for returning canonical lookup name (druid)

2024-06-05 Thread via GitHub


abhishekrb19 merged PR #16557:
URL: https://github.com/apache/druid/pull/16557





Re: [PR] Add interface method for returning canonical lookup name (druid)

2024-06-05 Thread via GitHub


abhishekrb19 commented on code in PR #16557:
URL: https://github.com/apache/druid/pull/16557#discussion_r1628399933


##
processing/src/main/java/org/apache/druid/query/lookup/LookupExtractorFactoryContainerProvider.java:
##
@@ -40,4 +40,9 @@ public interface LookupExtractorFactoryContainerProvider
* Returns a lookup container for the provided lookupName, if it exists.
*/
  Optional<LookupExtractorFactoryContainer> get(String lookupName);
+
+  /**
+   * Returns the canonical lookup name from a lookup name.
+   */
+  String getCanonicalLookupName(String lookupName);

Review Comment:
   @LakshSingla, good points. I think we need to expand the Javadoc to clarify the intent of the new method and/or rename it accordingly. We could also call it `getLookupName(String name)`, but that's a bit ambiguous imo. Thoughts on the naming?
   
   As far as 1 is concerned, the interface isn't annotated `@PublicAPI` or 
`@ExtensionsPoint`. Also, this PR https://github.com/apache/druid/pull/9281/ 
changed the method signature and added a new interface method without default 
implementations. Following that pattern, I believe it's safe to add new methods 
without providing a default implementation. In general, my understanding is 
that for custom extensions that directly use/extend interfaces that are not 
public APIs (i.e., not annotated with `PublicAPI` or `ExtensionsPoint`), the 
onus is on the developers maintaining custom implementations.
   
   For 2, `LookupSchema` is powering the SQL view for the lookups configured in 
the system. For consistency, I think it'd make sense to also call the new 
function there as you pointed out. I'm okay with doing that as a follow up too.






Re: [PR] Only create container in `AzureStorage` for write operations (druid)

2024-06-05 Thread via GitHub


amaechler commented on code in PR #16558:
URL: https://github.com/apache/druid/pull/16558#discussion_r1628461028


##
extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureStorage.java:
##
@@ -56,255 +54,391 @@
  */
 public class AzureStorage
 {
-
   // Default value from Azure library
   private static final int DELTA_BACKOFF_MS = 30_000;
 
-  // 
https://learn.microsoft.com/en-us/rest/api/storageservices/blob-batch?tabs=microsoft-entra-id
+  // 
https://learn.microsoft.com/en-us/rest/api/storageservices/blob-batch#request-body
   private static final Integer MAX_MULTI_OBJECT_DELETE_SIZE = 256;
 
-
-  private static final Logger log = new Logger(AzureStorage.class);
+  private static final Logger LOG = new Logger(AzureStorage.class);
 
   private final AzureClientFactory azureClientFactory;
   private final String defaultStorageAccount;
 
   public AzureStorage(
-  AzureClientFactory azureClientFactory,
-  @Nullable String defaultStorageAccount
+  final AzureClientFactory azureClientFactory,
+  final String defaultStorageAccount
   )
   {
 this.azureClientFactory = azureClientFactory;
 this.defaultStorageAccount = defaultStorageAccount;
   }
 
-  public List emptyCloudBlobDirectory(final String containerName, 
final String virtualDirPath)
+  /**
+   * See {@link AzureStorage#emptyCloudBlobDirectory(String, String, Integer)} 
for details.
+   */
+  public List emptyCloudBlobDirectory(final String containerName, 
@Nullable final String prefix)

Review Comment:
   Let me add a comment; that's a good idea. I found it easier to understand the code when there was a bit less indirection.






Re: [PR] Only create container in `AzureStorage` for write operations (druid)

2024-06-05 Thread via GitHub


amaechler commented on code in PR #16558:
URL: https://github.com/apache/druid/pull/16558#discussion_r1628458947


##
extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureStorage.java:
##
@@ -56,255 +54,391 @@
  */
 public class AzureStorage

Review Comment:
   That's a valid question.
   
   - `getBlobContainerClient` does nothing yet. It simply creates a client that 
will be used for an API call later.
   - `createBlobContainerIfNotExists` first 
[calls](https://github.com/Azure/azure-sdk-for-java/blob/5648561dc792a993ec0fb4b041d76dc64971b9bb/sdk/storage/azure-storage-blob/src/main/java/com/azure/storage/blob/BlobServiceClient.java#L206-L211)
 `getBlobContainerClient`, and then right away executes an API call 
(`createIfNotExistsWithResponse`) for that client.
   
   In the current implementation, if you _configure_ a container that does not exist, the call paths I found invoke `azureStorage.getBlockBlobExists` first, which then simply returns `false` (this happens for streaming task files, or from the storage layer).
   
   






Re: [PR] Optimise indexer resource usage based on CPU profile (druid)

2024-06-05 Thread via GitHub


clintropolis commented on code in PR #16517:
URL: https://github.com/apache/druid/pull/16517#discussion_r1628457954


##
processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java:
##
@@ -132,12 +142,14 @@ public EncodedKeyComponent 
processRowValsToUnsortedEncodedKeyComponent(@N
     } else if (dimValues instanceof byte[]) {
       encodedDimensionValues =
           new int[]{dimLookup.add(emptyToNullIfNeeded(StringUtils.encodeBase64String((byte[]) dimValues)))};
+      dictionaryChanged = true;
     } else {
       encodedDimensionValues = new int[]{dimLookup.add(emptyToNullIfNeeded(dimValues))};
+      dictionaryChanged = true;
     }
 
     // If dictionary size has changed, the sorted lookup is no longer valid.
-    if (oldDictSize != dimLookup.size()) {
+    if (dictionaryChanged) {

Review Comment:
   >so that if its able to recall an earlier value that's actually very 
important ...this change will destroy the sortedLookup cache even in case all 
calls to add were hits.
   
   The lookup is only sorted when persisting a segment, so `add` shouldn't 
really happen after the dictionary is sorted. I would guess this is here as a 
safeguard.
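
   A stripped-down sketch of the caching pattern being discussed, assuming a plain in-memory dictionary (this is not the actual `StringDimensionIndexer`): the sorted view is cached and only invalidated when `add` actually grows the dictionary, which is the difference between the `oldDictSize` comparison and an unconditional `dictionaryChanged` flag.

   ```java
   import java.util.ArrayList;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;

   // Illustrative only: a tiny dictionary that caches its sorted view and
   // invalidates the cache only when add() actually inserts a new value.
   public class CachingDictionary
   {
     private final Map<String, Integer> idLookup = new HashMap<>();
     private final List<String> values = new ArrayList<>();
     private List<String> sortedValuesCache = null;

     public int add(String value)
     {
       Integer existing = idLookup.get(value);
       if (existing != null) {
         return existing;          // hit: dictionary unchanged, cached sorted view stays valid
       }
       int id = values.size();
       values.add(value);
       idLookup.put(value, id);
       sortedValuesCache = null;   // miss: dictionary grew, sorted view is stale
       return id;
     }

     public List<String> getSortedValues()
     {
       if (sortedValuesCache == null) {
         sortedValuesCache = new ArrayList<>(values);
         sortedValuesCache.sort(null); // natural ordering
       }
       return sortedValuesCache;
     }
   }
   ```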






Re: [PR] Only create container in `AzureStorage` for write operations (druid)

2024-06-05 Thread via GitHub


georgew5656 commented on code in PR #16558:
URL: https://github.com/apache/druid/pull/16558#discussion_r1628455989


##
extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureStorage.java:
##
@@ -56,255 +54,391 @@
  */
 public class AzureStorage

Review Comment:
   would it be possible to log a warning somewhere if we're using azure as deep 
storage and the container doesn't exist? where "using azure as deep 
storage"=getAzureStorageContainer is called (i don't think 
getAzureStorageContainer would be called if we were just loading the azure 
extension and using it to ingest data from azure). or maybe even try to create 
the container at this point instead?






Re: [PR] Only create container in `AzureStorage` for write operations (druid)

2024-06-05 Thread via GitHub


georgew5656 commented on PR #16558:
URL: https://github.com/apache/druid/pull/16558#issuecomment-2150976002

   i think the reason the azure extension supports creating its own container is that the concept of a container does not have a 1-to-1 analogue in s3 or google storage. for example, the extension requires that the storage account exist (where a storage account is a direct analogue to an s3 bucket)





Re: [PR] Only create container in `AzureStorage` for write operations (druid)

2024-06-05 Thread via GitHub


georgew5656 commented on code in PR #16558:
URL: https://github.com/apache/druid/pull/16558#discussion_r1628442038


##
extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureStorage.java:
##
@@ -56,255 +54,391 @@
  */
 public class AzureStorage
 {
-
   // Default value from Azure library
   private static final int DELTA_BACKOFF_MS = 30_000;
 
-  // 
https://learn.microsoft.com/en-us/rest/api/storageservices/blob-batch?tabs=microsoft-entra-id
+  // 
https://learn.microsoft.com/en-us/rest/api/storageservices/blob-batch#request-body
   private static final Integer MAX_MULTI_OBJECT_DELETE_SIZE = 256;
 
-
-  private static final Logger log = new Logger(AzureStorage.class);
+  private static final Logger LOG = new Logger(AzureStorage.class);
 
   private final AzureClientFactory azureClientFactory;
   private final String defaultStorageAccount;
 
   public AzureStorage(
-  AzureClientFactory azureClientFactory,
-  @Nullable String defaultStorageAccount
+  final AzureClientFactory azureClientFactory,
+  final String defaultStorageAccount
   )
   {
 this.azureClientFactory = azureClientFactory;
 this.defaultStorageAccount = defaultStorageAccount;
   }
 
-  public List emptyCloudBlobDirectory(final String containerName, 
final String virtualDirPath)
+  /**
+   * See {@link AzureStorage#emptyCloudBlobDirectory(String, String, Integer)} 
for details.
+   */
+  public List emptyCloudBlobDirectory(final String containerName, 
@Nullable final String prefix)

Review Comment:
   i feel like it might be helpful to have two helper functions here (getContainerClient, getOrCreateContainerClient) with comments explaining what they are used for (read-only or r/w), so that anyone implementing a new api call in this class can figure out which one to use.
   
   alternatively maybe just a comment at the top of the class?
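
   A sketch of the two helpers suggested above, using the Azure SDK types this class already wraps (the method names follow the suggestion and may not match what was finally committed):

   ```java
   import com.azure.storage.blob.BlobContainerClient;
   import com.azure.storage.blob.BlobServiceClient;

   public class ContainerClientHelpersSketch
   {
     private final BlobServiceClient blobServiceClient;

     public ContainerClientHelpersSketch(BlobServiceClient blobServiceClient)
     {
       this.blobServiceClient = blobServiceClient;
     }

     /**
      * For read-only operations (get, list, exists, delete): never creates the container,
      * so it works with SAS tokens that lack write permission.
      */
     BlobContainerClient getContainerClient(String containerName)
     {
       return blobServiceClient.getBlobContainerClient(containerName);
     }

     /**
      * For write operations (uploads, output streams): creates the container on first use,
      * which requires write permission.
      */
     BlobContainerClient getOrCreateContainerClient(String containerName)
     {
       BlobContainerClient client = blobServiceClient.getBlobContainerClient(containerName);
       client.createIfNotExists();
       return client;
     }
   }
   ```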






Re: [PR] Only create container in `AzureStorage` for write operations (druid)

2024-06-05 Thread via GitHub


georgew5656 commented on code in PR #16558:
URL: https://github.com/apache/druid/pull/16558#discussion_r1628439982


##
extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureStorage.java:
##
@@ -56,255 +54,391 @@
  */
 public class AzureStorage

Review Comment:
   thinking about it a bit more, the first write operation is always triggered 
by a task, either a msq task pushing intermediate segments, or the 
middleManager/overlord pushing task logs.
   
   this seems ok to me, the only class of things i can think of failing is something like the task logs cleaner. but if that throws an exception it would be fine since no tasks would have been run.






Re: [PR] Only create container in `AzureStorage` for write operations (druid)

2024-06-05 Thread via GitHub


georgew5656 commented on code in PR #16558:
URL: https://github.com/apache/druid/pull/16558#discussion_r1628430378


##
extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureStorage.java:
##
@@ -56,255 +54,391 @@
  */
 public class AzureStorage

Review Comment:
   to be specific, what does getBlobContainerClient do if the container doesn't 
exist?






Re: [PR] Only create container in `AzureStorage` for write operations (druid)

2024-06-05 Thread via GitHub


georgew5656 commented on code in PR #16558:
URL: https://github.com/apache/druid/pull/16558#discussion_r1628428060


##
extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureStorage.java:
##
@@ -56,255 +54,391 @@
  */
 public class AzureStorage

Review Comment:
   is it possible for a cluster using azure as deep storage to first make a read-only call to azure (therefore not creating the container) and for the function to throw an exception instead of creating the container and returning an empty list or something?
   
   like e.g. if i start a new cluster in azure with no container, what's the first api call that gets made, and what happens if it's a read-only call? i think i would be more concerned with error handling; if we return a good exception that says "the container doesn't exist" or something like this, then the operator could just create the container at that point.
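
   On the error-handling point, a hedged sketch of surfacing a clearer error when a read-only call hits a missing container (the exception type and error code come from the Azure SDK; the wrapping message and method are illustrative):

   ```java
   import com.azure.storage.blob.BlobContainerClient;
   import com.azure.storage.blob.models.BlobErrorCode;
   import com.azure.storage.blob.models.BlobItem;
   import com.azure.storage.blob.models.BlobStorageException;

   import java.util.ArrayList;
   import java.util.List;

   public class ReadPathErrorHandlingSketch
   {
     // Lists blobs without creating the container; rethrows a clearer error if it is missing.
     static List<String> listBlobNames(BlobContainerClient containerClient)
     {
       try {
         List<String> names = new ArrayList<>();
         for (BlobItem item : containerClient.listBlobs()) {
           names.add(item.getName());
         }
         return names;
       }
       catch (BlobStorageException e) {
         if (BlobErrorCode.CONTAINER_NOT_FOUND.equals(e.getErrorCode())) {
           throw new IllegalStateException(
               "Azure container [" + containerClient.getBlobContainerName()
               + "] does not exist; create it before use.",
               e
           );
         }
         throw e;
       }
     }
   }
   ```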






[PR] Disable event hubs when kafka extensions isn't loaded (druid)

2024-06-05 Thread via GitHub


amaechler opened a new pull request, #16559:
URL: https://github.com/apache/druid/pull/16559

   Fixes #10342.
   
   ### Description
   
   This PR disables the `Azure Event Hubs` tile in the web-console if the kafka 
extension isn't loaded.
   
   
   
   - [x] been self-reviewed.
   





Re: [PR] Only create container in `AzureStorage` for write operations (druid)

2024-06-05 Thread via GitHub


georgew5656 commented on code in PR #16558:
URL: https://github.com/apache/druid/pull/16558#discussion_r1628400526


##
extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureDataSegmentKiller.java:
##
@@ -43,7 +43,6 @@
 public class AzureDataSegmentKiller implements DataSegmentKiller
 {
   private static final Logger log = new Logger(AzureDataSegmentKiller.class);
-  private static final Integer MAX_MULTI_OBJECT_DELETE_SIZE = 256; // 
https://learn.microsoft.com/en-us/rest/api/storageservices/blob-batch?tabs=microsoft-entra-id

Review Comment:
   it looks like this logic was moved to AzureStorage and this was never 
cleaned up






Re: [PR] Add interface method for returning canonical lookup name (druid)

2024-06-05 Thread via GitHub


abhishekrb19 commented on code in PR #16557:
URL: https://github.com/apache/druid/pull/16557#discussion_r1628399933


##
processing/src/main/java/org/apache/druid/query/lookup/LookupExtractorFactoryContainerProvider.java:
##
@@ -40,4 +40,9 @@ public interface LookupExtractorFactoryContainerProvider
* Returns a lookup container for the provided lookupName, if it exists.
*/
  Optional<LookupExtractorFactoryContainer> get(String lookupName);
+
+  /**
+   * Returns the canonical lookup name from a lookup name.
+   */
+  String getCanonicalLookupName(String lookupName);

Review Comment:
   @LakshSingla, good points. I think we need to expand the Javadoc to clarify the intent of the new method and/or rename it accordingly. We could also call it `getLookupName(String name)`, but that's a bit ambiguous imo. Thoughts on the naming?
   
   As far as 1 is concerned, the interface isn't annotated `@PublicAPI` or 
`@ExtensionsPoint`. Also, this PR https://github.com/apache/druid/pull/9281/ 
changed the method signature and added a new interface method without default 
implementations. Following that pattern, I believe it's safe to add new methods 
without providing a default implementation. In general, my understanding is 
that for custom extensions that directly use/extend interfaces that are not 
public APIs (i.e., not annotated with `PublicAPI` or `ExtensionsPoint`), the 
onus is on the developers maintaining custom implementations.
   
   For 2, `LookupSchema` is powering the SQL view for the lookups configured in 
the system. I think it'd make sense to also call the new function there as you 
pointed out. I'm okay with doing that as a follow up too.






Re: [PR] Only create container in `AzureStorage` for write operations (druid)

2024-06-05 Thread via GitHub


amaechler commented on code in PR #16558:
URL: https://github.com/apache/druid/pull/16558#discussion_r1628375979


##
extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/output/AzureStorageConnector.java:
##
@@ -196,7 +196,7 @@ public Iterator listDir(String dirName) throws 
IOException
 final String prefixBasePath = objectPath(dirName);
 List paths;
 try {
-  paths = azureStorage.listDir(config.getContainer(), prefixBasePath, 
config.getMaxRetry());
+  paths = azureStorage.listBlobs(config.getContainer(), prefixBasePath, 
null, config.getMaxRetry());

Review Comment:
   Renamed `listDir` to use more specific Azure terminology inside the 
`Storage` wrapper.



##
extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureUtils.java:
##
@@ -37,8 +37,6 @@
  */
 public class AzureUtils
 {
-
-  public static final String DEFAULT_AZURE_ENDPOINT_SUFFIX = "core.windows.net";

Review Comment:
   Unused.



##
extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureDataSegmentKiller.java:
##
@@ -43,7 +43,6 @@
 public class AzureDataSegmentKiller implements DataSegmentKiller
 {
   private static final Logger log = new Logger(AzureDataSegmentKiller.class);
-  private static final Integer MAX_MULTI_OBJECT_DELETE_SIZE = 256; // 
https://learn.microsoft.com/en-us/rest/api/storageservices/blob-batch?tabs=microsoft-entra-id

Review Comment:
   This was not used.






[PR] Only create container in `AzureStorage` for write operations (druid)

2024-06-05 Thread via GitHub


amaechler opened a new pull request, #16558:
URL: https://github.com/apache/druid/pull/16558

   - **Remove unused constants**
   - **Refactor getBlockBlobLength**
   - **Better link**
   - **Upper-case log**
   - **Mark defaultStorageAccount nullable**
   - **Do not always create a new container if it doesn't exist**
   - **Add lots of comments, group methods**
   - **Revert "Mark defaultStorageAccount nullable"**
   
   ### Description
   
   When ingesting data from Azure blob storage, you can currently only do so if 
your SAS token has `write` permissions, even though this is not required to 
list and read blob objects.
   
   The source of this bug lies in the fact that all calls in `AzureStorage` get 
a container client by calling `createBlobContainerIfNotExists`. This SDK method 
actually performs an API call that needs `write` permissions.
   
    Bugfix
   
   To fix this issue, I updated the calls getting container clients in 
`AzureStorage` to only use `createBlobContainerIfNotExists` for write 
operations, where it makes sense.
   
   Specifically, `getBlockBlobOutputStream` and `uploadBlockBlob` will continue 
to create a container if it doesn't exist. This is useful when using Azure blob 
storage for deep storage and/or task logs.
   
   > That said, this behaviour is somewhat unique to the Azure extension. In S3 
for example, I believe the bucket needs to exist. Maybe it would make sense to 
deprecate this behaviour in the future to be more consistent across Druid, and 
require the container to exist.
   
   All the other calls (read blob storage, check if blob exists, list blobs in 
the container, delete blobs) will require the container to exist. I think this 
makes sense given the nature of these calls.
   
    Other changes
   
   I also cleaned up the `AzureStorage` class so it can be maintained more easily in the future. Specifically, I
   
   - used Azure terminology throughout the class (e.g. `prefix` vs `virtualDirPath`)
   - added `@Nullable` annotations where parameters are optional
   - added a whole bunch of Javadoc
   
   
   
   # Key changed/added classes in this PR
   
* `AzureStorage`
   
   
   
   This PR has:
   
   - [ ] been self-reviewed.
   - [x] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [x] been tested in a test Druid cluster.
   





Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)

2024-06-05 Thread via GitHub


kfaraz commented on code in PR #16481:
URL: https://github.com/apache/druid/pull/16481#discussion_r1628226834


##
extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/S3UploadManager.java:
##
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.storage.s3.output;
+
+import com.amazonaws.services.s3.model.UploadPartRequest;
+import com.amazonaws.services.s3.model.UploadPartResult;
+import com.google.common.annotations.VisibleForTesting;
+import com.google.inject.Inject;
+import org.apache.druid.guice.ManageLifecycle;
+import org.apache.druid.java.util.common.RetryUtils;
+import org.apache.druid.java.util.common.concurrent.Execs;
+import org.apache.druid.java.util.common.lifecycle.LifecycleStart;
+import org.apache.druid.java.util.common.lifecycle.LifecycleStop;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.druid.storage.s3.S3Utils;
+import org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3;
+import org.apache.druid.utils.RuntimeInfo;
+
+import java.io.File;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Future;
+
+/**
+ * This class manages uploading files to S3 in chunks, while ensuring that the
+ * number of chunks currently present on local disk does not exceed a specific limit.
+ */
+@ManageLifecycle
+public class S3UploadManager
+{
+  private final ExecutorService uploadExecutor;
+
+  private static final Logger log = new Logger(S3UploadManager.class);
+
+  @Inject
+  public S3UploadManager(S3OutputConfig s3OutputConfig, S3ExportConfig s3ExportConfig, RuntimeInfo runtimeInfo)
+  {
+    int poolSize = Math.max(4, runtimeInfo.getAvailableProcessors());
+    int maxNumConcurrentChunks = computeMaxNumConcurrentChunks(s3OutputConfig, s3ExportConfig);
+    this.uploadExecutor = createExecutorService(poolSize, maxNumConcurrentChunks);
+    log.info("Initialized executor service for S3 multipart upload with pool size [%d] and work queue capacity [%d]",
+             poolSize, maxNumConcurrentChunks);
+  }
+
+  /**
+   * Computes the maximum number of concurrent chunks for an S3 multipart upload.
+   * We want to determine the maximum number of concurrent chunks on disk based on the maximum value of chunkSize
+   * between the 2 configs: S3OutputConfig and S3ExportConfig.
+   *
+   * @param s3OutputConfig  The S3 output configuration, which may specify a custom chunk size.
+   * @param s3ExportConfig  The S3 export configuration, which may also specify a custom chunk size.
+   * @return The maximum number of concurrent chunks.
+   */
+  @VisibleForTesting
+  int computeMaxNumConcurrentChunks(S3OutputConfig s3OutputConfig, S3ExportConfig s3ExportConfig)
+  {
+    long chunkSize = S3OutputConfig.S3_MULTIPART_UPLOAD_MIN_PART_SIZE_BYTES;
+    if (s3OutputConfig != null && s3OutputConfig.getChunkSize() != null) {
+      chunkSize = Math.max(chunkSize, s3OutputConfig.getChunkSize());
+    }
+    if (s3ExportConfig != null && s3ExportConfig.getChunkSize() != null) {
+      chunkSize = Math.max(chunkSize, s3ExportConfig.getChunkSize().getBytes());
+    }
+
+    return (int) (S3OutputConfig.S3_MULTIPART_UPLOAD_MAX_PART_SIZE_BYTES / chunkSize);

Review Comment:
   5GB doesn't seem like a bad default but limiting it to 5GB even when we 
might have much more disk available seems weird. Also, if 5GB is the limit we 
want to use, we should define a separate constant rather than reuse the 
constant for max size.
   
   > we should use the overall disk size available for this calculation
   
   Just to clarify, I have not suggested using the entire disk space for 
storing chunks, rather the portion actually allocated for this purpose. That 
allocation may either be a hard-coded value, a fixed percentage of the overall 
disk space or even a config.
   
   In any case, that amount of disk space should be passed as an argument to 
the `computeMaxNumChunksOnDisk` method.
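
   A sketch of that suggestion with the disk budget passed in explicitly (the 5GB default and the constant name are placeholders for whatever value is eventually chosen):

   ```java
   public class ChunkBudgetSketch
   {
     // Placeholder default: how much local disk we allow for pending chunks.
     private static final long DEFAULT_CHUNK_DISK_BUDGET_BYTES = 5L * 1024 * 1024 * 1024;

     /**
      * Maximum number of chunks that may sit on local disk at once, given the
      * disk space allocated for chunk storage and the configured chunk size.
      */
     static int computeMaxNumChunksOnDisk(long allocatedDiskBytes, long chunkSizeBytes)
     {
       if (chunkSizeBytes <= 0) {
         throw new IllegalArgumentException("chunkSizeBytes must be positive");
       }
       // At least one chunk, otherwise no upload could ever make progress.
       return (int) Math.max(1, allocatedDiskBytes / chunkSizeBytes);
     }

     public static void main(String[] args)
     {
       // e.g. 5 GiB budget with 100 MiB chunks -> 51 concurrent chunks on disk
       System.out.println(computeMaxNumChunksOnDisk(DEFAULT_CHUNK_DISK_BUDGET_BYTES, 100L * 1024 * 1024));
     }
   }
   ```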




Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)

2024-06-05 Thread via GitHub


kfaraz commented on code in PR #16481:
URL: https://github.com/apache/druid/pull/16481#discussion_r1628218203


##
extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/RetryableS3OutputStream.java:
##
@@ -269,52 +213,74 @@ public void close() throws IOException
 // Closeables are closed in LIFO order
 closer.register(() -> {
   // This should be emitted as a metric
+  long totalChunkSize = (currentChunk.id - 1) * chunkSize + currentChunk.length();

Review Comment:
   It's fine for now. We can worry about it when we really need to do something 
with the chunk. For now, let's just stick with the list.






Re: [PR] Docs: Remove circular link (druid)

2024-06-05 Thread via GitHub


317brian merged PR #16553:
URL: https://github.com/apache/druid/pull/16553





Re: [PR] Add interface method for returning canonical lookup name (druid)

2024-06-05 Thread via GitHub


Akshat-Jain commented on code in PR #16557:
URL: https://github.com/apache/druid/pull/16557#discussion_r1628194891


##
processing/src/main/java/org/apache/druid/query/lookup/LookupExtractorFactoryContainerProvider.java:
##
@@ -40,4 +40,9 @@ public interface LookupExtractorFactoryContainerProvider
* Returns a lookup container for the provided lookupName, if it exists.
*/
  Optional<LookupExtractorFactoryContainer> get(String lookupName);
+
+  /**
+   * Returns the canonical lookup name from a lookup name.
+   */
+  String getCanonicalLookupName(String lookupName);

Review Comment:
   @LakshSingla 
   I had a discussion with @abhishekrb19 about the naming of this method. We 
concluded on getCanonicalLookupName().
   
   Regarding the other comments:
   1. I don't see the other methods marked as PublicApi either, so this seems like an issue unrelated to this PR? Thoughts?
   2. This would be used in a custom extension. I'll raise a future PR once this merges, where it will be clearer. But in general, if in the future we add a lookup name with special syntax, then this would be needed.






Re: [PR] Add interface method for returning canonical lookup name (druid)

2024-06-05 Thread via GitHub


Akshat-Jain commented on code in PR #16557:
URL: https://github.com/apache/druid/pull/16557#discussion_r1628194891


##
processing/src/main/java/org/apache/druid/query/lookup/LookupExtractorFactoryContainerProvider.java:
##
@@ -40,4 +40,9 @@ public interface LookupExtractorFactoryContainerProvider
* Returns a lookup container for the provided lookupName, if it exists.
*/
  Optional<LookupExtractorFactoryContainer> get(String lookupName);
+
+  /**
+   * Returns the canonical lookup name from a lookup name.
+   */
+  String getCanonicalLookupName(String lookupName);

Review Comment:
   @LakshSingla 
   I had a discussion with @abhishekrb19 about the naming of this method. We 
concluded on getCanonicalLookupName().
   
   Regarding the other comments:
   1. I don't see the other methods marked as PublicApi either, so this seems like an issue unrelated to this PR? Thoughts?
   2. This would be used in an Imply extension. I'll raise an Imply PR once this merges, where it will be clearer. But in general, if in the future we add a lookup name with special syntax, then this would be needed.






Re: [PR] Add interface method for returning canonical lookup name (druid)

2024-06-05 Thread via GitHub


LakshSingla commented on code in PR #16557:
URL: https://github.com/apache/druid/pull/16557#discussion_r1628189672


##
processing/src/main/java/org/apache/druid/query/lookup/LookupExtractorFactoryContainerProvider.java:
##
@@ -40,4 +40,9 @@ public interface LookupExtractorFactoryContainerProvider
* Returns a lookup container for the provided lookupName, if it exists.
*/
   Optional get(String lookupName);
+
+  /**
+   * Returns the canonical lookup name from a lookup name.
+   */
+  String getCanonicalLookupName(String lookupName);

Review Comment:
   naming nit:
   1. This can be getLookupName(String name). Canonical is confusing, and isn't 
defined in the javadoc here. See point (2) below as well. 
   
   other comments:
   1. This method isn't marked as `PublicApi`, but custom extensions can 
extend it for their own lookup implementations. Please add a default implementation 
for it (a minimal sketch follows this comment).
   2. When should this method be called? As of now, this is a very specialized 
method that is called in `QueryLookupOperatorConversion`. Perhaps it should be 
called out and named as such. As a developer, it is unclear to me when I must 
call it and when not. For example, why am I not calling it at 
https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/schema/LookupSchema.java#L61?
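A minimal, hypothetical sketch of the default implementation being requested here; `Container` stands in for `LookupExtractorFactoryContainer`, and treating every name as already canonical is an assumption, not the code from PR #16557:

```
import java.util.Optional;

// Hypothetical sketch only; not the code from PR #16557.
interface LookupProvider<Container>
{
  Optional<Container> get(String lookupName);

  // Default: a lookup name is its own canonical name. Implementations that
  // support special lookup-name syntax can override this to normalize it.
  default String getCanonicalLookupName(String lookupName)
  {
    return lookupName;
  }
}
```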
 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Add interface method for returning canonical lookup name (druid)

2024-06-05 Thread via GitHub


LakshSingla commented on code in PR #16557:
URL: https://github.com/apache/druid/pull/16557#discussion_r1628189672


##
processing/src/main/java/org/apache/druid/query/lookup/LookupExtractorFactoryContainerProvider.java:
##
@@ -40,4 +40,9 @@ public interface LookupExtractorFactoryContainerProvider
* Returns a lookup container for the provided lookupName, if it exists.
*/
   Optional get(String lookupName);
+
+  /**
+   * Returns the canonical lookup name from a lookup name.
+   */
+  String getCanonicalLookupName(String lookupName);

Review Comment:
   naming nit:
   1. This can be getLookupName(String name). Canonical is confusing, and isn't 
defined in the javadocs. 
   
   other comments:
   1. This method isn't marked as `PublicApi`, but custom extensions can 
extend it for their own lookup implementations. Please add a default implementation 
for it.
   2. When should this method be called? As of now, this is a very specialized 
method that is called in `QueryLookupOperatorConversion`. Perhaps it should be 
called out and named as such. As a developer, it is unclear to me when I must 
call it and when not. For example, why am I not calling it at 
https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/schema/LookupSchema.java#L61?
 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Add interface method for returning canonical lookup name (druid)

2024-06-05 Thread via GitHub


abhishekrb19 commented on code in PR #16557:
URL: https://github.com/apache/druid/pull/16557#discussion_r1628157297


##
processing/src/main/java/org/apache/druid/query/lookup/LookupExtractorFactoryContainerProvider.java:
##
@@ -40,4 +40,9 @@ public interface LookupExtractorFactoryContainerProvider
* Returns a lookup container for the provided lookupName, if it exists.
*/
   Optional get(String lookupName);
+
+  /**
+   * Returns the canonical lookup name from a lookup name.

Review Comment:
   ```suggestion
  * Returns the canonical lookup name from a given lookup name, if special 
syntax exists.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[PR] Add interface method for returning canonical lookup name (druid)

2024-06-05 Thread via GitHub


Akshat-Jain opened a new pull request, #16557:
URL: https://github.com/apache/druid/pull/16557

   ### Description
   
   This PR adds support for returning the canonical lookup name in the 
`LookupExtractorFactoryContainerProvider` interface. This allows 
implementations to return a different canonical name for a lookup if needed.
   
   This PR has:
   
   - [x] been self-reviewed.
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Do not pass centralized schema config to sub-tasks of index_parallel to ensure backwards compatibility (druid)

2024-06-05 Thread via GitHub


kfaraz commented on code in PR #16556:
URL: https://github.com/apache/druid/pull/16556#discussion_r1628120211


##
indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/PartialGenericSegmentMergeTask.java:
##
@@ -67,7 +67,7 @@ public PartialGenericSegmentMergeTask(
   @JsonProperty("numAttempts") final int numAttempts, // zero-based 
counting
   @JsonProperty("spec") final PartialSegmentMergeIngestionSpec 
ingestionSchema,
   @JsonProperty("context") final Map context,
-  @JsonProperty("centralizedDatasourceSchemaConfig") 
CentralizedDatasourceSchemaConfig centralizedDatasourceSchemaConfig,
+  @JsonProperty("centralizedDatasourceSchemaConfig") @Nullable 
CentralizedDatasourceSchemaConfig centralizedDatasourceSchemaConfig,

Review Comment:
   All other usages of `CentralizedDatasourceSchemaConfig` are injected ones 
which are bound using the config path `druid.centralizedDatasourceSchema`. So 
they should all be non-null.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Handle null values of centralized schema config in PartialMergeTask for backwards compatibility (druid)

2024-06-05 Thread via GitHub


cryptoe commented on code in PR #16556:
URL: https://github.com/apache/druid/pull/16556#discussion_r1628076745


##
indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/PartialGenericSegmentMergeTask.java:
##
@@ -67,7 +67,7 @@ public PartialGenericSegmentMergeTask(
   @JsonProperty("numAttempts") final int numAttempts, // zero-based 
counting
   @JsonProperty("spec") final PartialSegmentMergeIngestionSpec 
ingestionSchema,
   @JsonProperty("context") final Map context,
-  @JsonProperty("centralizedDatasourceSchemaConfig") 
CentralizedDatasourceSchemaConfig centralizedDatasourceSchemaConfig,
+  @JsonProperty("centralizedDatasourceSchemaConfig") @Nullable 
CentralizedDatasourceSchemaConfig centralizedDatasourceSchemaConfig,

Review Comment:
   Can we do a quick grep for CentralizedDatasourceSchemaConfig and see if we 
have nullables everywhere?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Handle null values of centralized schema config in PartialMergeTask for backwards compatibility (druid)

2024-06-05 Thread via GitHub


cryptoe commented on code in PR #16556:
URL: https://github.com/apache/druid/pull/16556#discussion_r1628076745


##
indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/PartialGenericSegmentMergeTask.java:
##
@@ -67,7 +67,7 @@ public PartialGenericSegmentMergeTask(
   @JsonProperty("numAttempts") final int numAttempts, // zero-based 
counting
   @JsonProperty("spec") final PartialSegmentMergeIngestionSpec 
ingestionSchema,
   @JsonProperty("context") final Map context,
-  @JsonProperty("centralizedDatasourceSchemaConfig") 
CentralizedDatasourceSchemaConfig centralizedDatasourceSchemaConfig,
+  @JsonProperty("centralizedDatasourceSchemaConfig") @Nullable 
CentralizedDatasourceSchemaConfig centralizedDatasourceSchemaConfig,

Review Comment:
   Can we do a quick grep for CentralizedDatasourceSchemaConfig and see if we have 
nullables everywhere?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [I] [DRAFT] 30.0.0 release notes (druid)

2024-06-05 Thread via GitHub


frankgrimes97 commented on issue #16505:
URL: https://github.com/apache/druid/issues/16505#issuecomment-2150427461

   @adarshsanjeev Could the following in-progress bugfix 
https://github.com/apache/druid/pull/16550 be considered for `30.0.0`? Not sure 
where you guys are in terms of release timeline/cadence, but we'd ideally like 
to test/verify it on perhaps `30.0.0-rc3` instead of waiting for it to land in 
`31.0.0` in August/September.
   We'd also be open to having it considered for a possible `29.0.2` release.
   Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[PR] Handle null values of centralized schema config in PartialMergeTask for backwards compatibility (druid)

2024-06-05 Thread via GitHub


kfaraz opened a new pull request, #16556:
URL: https://github.com/apache/druid/pull/16556

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [I] Issue with PostAggregator arrayOfDoublesSketchConstant in latest Druid 29.0.1 (druid)

2024-06-05 Thread via GitHub


frankgrimes97 commented on issue #16539:
URL: https://github.com/apache/druid/issues/16539#issuecomment-2150405429

   @cryptoe Any chance a `29.0.2` or even a `30.0.0-rc3` release could be cut 
with this fix (once approved, CI-tested, and merged to `master` on PR-16550)?
   Choosing between waiting until a September `31.0.0` release and 
forking/building our own distribution image for a bugfix feels a bit painful. 
   Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[PR] Fix incorrect comparator usage in FinalizingFieldAccessPostAggregator (druid)

2024-06-05 Thread via GitHub


kgyrtkirk opened a new pull request, #16555:
URL: https://github.com/apache/druid/pull/16555

   * Fixes #16554 by using the `TypeStrategy` instead of the `Aggregator`'s 
comparator.
   * removed the fallback case when the column was not found - I don't think 
that should be allowed to happen...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[I] Incorrect comparator usage in FinalizingPostAggregators (druid)

2024-06-05 Thread via GitHub


kgyrtkirk opened a new issue, #16554:
URL: https://github.com/apache/druid/issues/16554

   Queries like
   ```
   
   
   This affects aggregators which have a non-trivial finalization, i.e. 
`AggregatorFactory#getResultType() != AggregatorFactory#getIntermediateType()` 
(so essentially those which most likely need some work to be done at finalization).
   
   The comparator for the `postAggregator` is derived from the underlying `aggregator` 
[here](https://github.com/apache/druid/blob/30c59042e0cde3c76392afa3ba73d0830b11b0a5/processing/src/main/java/org/apache/druid/query/aggregation/post/FinalizingFieldAccessPostAggregator.java#L124)
 - so a comparison of finalized values may be attempted with the aggregator's 
comparator.
   
   This can cause an exception when the postAggregator's output is being 
used to order the rows.
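A minimal, Druid-independent illustration of that failure mode; the `long[]` key type is a stand-in for the aggregator's intermediate type (`SerializablePair` in the stack trace below):

```
import java.util.Comparator;

public class WrongComparatorDemo
{
  public static void main(String[] args)
  {
    // Comparator built for the intermediate type (long[] stands in for SerializablePair).
    Comparator<Object> intermediateComparator =
        Comparator.comparingLong(o -> ((long[]) o)[0]);

    // Finalization already turned the values into Floats, so comparing them with
    // the intermediate-type comparator throws ClassCastException.
    intermediateComparator.compare(0.25f, 0.75f);
  }
}
```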
   
   
   This causes a ClassCastException
   
   ```
   java.lang.ClassCastException: class java.lang.Float cannot be cast to class 
org.apache.druid.collections.SerializablePair (java.lang.Float is in module 
java.base of loader 'bootstrap'; org.apache.druid.collections.SerializablePair 
is in unnamed module of loader 'app')
at 
java.base/java.util.Collections$ReverseComparator2.compare(Collections.java:5278)
at 
java.base/java.util.Comparators$NullComparator.compare(Comparators.java:83)
at 
org.apache.druid.query.topn.TopNNumericResultBuilder.lambda$0(TopNNumericResultBuilder.java:75)
at 
java.base/java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:675)
at java.base/java.util.PriorityQueue.siftUp(PriorityQueue.java:652)
at java.base/java.util.PriorityQueue.offer(PriorityQueue.java:345)
at java.base/java.util.PriorityQueue.add(PriorityQueue.java:326)
at 
org.apache.druid.query.topn.TopNNumericResultBuilder.addEntry(TopNNumericResultBuilder.java:157)
at 
org.apache.druid.query.topn.TopNNumericResultBuilder.addEntry(TopNNumericResultBuilder.java:1)
at 
org.apache.druid.query.topn.PooledTopNAlgorithm.updateResults(PooledTopNAlgorithm.java:765)
at 
org.apache.druid.query.topn.PooledTopNAlgorithm.updateResults(PooledTopNAlgorithm.java:1)
at 
org.apache.druid.query.topn.BaseTopNAlgorithm.runWithCardinalityKnown(BaseTopNAlgorithm.java:121)
at 
org.apache.druid.query.topn.BaseTopNAlgorithm.run(BaseTopNAlgorithm.java:82)
at org.apache.druid.query.topn.TopNMapFn.apply(TopNMapFn.java:70)
at 
org.apache.druid.query.topn.TopNQueryEngine.lambda$0(TopNQueryEngine.java:100)
at 
org.apache.druid.java.util.common.guava.MappingAccumulator.accumulate(MappingAccumulator.java:40)
at 
org.apache.druid.java.util.common.guava.FilteringAccumulator.accumulate(FilteringAccumulator.java:41)
at 
org.apache.druid.java.util.common.guava.MappingAccumulator.accumulate(MappingAccumulator.java:40)
at 
org.apache.druid.java.util.common.guava.BaseSequence.accumulate(BaseSequence.java:44)
at 
org.apache.druid.java.util.common.guava.MappedSequence.accumulate(MappedSequence.java:43)
at 
org.apache.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50)
at 
org.apache.druid.java.util.common.guava.SequenceWrapper.wrap(SequenceWrapper.java:55)
at 
org.apache.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45)
at 
org.apache.druid.java.util.common.guava.FilteredSequence.accumulate(FilteredSequence.java:45)
at 
org.apache.druid.java.util.common.guava.MappedSequence.accumulate(MappedSequence.java:43)
at 
org.apache.druid.java.util.common.guava.FilteredSequence.accumulate(FilteredSequence.java:45)
at 
org.apache.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50)
at 
org.apache.druid.java.util.common.guava.SequenceWrapper.wrap(SequenceWrapper.java:55)
at 
org.apache.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45)
at 
org.apache.druid.query.spec.SpecificSegmentQueryRunner$1.accumulate(SpecificSegmentQueryRunner.java:98)
at 
org.apache.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50)
at 
org.apache.druid.query.spec.SpecificSegmentQueryRunner.doNamed(SpecificSegmentQueryRunner.java:185)
at 
org.apache.druid.query.spec.SpecificSegmentQueryRunner.access$1(SpecificSegmentQueryRunner.java:181)
at 
org.apache.druid.query.spec.SpecificSegmentQueryRunner$2.wrap(SpecificSegmentQueryRunner.java:165)
at 
org.apache.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45)
at 
org.apache.druid.java.util.common.guava.Sequence.toList(Sequence.java:87)
at 
org.apache.druid.query.ChainedExecutionQueryRunner$1$3.call(ChainedExecutionQueryRunner.java:112)
at 
org.apache.druid.query.ChainedExecutionQueryRunner$1$3.call(ChainedExecutionQueryRunner.java:1)
at 

Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)

2024-06-05 Thread via GitHub


Akshat-Jain commented on code in PR #16481:
URL: https://github.com/apache/druid/pull/16481#discussion_r1627923604


##
extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/RetryableS3OutputStream.java:
##
@@ -269,52 +213,74 @@ public void close() throws IOException
 // Closeables are closed in LIFO order
 closer.register(() -> {
   // This should be emitted as a metric
+  long totalChunkSize = (currentChunk.id - 1) * chunkSize + 
currentChunk.length();

Review Comment:
   > To me, having the map seems cleaner as we know which future belongs to 
which chunk
   
   @kfaraz The returned future (if successful) contains UploadPartResult which 
contains the partNumber which is the chunk ID, if that helps.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[PR] Docs: Remove circular link (druid)

2024-06-05 Thread via GitHub


ektravel opened a new pull request, #16553:
URL: https://github.com/apache/druid/pull/16553

   Remove circular link.
   
   This PR has:
   
   - [x] been self-reviewed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)

2024-06-05 Thread via GitHub


kfaraz commented on code in PR #16481:
URL: https://github.com/apache/druid/pull/16481#discussion_r1627919294


##
extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/RetryableS3OutputStream.java:
##
@@ -269,52 +213,74 @@ public void close() throws IOException
 // Closeables are closed in LIFO order
 closer.register(() -> {
   // This should be emitted as a metric
+  long totalChunkSize = (currentChunk.id - 1) * chunkSize + 
currentChunk.length();

Review Comment:
   > It allows us to not have a Map unnecessarily
   
   To me, having the map seems cleaner as we know which future belongs to which 
chunk. I don't see how it would affect the rest of the code as we just need to 
put stuff in the map and then iterate over it in `completeMultipartUpload()` 
and then while determining the size (which could even just be returned by the 
`completeMultipartUpload()` method).
   
   That said, I don't really mind the current approach either. I guess the map 
is more of a personal preference in this case.
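For context, a rough sketch of the Map-based bookkeeping being discussed; the names and the `Future<Long>` part-size payload are illustrative stand-ins, not the PR's actual types:

```
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Future;

// Illustrative only: tracks one pending upload future per chunk id so the total
// uploaded size can be summed once all parts have completed.
class ChunkUploadTracker
{
  private final Map<Integer, Future<Long>> partSizeFutures = new ConcurrentHashMap<>();

  void track(int chunkId, Future<Long> partSizeFuture)
  {
    partSizeFutures.put(chunkId, partSizeFuture);
  }

  long awaitAllAndSumSizes() throws Exception
  {
    long totalBytes = 0;
    for (Future<Long> partSize : partSizeFutures.values()) {
      totalBytes += partSize.get();  // waits for each part upload to finish
    }
    return totalBytes;
  }
}
```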



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Optimise indexer resource usage based on CPU profile (druid)

2024-06-05 Thread via GitHub


kgyrtkirk commented on code in PR #16517:
URL: https://github.com/apache/druid/pull/16517#discussion_r1627535702


##
processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java:
##
@@ -132,12 +142,14 @@ public EncodedKeyComponent 
processRowValsToUnsortedEncodedKeyComponent(@N
 } else if (dimValues instanceof byte[]) {
   encodedDimensionValues =
   new 
int[]{dimLookup.add(emptyToNullIfNeeded(StringUtils.encodeBase64String((byte[]) 
dimValues)))};
+  dictionaryChanged = true;
 } else {
   encodedDimensionValues = new 
int[]{dimLookup.add(emptyToNullIfNeeded(dimValues))};
+  dictionaryChanged = true;
 }
 
 // If dictionary size has changed, the sorted lookup is no longer valid.
-if (oldDictSize != dimLookup.size()) {
+if (dictionaryChanged) {

Review Comment:
   I believe the main purpose of the `dimLookup` field is to provide a common 
place where the different strings get an assigned number - so the fact that it's 
able to recall an earlier value is actually very important ... this change will 
destroy the `sortedLookup` cache even when all calls to `add` were hits.
   
   I think you should either remove the changes in this file - or find a 
different way to address this; one idea could be to tie the `sortedLookup` 
more closely to when an add actually happens (see the sketch after this comment):
   * move the sorted dictionary caching to `StringDimensionDictionary`
   * extract a protected `realAdd` from the current `add` in `DimensionDictionary`
   * implement `realAdd` to clear the cached `sorted` state in `StringDimensionDictionary`
   
   again, let me invite @clintropolis to see if he has a different idea :)
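A rough, hypothetical sketch of that shape (greatly simplified; the real classes are thread-safe and hold more state):

```
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed split: `add` handles the lookup-hit path,
// `realAdd` performs the actual insertion, and only real insertions invalidate
// the sorted-lookup cache.
class DimensionDictionarySketch
{
  private final Map<String, Integer> idLookup = new HashMap<>();

  final int add(String value)
  {
    Integer existing = idLookup.get(value);
    if (existing != null) {
      return existing;        // a hit must not invalidate anything
    }
    return realAdd(value);
  }

  protected int realAdd(String value)
  {
    int id = idLookup.size();
    idLookup.put(value, id);
    return id;
  }
}

class StringDimensionDictionarySketch extends DimensionDictionarySketch
{
  private volatile int[] sortedLookupCache;

  @Override
  protected int realAdd(String value)
  {
    sortedLookupCache = null;  // clear the cached sorted lookup only on real adds
    return super.realAdd(value);
  }
}
```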



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Optimise indexer resource usage based on CPU profile (druid)

2024-06-05 Thread via GitHub


kgyrtkirk commented on code in PR #16517:
URL: https://github.com/apache/druid/pull/16517#discussion_r1627535702


##
processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java:
##
@@ -132,12 +142,14 @@ public EncodedKeyComponent 
processRowValsToUnsortedEncodedKeyComponent(@N
 } else if (dimValues instanceof byte[]) {
   encodedDimensionValues =
   new 
int[]{dimLookup.add(emptyToNullIfNeeded(StringUtils.encodeBase64String((byte[]) 
dimValues)))};
+  dictionaryChanged = true;
 } else {
   encodedDimensionValues = new 
int[]{dimLookup.add(emptyToNullIfNeeded(dimValues))};
+  dictionaryChanged = true;
 }
 
 // If dictionary size has changed, the sorted lookup is no longer valid.
-if (oldDictSize != dimLookup.size()) {
+if (dictionaryChanged) {

Review Comment:
   I believe the main purpose of the `dimLookup` field is to provide a common 
place where the different strings get an assigned number - so the fact that it's 
able to recall an earlier value is actually very important ... this change will 
destroy the `sortedLookup` cache even when all calls to `add` were hits.
   
   I think you should either remove the changes in this file - or find a 
different way to address this; one idea could be to make the 
`StringDimensionIndexer` class a modification listener of `dimLookup`, so 
that when a real add happens the `sortedLookup` is automatically destroyed:
   * move the sorted dictionary caching to `StringDimensionDictionary`
   * extract a protected `realAdd` from the current `add` in `DimensionDictionary`
   * implement `realAdd` to clear the cached `sorted` state in `StringDimensionDictionary`
   
   again, let me invite @clintropolis to see if he has a different idea :)



##
processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java:
##
@@ -93,14 +95,26 @@ public EncodedKeyComponent 
processRowValsToUnsortedEncodedKeyComponent(@N
 
 if (dimValues == null) {
   final int nullId = dimLookup.getId(null);

Review Comment:
   you don't need the `if` if you use `dimLookup.add(null)` instead of `getId`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Fix a deadlock that can happen when mutliple stages execute parallelly in a worker (druid)

2024-06-05 Thread via GitHub


asdf2014 commented on code in PR #16524:
URL: https://github.com/apache/druid/pull/16524#discussion_r1627360761


##
processing/src/main/java/org/apache/druid/frame/processor/Bouncer.java:
##
@@ -32,18 +32,45 @@
 
 /**
  * Limiter for access to some resource.
- *
- * Used by {@link FrameProcessorExecutor#runAllFully} to limit the number of 
outstanding processors.
+ * 
+ * The class performs the work of a "bouncer", limiting the number of threads 
that can concurrently access the guarded
+ * critical sections by the bouncer. The bouncer is initialized with the max 
number of threads that can enter the
+ * gaurded section(s) at a time.

Review Comment:
   ```suggestion
* guarded section(s) at a time.
   ```
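As an aside for readers unfamiliar with the pattern, the behavior described in the javadoc above is essentially that of a counting semaphore; a Druid-independent sketch, not the actual Bouncer implementation:

```
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Illustrative only: at most maxConcurrency threads may be inside the guarded
// section at any time, mirroring the "bouncer" described in the javadoc above.
class SimpleBouncer
{
  private final Semaphore tickets;

  SimpleBouncer(int maxConcurrency)
  {
    this.tickets = new Semaphore(maxConcurrency);
  }

  <T> T run(Callable<T> guardedSection) throws Exception
  {
    tickets.acquire();      // blocks until a slot frees up
    try {
      return guardedSection.call();
    }
    finally {
      tickets.release();    // hand the slot to the next waiting thread
    }
  }
}
```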



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[I] Result level cache key collisions from utf8 encoding (druid)

2024-06-05 Thread via GitHub


gianm opened a new issue, #16552:
URL: https://github.com/apache/druid/issues/16552

   The `ResultLevelCachingQueryRunner` does this when it computes the cache key 
for a query:
   
   ```
   final String cacheKeyStr = 
StringUtils.fromUtf8(strategy.computeResultLevelCacheKey(query));
   ```
   
   This is bad because not all byte strings can be represented as Java Strings. 
For example, the byte arrays `[63, -48, 0, 0, 0, 0, 0, 0]` and `[63, -24, 0, 0, 
0, 0, 0, 0]` have the same representation when converted to Strings. Both have 
an invalid second character, which both get mapped to the replacement 
character. These byte arrays are the representations of doubles `0.25` and 
`0.75` respectively, which is how this was originally noticed (two queries had 
a cache key collision when they differed only in that one used `0.25` and one 
used `0.75` as a quantile parameter).
   
   To fix this we need to always use `byte[]`, never `String`, when dealing 
with cache keys.
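A small, self-contained demonstration of the collision; this is an illustration, not the Druid code path:

```
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CacheKeyCollisionDemo
{
  public static void main(String[] args)
  {
    byte[] keyA = longToBytes(Double.doubleToLongBits(0.25)); // [63, -48, 0, 0, 0, 0, 0, 0]
    byte[] keyB = longToBytes(Double.doubleToLongBits(0.75)); // [63, -24, 0, 0, 0, 0, 0, 0]

    // The invalid second byte is replaced with U+FFFD in both cases.
    String strA = new String(keyA, StandardCharsets.UTF_8);
    String strB = new String(keyB, StandardCharsets.UTF_8);

    System.out.println(Arrays.equals(keyA, keyB)); // false: distinct byte[] keys
    System.out.println(strA.equals(strB));         // true: colliding String keys
  }

  private static byte[] longToBytes(long v)
  {
    byte[] out = new byte[8];
    for (int i = 0; i < 8; i++) {
      out[i] = (byte) (v >>> (56 - 8 * i));
    }
    return out;
  }
}
```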


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)

2024-06-05 Thread via GitHub


Akshat-Jain commented on code in PR #16481:
URL: https://github.com/apache/druid/pull/16481#discussion_r1627296496


##
extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/RetryableS3OutputStream.java:
##
@@ -269,52 +213,74 @@ public void close() throws IOException
 // Closeables are closed in LIFO order
 closer.register(() -> {
   // This should be emitted as a metric
+  long totalChunkSize = (currentChunk.id - 1) * chunkSize + 
currentChunk.length();

Review Comment:
   It allows us to avoid an unnecessary Map and helps the rest of the code be 
cleaner and simpler, which I really liked. Can you please elaborate more on why 
you'd prefer adding them individually? Thanks!
   
   PS: Can the chunkSize change in the middle of the operation? (in case of 
node failures or rolling upgrades where we update the chunk size or something 
along those lines?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)

2024-06-05 Thread via GitHub


Akshat-Jain commented on code in PR #16481:
URL: https://github.com/apache/druid/pull/16481#discussion_r1627284471


##
extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/S3UploadManager.java:
##
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.storage.s3.output;
+
+import com.amazonaws.services.s3.model.UploadPartRequest;
+import com.amazonaws.services.s3.model.UploadPartResult;
+import com.google.common.annotations.VisibleForTesting;
+import com.google.inject.Inject;
+import org.apache.druid.guice.ManageLifecycle;
+import org.apache.druid.java.util.common.RetryUtils;
+import org.apache.druid.java.util.common.concurrent.Execs;
+import org.apache.druid.java.util.common.lifecycle.LifecycleStart;
+import org.apache.druid.java.util.common.lifecycle.LifecycleStop;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.druid.storage.s3.S3Utils;
+import org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3;
+import org.apache.druid.utils.RuntimeInfo;
+
+import java.io.File;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Future;
+
+/**
+ * This class manages uploading files to S3 in chunks, while ensuring that the
+ * number of chunks currently present on local disk does not exceed a specific 
limit.
+ */
+@ManageLifecycle
+public class S3UploadManager
+{
+  private final ExecutorService uploadExecutor;
+
+  private static final Logger log = new Logger(S3UploadManager.class);
+
+  @Inject
+  public S3UploadManager(S3OutputConfig s3OutputConfig, S3ExportConfig 
s3ExportConfig, RuntimeInfo runtimeInfo)
+  {
+int poolSize = Math.max(4, runtimeInfo.getAvailableProcessors());
+int maxNumConcurrentChunks = computeMaxNumConcurrentChunks(s3OutputConfig, 
s3ExportConfig);
+this.uploadExecutor = createExecutorService(poolSize, 
maxNumConcurrentChunks);
+log.info("Initialized executor service for S3 multipart upload with pool 
size [%d] and work queue capacity [%d]",
+ poolSize, maxNumConcurrentChunks);
+  }
+
+  /**
+   * Computes the maximum number of concurrent chunks for an S3 multipart 
upload.
+   * We want to determine the maximum number of concurrent chunks on disk 
based on the maximum value of chunkSize
+   * between the 2 configs: S3OutputConfig and S3ExportConfig.
+   *
+   * @param s3OutputConfig  The S3 output configuration, which may specify a 
custom chunk size.
+   * @param s3ExportConfig  The S3 export configuration, which may also 
specify a custom chunk size.
+   * @return The maximum number of concurrent chunks.
+   */
+  @VisibleForTesting
+  int computeMaxNumConcurrentChunks(S3OutputConfig s3OutputConfig, 
S3ExportConfig s3ExportConfig)
+  {
+long chunkSize = S3OutputConfig.S3_MULTIPART_UPLOAD_MIN_PART_SIZE_BYTES;
+if (s3OutputConfig != null && s3OutputConfig.getChunkSize() != null) {
+  chunkSize = Math.max(chunkSize, s3OutputConfig.getChunkSize());
+}
+if (s3ExportConfig != null && s3ExportConfig.getChunkSize() != null) {
+  chunkSize = Math.max(chunkSize, 
s3ExportConfig.getChunkSize().getBytes());
+}
+
+return (int) (S3OutputConfig.S3_MULTIPART_UPLOAD_MAX_PART_SIZE_BYTES / 
chunkSize);

Review Comment:
   @kfaraz I was trying to do it similar to how it's done in 
`WorkerStorageParameters`. IIUC, in `WorkerStorageParameters`, we have a 
hard-coded size of 1 GB (`MINIMUM_SUPER_SORTER_TMP_STORAGE_BYTES = 
1_000_000_000L;`) which is used instead of computing the overall available disk 
space.
   
   Also, I don't think we should use the overall available disk size for this 
calculation; we should rather have a hard limit. I chose 5 GB as that's the 
maximum part size allowed by S3. In the default case, the chunk size is 100 MB, so 
this means around 50 chunks allowed on disk at once, which seems fine?
   
   Thoughts? cc: @cryptoe 
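For reference, the arithmetic behind the "around 50 chunks" figure; the byte values here are assumptions taken from the comment, not read from the config classes:

```
public class ChunkMath
{
  public static void main(String[] args)
  {
    long maxPartSizeBytes = 5L * 1024 * 1024 * 1024;  // 5 GiB, S3's maximum part size
    long defaultChunkSize = 100L * 1024 * 1024;       // 100 MiB default chunk size
    System.out.println(maxPartSizeBytes / defaultChunkSize); // 51 chunks on disk at most
  }
}
```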



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:

Re: [PR] Fix serde for ArrayOfDoublesSketchConstantPostAggregator. (druid)

2024-06-05 Thread via GitHub


gianm commented on code in PR #16550:
URL: https://github.com/apache/druid/pull/16550#discussion_r1627259662


##
extensions-core/datasketches/src/test/java/org/apache/druid/query/aggregation/datasketches/tuple/ArrayOfDoublesSketchConstantPostAggregatorTest.java:
##
@@ -66,6 +67,24 @@ public void testComparator()
 Assert.assertEquals(Comparators.alwaysEqual(), postAgg.getComparator());
   }
 
+  @Test
+  public void testSerde() throws JsonProcessingException
+  {
+final PostAggregator there = new 
ArrayOfDoublesSketchConstantPostAggregator(
+"p",
+
"AQEJAwgBzJP/fwIAzT6NGdX0aWUOJvS5EIhpLwA="
+);
+DefaultObjectMapper mapper = new DefaultObjectMapper();
+mapper.registerModules(new 
ArrayOfDoublesSketchModule().getJacksonModules());
+PostAggregator andBackAgain = mapper.readValue(
+mapper.writeValueAsString(there),
+PostAggregator.class
+);

Review Comment:
   Perhaps we could use some approach like `EqualsVerifier`. Whatever magic it 
does to verify `equals` and `hashCode` could possibly be used to verify serde 
as well? I've never looked inside it to see what it's doing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Fix serde for ArrayOfDoublesSketchConstantPostAggregator. (druid)

2024-06-05 Thread via GitHub


gianm commented on PR #16550:
URL: https://github.com/apache/druid/pull/16550#issuecomment-2149208873

   Fixed the excessive imports. Thanks for reviewing!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [I] middleManager container is unable to connect a zookeeper container under bridge mode network (druid)

2024-06-05 Thread via GitHub


vladComan0 commented on issue #9522:
URL: https://github.com/apache/druid/issues/9522#issuecomment-2149195180

   @akamensky Any luck finding the answer? I got that same error out of nowhere.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)

2024-06-05 Thread via GitHub


kfaraz commented on code in PR #16481:
URL: https://github.com/apache/druid/pull/16481#discussion_r1626257336


##
extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/S3UploadManager.java:
##
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.storage.s3.output;
+
+import com.amazonaws.services.s3.model.UploadPartRequest;
+import com.amazonaws.services.s3.model.UploadPartResult;
+import com.google.common.annotations.VisibleForTesting;
+import com.google.inject.Inject;
+import org.apache.druid.guice.ManageLifecycle;
+import org.apache.druid.java.util.common.RetryUtils;
+import org.apache.druid.java.util.common.concurrent.Execs;
+import org.apache.druid.java.util.common.lifecycle.LifecycleStart;
+import org.apache.druid.java.util.common.lifecycle.LifecycleStop;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.druid.storage.s3.S3Utils;
+import org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3;
+import org.apache.druid.utils.RuntimeInfo;
+
+import java.io.File;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Future;
+
+/**
+ * This class manages uploading files to S3 in chunks, while ensuring that the
+ * number of chunks currently present on local disk does not exceed a specific 
limit.
+ */
+@ManageLifecycle
+public class S3UploadManager
+{
+  private final ExecutorService uploadExecutor;
+
+  private static final Logger log = new Logger(S3UploadManager.class);
+
+  @Inject
+  public S3UploadManager(S3OutputConfig s3OutputConfig, S3ExportConfig 
s3ExportConfig, RuntimeInfo runtimeInfo)
+  {
+int poolSize = Math.max(4, runtimeInfo.getAvailableProcessors());
+int maxNumConcurrentChunks = computeMaxNumConcurrentChunks(s3OutputConfig, 
s3ExportConfig);
+this.uploadExecutor = createExecutorService(poolSize, 
maxNumConcurrentChunks);
+log.info("Initialized executor service for S3 multipart upload with pool 
size [%d] and work queue capacity [%d]",
+ poolSize, maxNumConcurrentChunks);
+  }
+
+  /**
+   * Computes the maximum number of concurrent chunks for an S3 multipart 
upload.
+   * We want to determine the maximum number of concurrent chunks on disk 
based on the maximum value of chunkSize
+   * between the 2 configs: S3OutputConfig and S3ExportConfig.
+   *
+   * @param s3OutputConfig  The S3 output configuration, which may specify a 
custom chunk size.
+   * @param s3ExportConfig  The S3 export configuration, which may also 
specify a custom chunk size.
+   * @return The maximum number of concurrent chunks.
+   */
+  @VisibleForTesting
+  int computeMaxNumConcurrentChunks(S3OutputConfig s3OutputConfig, 
S3ExportConfig s3ExportConfig)
+  {
+long chunkSize = S3OutputConfig.S3_MULTIPART_UPLOAD_MIN_PART_SIZE_BYTES;
+if (s3OutputConfig != null && s3OutputConfig.getChunkSize() != null) {
+  chunkSize = Math.max(chunkSize, s3OutputConfig.getChunkSize());
+}
+if (s3ExportConfig != null && s3ExportConfig.getChunkSize() != null) {
+  chunkSize = Math.max(chunkSize, 
s3ExportConfig.getChunkSize().getBytes());
+}
+
+return (int) (S3OutputConfig.S3_MULTIPART_UPLOAD_MAX_PART_SIZE_BYTES / 
chunkSize);

Review Comment:
   Shouldn't the numerator be the overall disk size available for keeping 
chunks?
   The value used here, `S3OutputConfig.S3_MULTIPART_UPLOAD_MAX_PART_SIZE_BYTES`, 
is 5 GB, which just seems to be the maximum part size limit imposed by S3.



##
extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/S3UploadManager.java:
##
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software 

Re: [PR] Fallback vectorization for FunctionExpr and BaseMacroFunctionExpr. (druid)

2024-06-05 Thread via GitHub


clintropolis commented on code in PR #16366:
URL: https://github.com/apache/druid/pull/16366#discussion_r1627216880


##
processing/src/main/java/org/apache/druid/query/filter/vector/StringObjectVectorValueMatcher.java:
##
@@ -58,7 +58,7 @@ public ReadableVectorMatch match(final ReadableVectorMatch 
mask, boolean include
 
 for (int i = 0; i < mask.getSelectionSize(); i++) {
   final int rowNum = mask.getSelection()[i];
-  if ((value == null && includeUnknown) || Objects.equals(value, 
vector[rowNum])) {
+  if ((vector[rowNum] == null && includeUnknown) || 
Objects.equals(value, vector[rowNum])) {

Review Comment:
   nice :+1:



##
processing/src/main/java/org/apache/druid/math/expr/vector/FallbackVectorProcessor.java:
##
@@ -0,0 +1,422 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.math.expr.vector;
+
+import org.apache.druid.error.DruidException;
+import org.apache.druid.math.expr.Expr;
+import org.apache.druid.math.expr.ExprEval;
+import org.apache.druid.math.expr.ExprMacroTable;
+import org.apache.druid.math.expr.ExprType;
+import org.apache.druid.math.expr.ExpressionType;
+import org.apache.druid.math.expr.Function;
+
+import javax.annotation.Nullable;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Objects;
+import java.util.function.Supplier;
+
+/**
+ * Implementation of {@link ExprVectorProcessor} that adapts non-vectorized 
{@link Expr#eval(Expr.ObjectBinding)}.
+ * This allows non-vectorized expressions to participate in vectorized queries.
+ */
+public abstract class FallbackVectorProcessor implements 
ExprVectorProcessor
+{
+  final Supplier> fn;
+  final List adaptedArgs;
+
+  private final ExpressionType outputType;
+
+  private FallbackVectorProcessor(
+  final Supplier> fn,
+  final List adaptedArgs,
+  final ExpressionType outputType
+  )
+  {
+this.fn = fn;
+this.adaptedArgs = adaptedArgs;
+this.outputType = outputType;
+  }
+
+  /**
+   * Create a processor for a non-vectorizable {@link Function}.
+   */
+  public static  FallbackVectorProcessor create(
+  final Function function,
+  final List args,
+  final Expr.VectorInputBindingInspector inspector
+  )
+  {
+final List adaptedArgs = makeAdaptedArgs(args, inspector);
+return makeFallbackProcessor(
+() -> function.apply(adaptedArgs, UnusedBinding.INSTANCE),
+adaptedArgs,
+function.getOutputType(inspector, args),
+inspector
+);
+  }
+
+  /**
+   * Create a processor for a non-vectorizable {@link 
ExprMacroTable.ExprMacro}.
+   */
+  public static  FallbackVectorProcessor create(
+  final ExprMacroTable.ExprMacro macro,
+  final List args,
+  final Expr.VectorInputBindingInspector inspector
+  )
+  {
+final List adaptedArgs = makeAdaptedArgs(args, inspector);
+final Expr adaptedExpr = macro.apply(adaptedArgs);
+return makeFallbackProcessor(
+() -> adaptedExpr.eval(UnusedBinding.INSTANCE),
+adaptedArgs,
+adaptedExpr.getOutputType(inspector),
+inspector
+);
+  }
+
+  /**
+   * Helper for the two {@link #create} methods. Makes {@link AdaptedExpr} 
that can replace the original args to
+   * the {@link Expr} that requires fallback.
+   *
+   * @param args  args to the original expr
+   * @param inspector binding inspector
+   *
+   * @return list of {@link AdaptedExpr}
+   */
+  private static List makeAdaptedArgs(
+  final List args,
+  final Expr.VectorInputBindingInspector inspector
+  )
+  {
+final List adaptedArgs = new ArrayList<>(args.size());
+
+for (final Expr arg : args) {
+  adaptedArgs.add(new AdaptedExpr(arg.asVectorProcessor(inspector), arg));
+}
+
+return adaptedArgs;
+  }
+
+  /**
+   * Helper for the two {@link #create} methods.
+   *
+   * @param fn  eval function that uses the "adaptedArgs" as inputs
+   * @param adaptedArgs adapted args from {@link #makeAdaptedArgs(List, 
Expr.VectorInputBindingInspector)}
+   * @param outputType  output type of the eval from "fn"
+   * @param inspector   binding inspector
+   */
+  @SuppressWarnings({"unchecked", "rawtypes"})
+ 

Re: [PR] rework cursor creation (druid)

2024-06-05 Thread via GitHub


clintropolis commented on code in PR #16533:
URL: https://github.com/apache/druid/pull/16533#discussion_r1627197171


##
processing/src/main/java/org/apache/druid/segment/CursorFactory.java:
##
@@ -35,12 +38,55 @@
  */
 public interface CursorFactory
 {
+  default CursorMaker asCursorMaker(CursorBuildSpec spec)
+  {
+
+return new CursorMaker()
+{
+  @Override
+  public boolean canVectorize()
+  {
+return CursorFactory.this.canVectorize(spec.getFilter(), 
spec.getVirtualColumns(), spec.isDescending());
+  }
+
+  @Override
+  public Sequence makeCursors()
+  {
+return CursorFactory.this.makeCursors(
+spec.getFilter(),
+spec.getInterval(),
+spec.getVirtualColumns(),
+spec.getGranularity(),
+spec.isDescending(),
+spec.getQueryMetrics()
+);

Review Comment:
   this is intended so that existing `CursorFactory`/`StorageAdapter` 
implementations can keep working with query engines without implementing 
`asCursorMaker`



##
processing/src/main/java/org/apache/druid/segment/CursorFactory.java:
##
@@ -35,12 +38,55 @@
  */
 public interface CursorFactory
 {
+  default CursorMaker asCursorMaker(CursorBuildSpec spec)
+  {
+
+return new CursorMaker()
+{
+  @Override
+  public boolean canVectorize()
+  {
+return CursorFactory.this.canVectorize(spec.getFilter(), 
spec.getVirtualColumns(), spec.isDescending());
+  }
+
+  @Override
+  public Sequence makeCursors()
+  {
+return CursorFactory.this.makeCursors(
+spec.getFilter(),
+spec.getInterval(),
+spec.getVirtualColumns(),
+spec.getGranularity(),
+spec.isDescending(),
+spec.getQueryMetrics()
+);
+  }
+
+  @Override
+  public VectorCursor makeVectorCursor()
+  {
+return CursorFactory.this.makeVectorCursor(
+spec.getFilter(),
+spec.getInterval(),
+spec.getVirtualColumns(),
+spec.isDescending(),
+spec.getQueryContext().getVectorSize(),
+spec.getQueryMetrics()
+);

Review Comment:
   this is intended so that existing `CursorFactory`/`StorageAdapter` 
implementations can keep working with query engines without implementing 
`asCursorMaker`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] rework cursor creation (druid)

2024-06-05 Thread via GitHub


clintropolis commented on code in PR #16533:
URL: https://github.com/apache/druid/pull/16533#discussion_r1627196668


##
processing/src/main/java/org/apache/druid/segment/CursorFactory.java:
##
@@ -35,12 +38,55 @@
  */
 public interface CursorFactory
 {
+  default CursorMaker asCursorMaker(CursorBuildSpec spec)
+  {
+
+return new CursorMaker()
+{
+  @Override
+  public boolean canVectorize()
+  {
+return CursorFactory.this.canVectorize(spec.getFilter(), 
spec.getVirtualColumns(), spec.isDescending());

Review Comment:
   this is intended so that existing `CursorFactory`/`StorageAdapter` 
implementations can keep working with query engines without implementing 
`asCursorMaker`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



Re: [PR] Optimise indexer resource usage based on CPU profile (druid)

2024-06-05 Thread via GitHub


kgyrtkirk commented on code in PR #16517:
URL: https://github.com/apache/druid/pull/16517#discussion_r1627164179


##
processing/src/main/java/org/apache/druid/segment/DictionaryEncodedColumnMerger.java:
##
@@ -322,7 +322,7 @@ public void processMergedRow(ColumnValueSelector selector) 
throws IOException
 if (encodedValueSerializer instanceof ColumnarMultiIntsSerializer) {
   ((ColumnarMultiIntsSerializer) encodedValueSerializer).addValues(row);
 } else {
-  int value = row.size() == 0 ? 0 : row.get(0);
+  int value = rowSize == 0 ? 0 : row.get(0);

Review Comment:
   oh - that's true; and it's being used as well around here!
   
   but how it works now is that it will call `getRow` for every `size` and 
`get(i)` invocation as well [see 
here](https://github.com/apache/druid/blob/30c59042e0cde3c76392afa3ba73d0830b11b0a5/processing/src/main/java/org/apache/druid/segment/DictionaryEncodedColumnIndexer.java#L161)
   
   the costly part is producing the `row` itself, currently built by 
[StringDimensionIndexer$IndexerDimensionSelector#getRow](https://github.com/apache/druid/blob/30c59042e0cde3c76392afa3ba73d0830b11b0a5/processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java#L304)
 - if that could be cached somehow, that would give the same benefits.
   There is better access to the internals there; I believe that if the "same" entry is 
being used (`currEntry.get()`) then it might be OK to return the previously 
built row (see the sketch after this comment).
   Although 
[IncrementalIndexRow](https://github.com/apache/druid/blob/30c59042e0cde3c76392afa3ba73d0830b11b0a5/processing/src/main/java/org/apache/druid/segment/incremental/IncrementalIndexRow.java#L110)
 looks partially mutable, since the `timestamp` can't be set I think 
it's not intended to be reused - so I think you could put caching logic there 
based on whether `currEntry.get()` is the same.
   
   to make sure I don't send you on a wild goose chase, I'll ask 
@clintropolis to double-check the above thinking :)
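A tiny, hypothetical sketch of that caching idea; `currEntry` and `buildRow` are stand-ins for the indexer internals, not the actual Druid code:

```
// Rebuild the encoded row only when the current entry changes; otherwise reuse
// the previously built row, as suggested above.
class CachingRowSupplier
{
  private Object lastEntry;
  private int[] lastRow;

  int[] getRow(Object currEntry)
  {
    if (currEntry != lastEntry) {
      lastRow = buildRow(currEntry);
      lastEntry = currEntry;
    }
    return lastRow;
  }

  private int[] buildRow(Object entry)
  {
    // placeholder for the expensive encoding done by
    // StringDimensionIndexer$IndexerDimensionSelector#getRow
    return new int[0];
  }
}
```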



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org


