Re: [PR] docs: Migration guide for MVDs to arrays (druid)
vtlim commented on code in PR #16516:
URL: https://github.com/apache/druid/pull/16516#discussion_r1630030691

## docs/release-info/migr-mvd-array.md: ##
@@ -0,0 +1,135 @@
+---
+id: migr-mvd-array
+title: "Migration guide: MVDs to arrays"
+sidebar_label: MVDs to arrays
+---
+
+Druid now supports SQL-compliant [arrays](../querying/arrays.md), and we recommend that people use arrays over [multi-value dimensions](../querying/multi-value-dimensions.md) (MVDs) whenever possible.
+Use arrays for new projects and complex use cases involving multiple data types. Use MVDs for specific use cases, such as operating directly on individual elements like regular strings. If your operations involve complete arrays of values, including the ordering of values within a row, use arrays over MVDs.
+
+## Comparison between arrays and MVDs
+
+The following table compares the general behavior between arrays and MVDs.
+For specific query differences between arrays and MVDs, see [Query differences between arrays and MVDs](#query-differences-between-arrays-and-mvds).
+
+| | Arrays | Multi-value dimensions (MVDs) |
+|---|---|---|
+| Data types | Supports VARCHAR, BIGINT, and DOUBLE types (ARRAY<VARCHAR>, ARRAY<BIGINT>, ARRAY<DOUBLE>) | Only supports arrays of strings (VARCHAR) |
+| SQL compliance | Behaves like standard SQL arrays with SQL-compliant behavior | Does not behave like standard SQL arrays; requires special SQL functions |
+| Ingestion | JSON arrays are ingested as Druid arrays.<br/>Managed through the query context parameter `arrayIngestMode` in SQL-based ingestion (supported options: `array`, `mvd`, `none`). Note that if you set this mode to `none`, Druid raises an exception if you try to store any type of array. | JSON arrays are ingested as multi-value dimensions.<br/>Managed using functions like [ARRAY_TO_MV](../querying/sql-functions.md#array_to_mv) in SQL-based ingestion |
+| Filtering and grouping | Filters and groupings match the entire array value.<br/>Can be used as GROUP BY keys, grouping based on the entire array value | Filters match any value within the array.<br/>Grouping generates a group for each individual value, similar to an implicit UNNEST |
+| Conversion | Convert an MVD to an array using [MV_TO_ARRAY](../querying/sql-functions.md#mv_to_array) | Convert an array to an MVD using [ARRAY_TO_MV](../querying/sql-functions.md#array_to_mv) |
+
+## Query differences between arrays and MVDs
+
+In SQL queries, Druid operates on arrays differently than MVDs.
+A value in an array column is treated as a single array entity (SQL ARRAY), whereas a value in an MVD column is treated as individual strings (SQL VARCHAR).
+
+For example, consider the same value, `['a', 'b', 'c']`, ingested into an array column and an MVD column.
+In your query, you want to filter results by comparing some value with `['a', 'b', 'c']`.
+
+* For array columns, Druid only returns the row when an equality filter matches the entire array.
+For example: `WHERE "array_column" = ARRAY['a', 'b', 'c']`.
+
+* For MVD columns, Druid returns the row when an equality filter matches any value of the MVD.
+For example, any of the following filters returns the row for the query:
+`WHERE "mvd_column" = 'a'`
+`WHERE "mvd_column" = 'b'`
+`WHERE "mvd_column" = 'c'`
+
+Note this difference between arrays and MVDs when you write queries that involve filtering or grouping.
+
+The following examples highlight a few analogous queries between arrays and MVDs.
+For more information and examples, see [Querying arrays](../querying/arrays.md#querying-arrays) and [Querying multi-value dimensions](../querying/multi-value-dimensions.md#querying-multi-value-dimensions).
+
+### Example: an element in array
+
+Filter rows that have a certain value in the array or MVD.
+
+Array
+
+```sql
+SELECT *
+FROM "array_example"
+WHERE ARRAY_CONTAINS(tags, 't3')
+```
+
+MVD
+
+```sql
+SELECT *
+FROM "mvd_example"
+WHERE tags = 't3'
+```
+
+### Example: overlap of two arrays
+
+Filter rows for which the array or MVD overlaps a reference array.
+
+Array
+
+```sql
+SELECT *
+FROM "array_example"
+WHERE ARRAY_OVERLAP(tags, ARRAY['t1', 't7'])
+```
+
+MVD
+
+```sql
+SELECT *
+FROM "mvd_example"
+WHERE MV_OVERLAP(tags, ARRAY['t1', 't7'])
+```

Review Comment:
   @2bethere you mentioned adding an example for ARRAY_OVERLAP, but these examples aren't so different between MVDs and arrays. Did you have another use case in mind?
[PR] Update site for Druid 30 [druid-website]
adarshsanjeev opened a new pull request, #277:
URL: https://github.com/apache/druid-website/pull/277

   Update 30.0.0 staging site
[PR] docs: remove outdated druidversion var from a page (druid)
317brian opened a new pull request, #16570:
URL: https://github.com/apache/druid/pull/16570

   The `{{DRUIDVERSION}}` bits on this page were added a while ago:

   - Line 368: last year (26.0.0)
   - Line 434: 4 years ago

   I think they should have been hardcoded at the time, since the statements mark a change in behavior. But since they were both added so long ago, it seems like we can just remove them instead of trying to track down the versions they correspond to.

   At a minimum, I think we should remove the 4-year-old reference. If we wanted to, we could update line 368 to the actual version.

   This PR has:
   - [x] been self-reviewed.
Re: [PR] Support Dynamic Peon Pod Template Selection in K8s extension (druid)
suneet-s commented on code in PR #16510:
URL: https://github.com/apache/druid/pull/16510#discussion_r1630003561

## extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/execution/ExecutionConfig.java: ##
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.k8s.overlord.execution;
+
+import com.fasterxml.jackson.annotation.JsonSubTypes;
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+
+/**
+ * Represents the configuration for task execution within a Kubernetes environment.
+ * This interface allows for dynamic configuration of task execution strategies based
+ * on specified behavior strategies.
+ */
+@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl = DefaultExecutionConfig.class)
+@JsonSubTypes(value = {
+    @JsonSubTypes.Type(name = "default", value = DefaultExecutionConfig.class)
+})
+public interface ExecutionConfig
+{
+  String CONFIG_KEY = "k8s.taskrunner.config";

Review Comment:
   YongGang and I discussed this offline, and I better understand the intent of this config object. It seems like this is trying to provide similar functionality as `KubernetesTaskRunnerConfig`, but via the dynamic config. It makes sense to have an encompassing dynamic config object for this extension.

   Some suggested names for this that better indicate its purpose:
   - `KubernetesTaskRunnerDynamicConfig`
   - `KubernetesPeonDynamicConfig`
   - `KubernetesTaskExecutionConfig`
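For context on the annotations quoted above: Jackson resolves the `type` property against `@JsonSubTypes`, and falls back to `defaultImpl` when the property is absent. Here is a minimal, self-contained sketch of that behavior, using hypothetical `MyConfig`/`DefaultMyConfig` classes rather than the Druid ones:

```java
import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonTypeInfo;
import com.fasterxml.jackson.databind.ObjectMapper;

public class PolymorphicConfigDemo
{
  // Mirrors the annotation pattern on ExecutionConfig: "type" selects the
  // subtype, and defaultImpl is used when "type" is absent.
  @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl = DefaultMyConfig.class)
  @JsonSubTypes(value = {
      @JsonSubTypes.Type(name = "default", value = DefaultMyConfig.class)
  })
  interface MyConfig
  {
  }

  static class DefaultMyConfig implements MyConfig
  {
  }

  public static void main(String[] args) throws Exception
  {
    final ObjectMapper mapper = new ObjectMapper();

    // Explicit type name resolves through @JsonSubTypes.
    MyConfig explicit = mapper.readValue("{\"type\": \"default\"}", MyConfig.class);

    // Missing "type" property falls back to defaultImpl.
    MyConfig fallback = mapper.readValue("{}", MyConfig.class);

    System.out.println(explicit.getClass().getSimpleName()); // DefaultMyConfig
    System.out.println(fallback.getClass().getSimpleName()); // DefaultMyConfig
  }
}
```

This fallback is what lets an empty dynamic config document deserialize to the default strategy.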
[PR] Update site for Druid 30 [druid-website-src]
adarshsanjeev opened a new pull request, #479:
URL: https://github.com/apache/druid-website-src/pull/479

   Update 30.0.0 staging site
Re: [PR] Add site for Druid 30 [druid-website-src]
adarshsanjeev closed pull request #478: Add site for Druid 30
URL: https://github.com/apache/druid-website-src/pull/478
Re: [I] Bring Azure functionality to core functionality (druid)
zachjsh closed issue #9240: Bring Azure functionality to core functionality
URL: https://github.com/apache/druid/issues/9240
Re: [I] Bring Azure functionality to core functionality (druid)
zachjsh commented on issue #9240:
URL: https://github.com/apache/druid/issues/9240#issuecomment-2153081065

   This work was done in https://github.com/apache/druid/pull/9394, along with previous commits that improved the quality of this extension. Closing this out.
Re: [PR] Move Azure extension into Core (druid)
amaechler commented on PR #9394:
URL: https://github.com/apache/druid/pull/9394#issuecomment-2152892043

   Closes #9240.
Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)
kfaraz commented on PR #16481:
URL: https://github.com/apache/druid/pull/16481#issuecomment-2152891872

   Happy to help, @Akshat-Jain! Thank you for the changes and for the discussion!
Re: [PR] [Backport] Fix backwards compatibility with centralized schema config in partial_index_merge tasks (druid)
cryptoe merged PR #16566:
URL: https://github.com/apache/druid/pull/16566
Re: [PR] Pin Curator dependencies to 5.3.0 util CURATOR-696 has been resolved (druid)
cryptoe closed pull request #16444: Pin Curator dependencies to 5.3.0 util CURATOR-696 has been resolved
URL: https://github.com/apache/druid/pull/16444
Re: [PR] Pin Curator dependencies to 5.3.0 util CURATOR-696 has been resolved (druid)
cryptoe commented on PR #16444:
URL: https://github.com/apache/druid/pull/16444#issuecomment-2152787612

   Since we are already on the path for https://github.com/apache/druid/pull/16528, closing this ticket.
Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)
Akshat-Jain commented on code in PR #16481:
URL: https://github.com/apache/druid/pull/16481#discussion_r1629670678

## extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/S3UploadManager.java: ##
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.storage.s3.output;
+
+import com.amazonaws.services.s3.model.UploadPartRequest;
+import com.amazonaws.services.s3.model.UploadPartResult;
+import com.google.common.annotations.VisibleForTesting;
+import com.google.inject.Inject;
+import org.apache.druid.guice.ManageLifecycle;
+import org.apache.druid.java.util.common.RetryUtils;
+import org.apache.druid.java.util.common.concurrent.Execs;
+import org.apache.druid.java.util.common.lifecycle.LifecycleStart;
+import org.apache.druid.java.util.common.lifecycle.LifecycleStop;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.druid.storage.s3.S3Utils;
+import org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3;
+import org.apache.druid.utils.RuntimeInfo;
+
+import java.io.File;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Future;
+
+/**
+ * This class manages uploading files to S3 in chunks, while ensuring that the
+ * number of chunks currently present on local disk does not exceed a specific limit.
+ */
+@ManageLifecycle
+public class S3UploadManager
+{
+  private final ExecutorService uploadExecutor;
+
+  private static final Logger log = new Logger(S3UploadManager.class);
+
+  @Inject
+  public S3UploadManager(S3OutputConfig s3OutputConfig, S3ExportConfig s3ExportConfig, RuntimeInfo runtimeInfo)
+  {
+    int poolSize = Math.max(4, runtimeInfo.getAvailableProcessors());
+    int maxNumConcurrentChunks = computeMaxNumConcurrentChunks(s3OutputConfig, s3ExportConfig);
+    this.uploadExecutor = createExecutorService(poolSize, maxNumConcurrentChunks);
+    log.info("Initialized executor service for S3 multipart upload with pool size [%d] and work queue capacity [%d]",
+             poolSize, maxNumConcurrentChunks);
+  }
+
+  /**
+   * Computes the maximum number of concurrent chunks for an S3 multipart upload.
+   * We want to determine the maximum number of concurrent chunks on disk based on the maximum value of chunkSize
+   * between the 2 configs: S3OutputConfig and S3ExportConfig.
+   *
+   * @param s3OutputConfig The S3 output configuration, which may specify a custom chunk size.
+   * @param s3ExportConfig The S3 export configuration, which may also specify a custom chunk size.
+   * @return The maximum number of concurrent chunks.
+   */
+  @VisibleForTesting
+  int computeMaxNumConcurrentChunks(S3OutputConfig s3OutputConfig, S3ExportConfig s3ExportConfig)

Review Comment:
   Resolving this comment as per the discussion at https://github.com/apache/druid/pull/16481#discussion_r1629278464
Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)
Akshat-Jain commented on PR #16481:
URL: https://github.com/apache/druid/pull/16481#issuecomment-2152681440

   @kfaraz Thank you for the extremely exhaustive review comments, got to learn a LOT from them throughout this PR!
Re: [I] [DRAFT] 30.0.0 release notes (druid)
adarshsanjeev commented on issue #16505:
URL: https://github.com/apache/druid/issues/16505#issuecomment-2152675032

   @frankgrimes97 Currently, Druid 30 is at a stage where only regressions can be backported.
Re: [PR] Remove incorrect utf8 conversion of ResultCache keys (druid)
github-advanced-security[bot] commented on code in PR #16569:
URL: https://github.com/apache/druid/pull/16569#discussion_r1629654479

## server/src/test/java/org/apache/druid/client/cache/CaffeineCacheTest.java: ##
@@ -65,6 +69,16 @@
     cache = CaffeineCache.create(cacheConfig);
   }

+  @Test
+  public void testKeyContainingNegativeBytest()

Review Comment:
   ## Missing Override annotation

   This method overrides [CacheTestBase.testKeyContainingNegativeBytest](1); it is advisable to add an Override annotation.

   [Show more details](https://github.com/apache/druid/security/code-scanning/7440)
Re: [I] /v2 and /v2/candidates endpoints not respecting broker partition pruning - range partition (druid)
ColeAtCharter commented on issue #16222:
URL: https://github.com/apache/druid/issues/16222#issuecomment-2152602070

   @gianm -- thanks for looking at this.

   I do think "Unexpected result example 1: /druid/v2 endpoint" should never happen if broker partition pruning is working. Druid returned a segmentId that should have been pruned at the broker. That should not happen.

   The one hypothetical "exception" is if the broker is combining the data query results with its own metadata about segments (before pruning) and including a dummy data result. However, this is a bit of a stretch.

   Our team can rerun the test and see if there are more metrics to elucidate the extent to which the data queries are fanned out, even with partition pruning enabled.
[PR] Remove incorrect utf8 conversion of ResultCache keys (druid)
kgyrtkirk opened a new pull request, #16569:
URL: https://github.com/apache/druid/pull/16569

   * remove the incorrect call to read the `byte[]` as utf8
     * not sure about its origin, but this will push the contract for `Cache` implementations to handle the `byte[]` in `NamedKey`; it's already a `byte[]`, so they shouldn't expect any better than that...
   * use `Cache.NamedKey` instead of passing `byte[]` and creating the key at multiple places
   * added some tests to ensure that cache implementations are able to accept such keys

   Fixes #16552
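For background on why decoding raw cache keys as UTF-8 is unsafe: an opaque `byte[]` key can contain sequences that are not valid UTF-8, and `String` decoding silently substitutes U+FFFD for them, so the round trip is lossy. A minimal, self-contained sketch (not Druid code):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8KeyRoundTripDemo
{
  public static void main(String[] args)
  {
    // 0x80 and 0xFF can never start a valid UTF-8 sequence, yet they are
    // perfectly legal bytes in an opaque cache key.
    byte[] key = new byte[]{0x00, (byte) 0x80, (byte) 0xFF};

    // Decoding replaces invalid sequences with U+FFFD, so re-encoding
    // produces different bytes: the "key" no longer matches the original.
    byte[] roundTripped = new String(key, StandardCharsets.UTF_8)
        .getBytes(StandardCharsets.UTF_8);

    System.out.println(Arrays.equals(key, roundTripped)); // false
  }
}
```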
Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)
kfaraz commented on code in PR #16481:
URL: https://github.com/apache/druid/pull/16481#discussion_r1629424508

## extensions-core/s3-extensions/src/test/java/org/apache/druid/storage/s3/output/S3UploadManagerTest.java: ##
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.storage.s3.output;
+
+import com.amazonaws.services.s3.model.UploadPartRequest;
+import com.amazonaws.services.s3.model.UploadPartResult;
+import org.apache.druid.java.util.common.HumanReadableBytes;
+import org.apache.druid.java.util.common.concurrent.Execs;
+import org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3;
+import org.apache.druid.utils.RuntimeInfo;
+import org.easymock.Capture;
+import org.easymock.EasyMock;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.File;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Future;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+public class S3UploadManagerTest
+{
+
+  private S3UploadManager s3UploadManager;
+  private S3OutputConfig s3OutputConfig;
+  private S3ExportConfig s3ExportConfig;
+  private static ExecutorService uploadExecutor;

Review Comment:
   this should not be static.

## extensions-core/s3-extensions/src/test/java/org/apache/druid/storage/s3/output/S3UploadManagerTest.java: ##
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.storage.s3.output;
+
+import com.amazonaws.services.s3.model.UploadPartRequest;
+import com.amazonaws.services.s3.model.UploadPartResult;
+import org.apache.druid.java.util.common.HumanReadableBytes;
+import org.apache.druid.java.util.common.concurrent.Execs;
+import org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3;
+import org.apache.druid.utils.RuntimeInfo;
+import org.easymock.Capture;
+import org.easymock.EasyMock;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.File;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Future;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+public class S3UploadManagerTest
+{
+
+  private S3UploadManager s3UploadManager;
+  private S3OutputConfig s3OutputConfig;
+  private S3ExportConfig s3ExportConfig;
+  private static ExecutorService uploadExecutor;
+
+  @Before
+  public void setUp()
+  {
+    s3OutputConfig = EasyMock.mock(S3OutputConfig.class);
+    s3ExportConfig = EasyMock.mock(S3ExportConfig.class);

Review Comment:
   Please use concrete objects instead of mocks. `S3OutputConfig` and `S3ExportConfig` are plain POJOs that can be easily constructed.

## extensions-core/s3-extensions/src/test/java/org/apache/druid/storage/s3/output/S3UploadManagerTest.java: ##
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable
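A small sketch of the per-test lifecycle the first comment above is asking for: the executor is an instance field created in `@Before` and torn down in `@After`, so nothing leaks between test methods (illustrative JUnit 4 code, not the actual test):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.junit.After;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;

public class LifecycleExampleTest
{
  // Instance field, not static: each test method gets a fresh executor.
  private ExecutorService uploadExecutor;

  @Before
  public void setUp()
  {
    uploadExecutor = Executors.newSingleThreadExecutor();
  }

  @After
  public void tearDown()
  {
    // Always release the threads, even if a test fails.
    uploadExecutor.shutdownNow();
  }

  @Test
  public void testSubmitRuns() throws Exception
  {
    Assert.assertEquals(42, (int) uploadExecutor.submit(() -> 42).get());
  }
}
```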
Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)
kfaraz commented on code in PR #16481:
URL: https://github.com/apache/druid/pull/16481#discussion_r1629419675

## extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/S3UploadManager.java: ##
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.storage.s3.output;
+
+import com.amazonaws.services.s3.model.UploadPartRequest;
+import com.amazonaws.services.s3.model.UploadPartResult;
+import com.google.common.annotations.VisibleForTesting;
+import com.google.inject.Inject;
+import org.apache.druid.guice.ManageLifecycle;
+import org.apache.druid.java.util.common.RetryUtils;
+import org.apache.druid.java.util.common.concurrent.Execs;
+import org.apache.druid.java.util.common.lifecycle.LifecycleStart;
+import org.apache.druid.java.util.common.lifecycle.LifecycleStop;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.druid.storage.s3.S3Utils;
+import org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3;
+import org.apache.druid.utils.RuntimeInfo;
+
+import java.io.File;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Future;
+
+/**
+ * This class manages uploading files to S3 in chunks, while ensuring that the
+ * number of chunks currently present on local disk does not exceed a specific limit.
+ */
+@ManageLifecycle
+public class S3UploadManager
+{
+  private final ExecutorService uploadExecutor;
+
+  private static final Logger log = new Logger(S3UploadManager.class);
+
+  @Inject
+  public S3UploadManager(S3OutputConfig s3OutputConfig, S3ExportConfig s3ExportConfig, RuntimeInfo runtimeInfo)
+  {
+    int poolSize = Math.max(4, runtimeInfo.getAvailableProcessors());
+    int maxNumConcurrentChunks = computeMaxNumConcurrentChunks(s3OutputConfig, s3ExportConfig);
+    this.uploadExecutor = createExecutorService(poolSize, maxNumConcurrentChunks);
+    log.info("Initialized executor service for S3 multipart upload with pool size [%d] and work queue capacity [%d]",
+             poolSize, maxNumConcurrentChunks);
+  }
+
+  /**
+   * Computes the maximum number of concurrent chunks for an S3 multipart upload.
+   * We want to determine the maximum number of concurrent chunks on disk based on the maximum value of chunkSize
+   * between the 2 configs: S3OutputConfig and S3ExportConfig.
+   *
+   * @param s3OutputConfig The S3 output configuration, which may specify a custom chunk size.
+   * @param s3ExportConfig The S3 export configuration, which may also specify a custom chunk size.
+   * @return The maximum number of concurrent chunks.
+   */
+  @VisibleForTesting
+  int computeMaxNumConcurrentChunks(S3OutputConfig s3OutputConfig, S3ExportConfig s3ExportConfig)
+  {
+    long chunkSize = S3OutputConfig.S3_MULTIPART_UPLOAD_MIN_PART_SIZE_BYTES;
+    if (s3OutputConfig != null && s3OutputConfig.getChunkSize() != null) {
+      chunkSize = Math.max(chunkSize, s3OutputConfig.getChunkSize());
+    }
+    if (s3ExportConfig != null && s3ExportConfig.getChunkSize() != null) {
+      chunkSize = Math.max(chunkSize, s3ExportConfig.getChunkSize().getBytes());
+    }
+
+    return (int) (S3OutputConfig.S3_MULTIPART_UPLOAD_MAX_PART_SIZE_BYTES / chunkSize);

Review Comment:
   Sounds good.
[PR] Update pac4j version (druid)
pagrawal10 opened a new pull request, #16567:
URL: https://github.com/apache/druid/pull/16567

   Fixes #.

   ### Description

   Fixed the bug ...
   Renamed the class ...
   Added a forbidden-apis entry ...

   Release note

   # Key changed/added classes in this PR
    * `MyFoo`
    * `OurBar`
    * `TheirBaz`

   This PR has:

   - [ ] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)
Akshat-Jain commented on code in PR #16481:
URL: https://github.com/apache/druid/pull/16481#discussion_r1629278464

## extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/S3UploadManager.java: ##
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.storage.s3.output;
+
+import com.amazonaws.services.s3.model.UploadPartRequest;
+import com.amazonaws.services.s3.model.UploadPartResult;
+import com.google.common.annotations.VisibleForTesting;
+import com.google.inject.Inject;
+import org.apache.druid.guice.ManageLifecycle;
+import org.apache.druid.java.util.common.RetryUtils;
+import org.apache.druid.java.util.common.concurrent.Execs;
+import org.apache.druid.java.util.common.lifecycle.LifecycleStart;
+import org.apache.druid.java.util.common.lifecycle.LifecycleStop;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.druid.storage.s3.S3Utils;
+import org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3;
+import org.apache.druid.utils.RuntimeInfo;
+
+import java.io.File;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Future;
+
+/**
+ * This class manages uploading files to S3 in chunks, while ensuring that the
+ * number of chunks currently present on local disk does not exceed a specific limit.
+ */
+@ManageLifecycle
+public class S3UploadManager
+{
+  private final ExecutorService uploadExecutor;
+
+  private static final Logger log = new Logger(S3UploadManager.class);
+
+  @Inject
+  public S3UploadManager(S3OutputConfig s3OutputConfig, S3ExportConfig s3ExportConfig, RuntimeInfo runtimeInfo)
+  {
+    int poolSize = Math.max(4, runtimeInfo.getAvailableProcessors());
+    int maxNumConcurrentChunks = computeMaxNumConcurrentChunks(s3OutputConfig, s3ExportConfig);
+    this.uploadExecutor = createExecutorService(poolSize, maxNumConcurrentChunks);
+    log.info("Initialized executor service for S3 multipart upload with pool size [%d] and work queue capacity [%d]",
+             poolSize, maxNumConcurrentChunks);
+  }
+
+  /**
+   * Computes the maximum number of concurrent chunks for an S3 multipart upload.
+   * We want to determine the maximum number of concurrent chunks on disk based on the maximum value of chunkSize
+   * between the 2 configs: S3OutputConfig and S3ExportConfig.
+   *
+   * @param s3OutputConfig The S3 output configuration, which may specify a custom chunk size.
+   * @param s3ExportConfig The S3 export configuration, which may also specify a custom chunk size.
+   * @return The maximum number of concurrent chunks.
+   */
+  @VisibleForTesting
+  int computeMaxNumConcurrentChunks(S3OutputConfig s3OutputConfig, S3ExportConfig s3ExportConfig)
+  {
+    long chunkSize = S3OutputConfig.S3_MULTIPART_UPLOAD_MIN_PART_SIZE_BYTES;
+    if (s3OutputConfig != null && s3OutputConfig.getChunkSize() != null) {
+      chunkSize = Math.max(chunkSize, s3OutputConfig.getChunkSize());
+    }
+    if (s3ExportConfig != null && s3ExportConfig.getChunkSize() != null) {
+      chunkSize = Math.max(chunkSize, s3ExportConfig.getChunkSize().getBytes());
+    }
+
+    return (int) (S3OutputConfig.S3_MULTIPART_UPLOAD_MAX_PART_SIZE_BYTES / chunkSize);

Review Comment:
   @kfaraz I had a discussion with Karan about it. The conclusion was that we don't need to block this PR on this review comment, and a follow-up task could be to add a metric for gathering data on how long the processing threads are actually blocked on this constraint. We could then tune this accordingly based on the gathered data. Hope that works!
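The throttling behavior discussed in this thread (bounding how many chunks wait on local disk) is commonly built from a fixed thread pool plus a bounded work queue whose rejection handler blocks the submitter. Below is a sketch of that pattern with hypothetical names; the PR's actual `createExecutorService` may be implemented differently:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedUploadExecutorSketch
{
  // poolSize upload threads, and at most queueCapacity pending chunk uploads.
  // A full queue makes the submitting (processing) thread block, which is
  // what limits how many chunks can accumulate on local disk.
  static ThreadPoolExecutor createBoundedExecutor(int poolSize, int queueCapacity)
  {
    return new ThreadPoolExecutor(
        poolSize,
        poolSize,
        0L,
        TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(queueCapacity),
        // Block the caller instead of throwing when the queue is full.
        (runnable, pool) -> {
          try {
            pool.getQueue().put(runnable);
          }
          catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RejectedExecutionException("Interrupted while waiting for queue space", e);
          }
        }
    );
  }
}
```

One caveat with the blocking-handler pattern: if the executor is shut down while a producer is blocked in `put`, a task can be enqueued but never run, so production code typically pairs this with a shutdown check.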
[PR] [Backport] Fix backwards compatibility with centralized schema config in partial_index_merge tasks (druid)
findingrish opened a new pull request, #16566:
URL: https://github.com/apache/druid/pull/16566

   Backports https://github.com/apache/druid/pull/16556 & https://github.com/apache/druid/pull/16565 to Druid30.
Re: [PR] Log segment signature in AbstractSegmentMetadataCache at debug level (druid)
cryptoe merged PR #16565:
URL: https://github.com/apache/druid/pull/16565
[PR] Log segment signature in AbstractSegmentMetadataCache at debug level (druid)
findingrish opened a new pull request, #16565:
URL: https://github.com/apache/druid/pull/16565

   Logging the signature for each segment could result in extensive logging, hence changing the log level to debug.
Re: [I] Broker OOM Error During SQL Execution. (druid)
lijie749808 closed issue #15343: Broker OOM Error During SQL Execution.
URL: https://github.com/apache/druid/issues/15343
Re: [PR] Fix backwards compatibility with centralized schema config in partial_index_merge tasks (druid)
cryptoe merged PR #16556:
URL: https://github.com/apache/druid/pull/16556
Re: [PR] Add IncrementalIndex test (druid)
kgyrtkirk commented on code in PR #16562:
URL: https://github.com/apache/druid/pull/16562#discussion_r1628970678

## extensions-contrib/spectator-histogram/src/test/java/org/apache/druid/spectator/histogram/SpectatorHistogramAggregatorTest.java: ##
@@ -365,6 +383,60 @@
     Assert.assertEquals(9.0, (Double) results.get(0).get(1), 0.001);
   }

+  @Test
+  public void testBuildingAndCountingHistogramsIncrementalIndex() throws Exception
+  {
+    NullHandling.initializeForTestsWithValues(true, true);
+    List<String> dimensions = Collections.singletonList("d");
+    int n = 10;
+    DateTime startOfDay = DateTime.now(DateTimeZone.UTC).withTimeAtStartOfDay();
+    List<InputRow> inputRows = new ArrayList<>(n);
+    for (int i = 1; i <= n; i++) {
+      String val = String.valueOf(i * 1.0d);
+
+      inputRows.add(new MapBasedInputRow(
+          startOfDay.plusMinutes(i),
+          dimensions,
+          ImmutableMap.of("x", i, "d", val)
+      ));
+    }
+
+    IncrementalIndex index = AggregationTestHelper.createIncrementalIndex(
+        inputRows.iterator(),
+        new NoopInputRowParser(null),
+        new AggregatorFactory[]{
+            new CountAggregatorFactory("count"),
+            new SpectatorHistogramAggregatorFactory("histogram", "x")
+        },
+        0,
+        Granularities.NONE,
+        100,
+        false
+    );
+
+    ImmutableList<Segment> segments = ImmutableList.of(
+        new IncrementalIndexSegment(index, SegmentId.dummy("test")),
+        helper.persistIncrementalIndex(index, null)
+    );
+
+    GroupByQuery query = new GroupByQuery.Builder()
+        .setDataSource("test")
+        .setGranularity(Granularities.HOUR)
+        .setInterval("1970/2050")
+        .setAggregatorSpecs(
+            new DoubleSumAggregatorFactory("doubleSum", "histogram")

Review Comment:
   I don't really see why this is beneficial at all. Why can't the user invoke a method which will translate the histogram to a literal and then aggregate those values with sum? I think an aggregate which is represented by a sketch or something could normally not be represented by a single literal number; if we want to introduce something which translates the sketch to some other type automatically, I think that should be registered properly and not hidden in the aggregator as some bonus functionality. Am I alone with this mindset?
Re: [PR] Attempt to coerce COMPLEX to number in numeric aggregators. (druid)
gianm commented on PR #16564:
URL: https://github.com/apache/druid/pull/16564#issuecomment-2151419801

   @bsyk @maytasm note I changed the last assert from #16562 from:

   ```
   Assert.assertEquals(n, (Double) results.get(0).get(1), 0.001);
   ```

   to:

   ```
   Assert.assertEquals(n * segments.size(), (Double) results.get(0).get(1), 0.001);
   ```

   Please let me know if this looks right.
[PR] Attempt to coerce COMPLEX to number in numeric aggregators. (druid)
gianm opened a new pull request, #16564:
URL: https://github.com/apache/druid/pull/16564

   PR #15371 eliminated ObjectColumnSelector's built-in implementations of numeric methods, which had been marked deprecated. However, some complex types, like SpectatorHistogram, can be successfully coerced to number. The documentation for spectator histograms encourages taking advantage of this by aggregating complex columns with doubleSum and longSum.

   Currently, this doesn't work properly for IncrementalIndex, where the behavior relied on those deprecated ObjectColumnSelector methods. This patch fixes the behavior by making two changes:

   1) SimpleXYZAggregatorFactory (XYZ = type; base class for simple numeric aggregators; all of these extend NullableNumericAggregatorFactory) use getObject for STRING and COMPLEX. Previously, getObject was only used for STRING.

   2) NullableNumericAggregatorFactory (base class for simple numeric aggregators) has a new protected method "useGetObject". This allows the base class to correctly check for null (using getObject or isNull).

   The patch also adds a test for SpectatorHistogram + doubleSum + IncrementalIndex.

   Thanks @bsyk for the test, which I pulled from #16562.
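To illustrate the coercion path described in change (1): read the selector value as an `Object` and coerce it, instead of calling the removed numeric selector methods. The helper below is a standalone simplification with assumed semantics, not the actual aggregator code:

```java
public class NumericCoercionSketch
{
  // Hypothetical stand-in for the value returned by ObjectColumnSelector.getObject():
  // coerce whatever comes back (a Number, a numeric String, or a complex type
  // that exposes a numeric view) rather than calling removed numeric methods.
  static Double coerceToDouble(Object value)
  {
    if (value == null) {
      return null; // lets the caller apply its own null handling
    }
    if (value instanceof Number) {
      // A complex type could expose a numeric view by implementing Number
      // (an assumption made for this sketch).
      return ((Number) value).doubleValue();
    }
    if (value instanceof String) {
      return Double.parseDouble((String) value);
    }
    throw new IllegalArgumentException("Cannot coerce " + value.getClass() + " to double");
  }

  public static void main(String[] args)
  {
    System.out.println(coerceToDouble(3L));    // 3.0
    System.out.println(coerceToDouble("2.5")); // 2.5
    System.out.println(coerceToDouble(null));  // null
  }
}
```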
[PR] Add logs and metrics for append segments upgraded by a concurrent replace task (druid)
kfaraz opened a new pull request, #16563:
URL: https://github.com/apache/druid/pull/16563

   ### Changes
   - Add metrics `segment/upgraded/count` and `segment/upgradedRealtime/count` emitted with dimensions `taskId`, `dataSource`, `taskType`, etc.
   - Add new logs and clean up existing ones
   - Minor method renames
Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)
Akshat-Jain commented on code in PR #16481:
URL: https://github.com/apache/druid/pull/16481#discussion_r1628749775

## extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/RetryableS3OutputStream.java: ##
@@ -103,27 +94,28 @@ public class RetryableS3OutputStream extends OutputStream
   private boolean error;
   private boolean closed;

-  public RetryableS3OutputStream(
-      S3OutputConfig config,
-      ServerSideEncryptingAmazonS3 s3,
-      String s3Key
-  ) throws IOException
-  {
+  /**
+   * Helper class for calculating maximum number of simultaneous chunks allowed on local disk.
+   */
+  private final S3UploadManager uploadManager;

-    this(config, s3, s3Key, true);
-  }
+  /**
+   * A list of futures to allow us to wait for completion of all uploadPart() calls
+   * before hitting {@link ServerSideEncryptingAmazonS3#completeMultipartUpload(CompleteMultipartUploadRequest)}.
+   */
+  private final List<Future<UploadPartResult>> futures = new ArrayList<>();

Review Comment:
   Marking this resolved as per the discussion at https://github.com/apache/druid/pull/16481#discussion_r1628218203
Re: [PR] Fallback vectorization for FunctionExpr and BaseMacroFunctionExpr. (druid)
gianm merged PR #16366:
URL: https://github.com/apache/druid/pull/16366
Re: [PR] Fix serde for ArrayOfDoublesSketchConstantPostAggregator. (druid)
gianm merged PR #16550:
URL: https://github.com/apache/druid/pull/16550
Re: [I] Issue with PostAggregator arrayOfDoublesSketchConstant in latest Druid 29.0.1 (druid)
gianm closed issue #16539: Issue with PostAggregator arrayOfDoublesSketchConstant in latest Druid 29.0.1
URL: https://github.com/apache/druid/issues/16539
Re: [PR] Simplify serialized form of JsonInputFormat. (druid)
gianm merged PR #15691:
URL: https://github.com/apache/druid/pull/15691
Re: [PR] Column name in parse exceptions (druid)
TSFenwick commented on code in PR #16529:
URL: https://github.com/apache/druid/pull/16529#discussion_r1628629431

## processing/src/main/java/org/apache/druid/segment/DimensionHandlerUtils.java: ##
@@ -330,25 +329,67 @@ public static Long convertObjectToLong(@Nullable Object valObj, boolean reportPa
     } else if (valObj instanceof String) {
       Long ret = DimensionHandlerUtils.getExactLongFromDecimalString((String) valObj);
       if (reportParseExceptions && ret == null) {
-        throw new ParseException((String) valObj, "could not convert value [%s] to long", valObj);
+        final String message = (objectKey != null) ? StringUtils.nonStrictFormat(
+            "could not convert value [%s] to long for dimension [%s]",
+            valObj,
+            objectKey
+        ) : StringUtils.nonStrictFormat(
+            "could not convert value [%s] to long",
+            valObj
+        );
+        throw new ParseException((String) valObj, message);
       }
       return ret;
     } else if (valObj instanceof List) {
-      throw new ParseException(
-          valObj.getClass().toString(),
+      final String message = (objectKey != null) ? StringUtils.nonStrictFormat(
+          "Could not ingest value [%s] as long for dimension [%s]. A long column cannot have multiple values in the same row.",
+          valObj,
+          objectKey
+      ) : StringUtils.nonStrictFormat(
           "Could not ingest value %s as long. A long column cannot have multiple values in the same row.",

Review Comment:
   Should the list value format string have the `[]`'s around it? It would end up looking like [[1,2,3]]. It's confusing to me as a developer to know if it should have a square bracket or not depending on the type...
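To make the bracket concern concrete: a `List`'s `toString()` already includes square brackets, so wrapping the placeholder in `[%s]` doubles them for list values while remaining helpful for scalars. A small, runnable illustration (plain `String.format` is used here to stay self-contained; the actual code uses Druid's `StringUtils.nonStrictFormat`):

```java
import java.util.Arrays;
import java.util.List;

public class BracketFormattingDemo
{
  public static void main(String[] args)
  {
    List<Integer> listValue = Arrays.asList(1, 2, 3);

    // Scalar values are unambiguous with surrounding brackets.
    System.out.println(String.format("could not convert value [%s] to long", "abc"));
    // -> could not convert value [abc] to long

    // A List's toString() already includes brackets, so [%s] doubles them.
    System.out.println(String.format("Could not ingest value [%s] as long.", listValue));
    // -> Could not ingest value [[1, 2, 3]] as long.

    // Plain %s keeps a single set of brackets for list values.
    System.out.println(String.format("Could not ingest value %s as long.", listValue));
    // -> Could not ingest value [1, 2, 3] as long.
  }
}
```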
Re: [PR] Extension to read and ingest iceberg data files (druid)
a2l007 commented on PR #14329: URL: https://github.com/apache/druid/pull/14329#issuecomment-2151175460 @maytasm Yeah, we should definitely call that out in the docs and also print some warning/error messages. I'll make sure to include these in my next PR. Regarding filter pushdown, are you suggesting doing a filter pushdown to Druid in all cases, or only when the iceberg filter is on a non-partitioned column?
Re: [PR] Add interface method for returning canonical lookup name (druid)
abhishekrb19 commented on code in PR #16557: URL: https://github.com/apache/druid/pull/16557#discussion_r1628399933 ## processing/src/main/java/org/apache/druid/query/lookup/LookupExtractorFactoryContainerProvider.java: ## @@ -40,4 +40,9 @@ public interface LookupExtractorFactoryContainerProvider * Returns a lookup container for the provided lookupName, if it exists. */ Optional<LookupExtractorFactoryContainer> get(String lookupName); + + /** + * Returns the canonical lookup name from a lookup name. + */ + String getCanonicalLookupName(String lookupName); Review Comment: @LakshSingla, good points. I think we need to revise the Javadoc a bit to clarify the intent of the new method and/or rename accordingly. We could also call it `getLookupName(String name)`, but that's a bit ambiguous imo. Thoughts on the naming? As far as 1 is concerned, the interface isn't annotated `@PublicAPI` or `@ExtensionsPoint`. Also, this PR https://github.com/apache/druid/pull/9281/ changed the method signature and added a new interface method without default implementations. Following that pattern, I believe it's safe to add new methods without providing a default implementation. In general, my understanding is that for custom extensions that directly use/extend interfaces that are not public APIs (i.e., not annotated with `PublicAPI` or `ExtensionsPoint`), the onus is on the developers maintaining custom implementations to sync with changes coming from upstream. For 2, `LookupSchema` is powering the SQL view for the lookups configured in the system. For consistency, I think it'd make sense to also call the new function there as you pointed out. I'm okay with doing that as a follow-up too.
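To make the default-implementation discussion concrete, here is a minimal sketch of the interface with such a default added; the identity mapping is an assumption for illustration, not the merged behavior, and the container type is assumed to live in the same package:

```java
import java.util.Optional;
import java.util.Set;

public interface LookupExtractorFactoryContainerProvider
{
  Set<String> getAllLookupNames();

  /**
   * Returns a lookup container for the provided lookupName, if it exists.
   */
  Optional<LookupExtractorFactoryContainer> get(String lookupName);

  /**
   * Returns the canonical lookup name for a given name. Implementations that
   * support special name syntax can resolve it here.
   */
  default String getCanonicalLookupName(String lookupName)
  {
    // Assumption: names without special syntax are already canonical.
    return lookupName;
  }
}
```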
Re: [I] Druid query to get last N transactions from the db (druid)
github-actions[bot] commented on issue #14914: URL: https://github.com/apache/druid/issues/14914#issuecomment-2151156678 This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the d...@druid.apache.org list. Thank you for your contributions.
[PR] Add IncrementalIndex test (druid)
bsyk opened a new pull request, #16562: URL: https://github.com/apache/druid/pull/16562 Test that should reproduce the IncrementalIndex + ComplexColumn + longSum aggregator issue
[PR] More validation for Azure account config (druid)
amaechler opened a new pull request, #16561: URL: https://github.com/apache/druid/pull/16561 ### Description This PR marks two additional properties in the Azure deep storage config as required (`@NotNull`) (also see [the docs](https://druid.apache.org/docs/latest/development/extensions-core/azure/)): - `druid.azure.account` - `druid.azure.container` Both of these properties must be set. I also cleaned up the `AzureAccountConfig` class a bit (no more equality overrides; tests use JSON serialization instead to verify equality), grouped getters/setters together, and added some Javadocs to the `getBlobStorageEndpoint` helper function. This PR has: - [x] been self-reviewed. - [x] been tested in a test Druid cluster.
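As a rough sketch of what the newly required properties might look like in `AzureAccountConfig` (field and accessor shapes are assumptions based on the description above, not the exact diff):

```java
import com.fasterxml.jackson.annotation.JsonProperty;
import javax.validation.constraints.NotNull;

public class AzureAccountConfig
{
  @JsonProperty
  @NotNull
  private String account;    // maps to druid.azure.account, now required

  @JsonProperty
  @NotNull
  private String container;  // maps to druid.azure.container, now required

  public String getAccount()
  {
    return account;
  }

  public String getContainer()
  {
    return container;
  }
}
```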
Re: [PR] Router: Authorize permissionless internal requests. (druid)
gianm merged PR #16419: URL: https://github.com/apache/druid/pull/16419
Re: [PR] Fallback vectorization for FunctionExpr and BaseMacroFunctionExpr. (druid)
gianm commented on code in PR #16366: URL: https://github.com/apache/druid/pull/16366#discussion_r1628508359 ## processing/src/main/java/org/apache/druid/math/expr/vector/FallbackVectorProcessor.java: ## @@ -0,0 +1,422 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.druid.math.expr.vector; + +import org.apache.druid.error.DruidException; +import org.apache.druid.math.expr.Expr; +import org.apache.druid.math.expr.ExprEval; +import org.apache.druid.math.expr.ExprMacroTable; +import org.apache.druid.math.expr.ExprType; +import org.apache.druid.math.expr.ExpressionType; +import org.apache.druid.math.expr.Function; + +import javax.annotation.Nullable; +import java.util.ArrayList; +import java.util.List; +import java.util.Objects; +import java.util.function.Supplier; + +/** + * Implementation of {@link ExprVectorProcessor} that adapts non-vectorized {@link Expr#eval(Expr.ObjectBinding)}. + * This allows non-vectorized expressions to participate in vectorized queries. + */ +public abstract class FallbackVectorProcessor<T> implements ExprVectorProcessor<T> +{ + final Supplier<ExprEval<?>> fn; + final List<AdaptedExpr> adaptedArgs; + + private final ExpressionType outputType; + + private FallbackVectorProcessor( + final Supplier<ExprEval<?>> fn, + final List<AdaptedExpr> adaptedArgs, + final ExpressionType outputType + ) + { +this.fn = fn; +this.adaptedArgs = adaptedArgs; +this.outputType = outputType; + } + + /** + * Create a processor for a non-vectorizable {@link Function}. + */ + public static <T> FallbackVectorProcessor<T> create( + final Function function, + final List<Expr> args, + final Expr.VectorInputBindingInspector inspector + ) + { +final List<Expr> adaptedArgs = makeAdaptedArgs(args, inspector); +return makeFallbackProcessor( +() -> function.apply(adaptedArgs, UnusedBinding.INSTANCE), +adaptedArgs, +function.getOutputType(inspector, args), +inspector +); + } + + /** + * Create a processor for a non-vectorizable {@link ExprMacroTable.ExprMacro}. + */ + public static <T> FallbackVectorProcessor<T> create( + final ExprMacroTable.ExprMacro macro, + final List<Expr> args, + final Expr.VectorInputBindingInspector inspector + ) + { +final List<Expr> adaptedArgs = makeAdaptedArgs(args, inspector); +final Expr adaptedExpr = macro.apply(adaptedArgs); +return makeFallbackProcessor( +() -> adaptedExpr.eval(UnusedBinding.INSTANCE), +adaptedArgs, +adaptedExpr.getOutputType(inspector), +inspector +); + } + + /** + * Helper for the two {@link #create} methods. Makes {@link AdaptedExpr} that can replace the original args to + * the {@link Expr} that requires fallback.
+ * + * @param args args to the original expr + * @param inspector binding inspector + * + * @return list of {@link AdaptedExpr} + */ + private static List<Expr> makeAdaptedArgs( + final List<Expr> args, + final Expr.VectorInputBindingInspector inspector + ) + { +final List<Expr> adaptedArgs = new ArrayList<>(args.size()); + +for (final Expr arg : args) { + adaptedArgs.add(new AdaptedExpr(arg.asVectorProcessor(inspector), arg)); +} + +return adaptedArgs; + } + + /** + * Helper for the two {@link #create} methods. + * + * @param fn eval function that uses the "adaptedArgs" as inputs + * @param adaptedArgs adapted args from {@link #makeAdaptedArgs(List, Expr.VectorInputBindingInspector)} + * @param outputType output type of the eval from "fn" + * @param inspector binding inspector + */ + @SuppressWarnings({"unchecked", "rawtypes"}) + private static <T> FallbackVectorProcessor<T> makeFallbackProcessor( + final Supplier<ExprEval<?>> fn, + final List<Expr> adaptedArgs, + final ExpressionType outputType, + final Expr.VectorInputBindingInspector inspector + ) + { +if (outputType == null) { + throw DruidException.defensive("Plan has null outputType"); +} else if (outputType.equals(ExpressionType.LONG)) { + return (FallbackVectorProcessor<T>) new OfLong(fn, (List) adaptedArgs, outputType, inspector); +} else if (outputType.equals(ExpressionType.DOUBLE)) { + return
Re: [I] Problem with array, UNNEST and JSON_PARSE (druid)
gianm closed issue #16543: Problem with array, UNNEST and JSON_PARSE URL: https://github.com/apache/druid/issues/16543
Re: [PR] Fix capabilities reported by UnnestStorageAdapter. (druid)
gianm merged PR #16551: URL: https://github.com/apache/druid/pull/16551
[I] Query with JSON_QUERY/JSON_VALUE using parameters across a join fails to be planned (druid)
tls-applied opened a new issue, #16560: URL: https://github.com/apache/druid/issues/16560 ### Description Sample query: ```sql SELECT JSON_VALUE(a.data, b.path) FROM ( VALUES (PARSE_JSON('{"x": 1}')) ) a(data) CROSS JOIN ( VALUES ('$.x') ) b(path) ``` Relevant errors: ``` 2024-06-05T21:36:40,185 WARN [sql[67983b92-e41d-4933-be2e-a172b0f43148]] org.apache.druid.sql.calcite.planner.QueryHandler - Query not supported. Please check Broker logs for additional details. SQL was: SELECT JSON_VALUE(a.data, b.path) FROM ( VALUES (PARSE_JSON('{"x": 1}')) ) a(data) CROSS JOIN ( VALUES ('$.x') ) b(path) LIMIT 11 (org.apache.calcite.plan.RelOptPlanner$CannotPlanException: There are not enough rules to produce a node with desired properties: convention=DRUID, sort=[]. Missing conversion is LogicalSort[convention: NONE -> DRUID] There is 1 empty subset: rel#53574:Subset#5.DRUID.[], the relevant part of the original plan is as follows 53572:LogicalSort(fetch=[11]) 53570:LogicalProject(subset=[rel#53571:Subset#4.NONE.[]], EXPR$0=[CAST(JSON_VALUE_VARCHAR($0, $1)):VARCHAR(2000)]) 53568:LogicalJoin(subset=[rel#53569:Subset#3.NONE.[]], condition=[true], joinType=[inner]) 53565:LogicalProject(subset=[rel#53566:Subset#1.NONE.[]], EXPR$0=[PARSE_JSON('{"x": 1}')]) 53537:LogicalValues(subset=[rel#53564:Subset#0.NONE.[0]], tuples=[[{ 0 }]]) 53539:LogicalValues(subset=[rel#53567:Subset#2.NONE.[0]], tuples=[[{ '$.x' }]]) Root: rel#53574:Subset#5.DRUID.[] Original rel: LogicalSort(subset=[rel#53574:Subset#5.DRUID.[]], fetch=[11]): rowcount = 1.0, cumulative cost = {1.0 rows, 4.0 cpu, 0.0 io}, id = 53572 LogicalProject(subset=[rel#53571:Subset#4.NONE.[]], EXPR$0=[CAST(JSON_VALUE_VARCHAR($0, $1)):VARCHAR(2000)]): rowcount = 1.0, cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 53570 LogicalJoin(subset=[rel#53569:Subset#3.NONE.[]], condition=[true], joinType=[inner]): rowcount = 1.0, cumulative cost = {1.0 rows, 0.0 cpu, 0.0 io}, id = 53568 LogicalProject(subset=[rel#53566:Subset#1.NONE.[]], EXPR$0=[PARSE_JSON('{"x": 1}')]): rowcount = 1.0, cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 53565 LogicalValues(subset=[rel#53564:Subset#0.NONE.[0]], tuples=[[{ 0 }]]): rowcount = 1.0, cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 53537 LogicalValues(subset=[rel#53567:Subset#2.NONE.[0]], tuples=[[{ '$.x' }]]): rowcount = 1.0, cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 53539 ) ``` Is this expected to fail and is there a way to work around it? ### Affected Version 26.0.0
Re: [PR] Add interface method for returning canonical lookup name (druid)
abhishekrb19 commented on PR #16557: URL: https://github.com/apache/druid/pull/16557#issuecomment-2150998873 Merging this PR. The two suggestions from @LakshSingla can be addressed in a follow-up. Thanks for the fix, @Akshat-Jain!
Re: [PR] Add interface method for returning canonical lookup name (druid)
abhishekrb19 merged PR #16557: URL: https://github.com/apache/druid/pull/16557
Re: [PR] Only create container in `AzureStorage` for write operations (druid)
amaechler commented on code in PR #16558: URL: https://github.com/apache/druid/pull/16558#discussion_r1628461028 ## extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureStorage.java: ## @@ -56,255 +54,391 @@ */ public class AzureStorage { - // Default value from Azure library private static final int DELTA_BACKOFF_MS = 30_000; - // https://learn.microsoft.com/en-us/rest/api/storageservices/blob-batch?tabs=microsoft-entra-id + // https://learn.microsoft.com/en-us/rest/api/storageservices/blob-batch#request-body private static final Integer MAX_MULTI_OBJECT_DELETE_SIZE = 256; - - private static final Logger log = new Logger(AzureStorage.class); + private static final Logger LOG = new Logger(AzureStorage.class); private final AzureClientFactory azureClientFactory; private final String defaultStorageAccount; public AzureStorage( - AzureClientFactory azureClientFactory, - @Nullable String defaultStorageAccount + final AzureClientFactory azureClientFactory, + final String defaultStorageAccount ) { this.azureClientFactory = azureClientFactory; this.defaultStorageAccount = defaultStorageAccount; } - public List<String> emptyCloudBlobDirectory(final String containerName, final String virtualDirPath) + /** + * See {@link AzureStorage#emptyCloudBlobDirectory(String, String, Integer)} for details. + */ + public List<String> emptyCloudBlobDirectory(final String containerName, @Nullable final String prefix) Review Comment: Let me add a comment, that's a good idea. I found it easier to understand the code when there was a bit less indirection.
Re: [PR] Only create container in `AzureStorage` for write operations (druid)
amaechler commented on code in PR #16558: URL: https://github.com/apache/druid/pull/16558#discussion_r1628458947 ## extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureStorage.java: ## @@ -56,255 +54,391 @@ */ public class AzureStorage Review Comment: That's a valid question. - `getBlobContainerClient` does nothing yet. It simply creates a client that will be used for an API call later. - `createBlobContainerIfNotExists` first [calls](https://github.com/Azure/azure-sdk-for-java/blob/5648561dc792a993ec0fb4b041d76dc64971b9bb/sdk/storage/azure-storage-blob/src/main/java/com/azure/storage/blob/BlobServiceClient.java#L206-L211) `getBlobContainerClient`, and then right away executes an API call (`createIfNotExistsWithResponse`) for that client. In the current implementation, if you _configure_ a container that does not exist, the calls that I found call `azureStorage.getBlockBlobExists` first, which then simply returns `false` (called for streaming task files, or from the storage layer).
Re: [PR] Optimise indexer resource usage based on CPU profile (druid)
clintropolis commented on code in PR #16517: URL: https://github.com/apache/druid/pull/16517#discussion_r1628457954 ## processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java: ## @@ -132,12 +142,14 @@ public EncodedKeyComponent processRowValsToUnsortedEncodedKeyComponent(@N } else if (dimValues instanceof byte[]) { encodedDimensionValues = new int[]{dimLookup.add(emptyToNullIfNeeded(StringUtils.encodeBase64String((byte[]) dimValues)))}; + dictionaryChanged = true; } else { encodedDimensionValues = new int[]{dimLookup.add(emptyToNullIfNeeded(dimValues))}; + dictionaryChanged = true; } // If dictionary size has changed, the sorted lookup is no longer valid. -if (oldDictSize != dimLookup.size()) { +if (dictionaryChanged) { Review Comment: >so that if its able to recall an earlier value that's actually very important ...this change will destroy the sortedLookup cache even in case all calls to add were hits. The lookup is only sorted when persisting a segment, so `add` shouldn't really happen after the dictionary is sorted. I would guess this is here as a safeguard.
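To make the invalidation trade-off concrete, here is a toy, self-contained demo (not Druid code) contrasting the old size check with an unconditional flag when `add` is a hit:

```java
import java.util.HashMap;
import java.util.Map;

public class DictionaryInvalidationDemo
{
  // Toy value-to-id dictionary standing in for the dimension lookup.
  private static final Map<String, Integer> DICT = new HashMap<>();

  static int add(String value)
  {
    return DICT.computeIfAbsent(value, v -> DICT.size());
  }

  public static void main(String[] args)
  {
    add("a");                        // miss: dictionary grows
    final int sizeBefore = DICT.size();
    add("a");                        // hit: existing id returned, no growth

    final boolean invalidateBySizeCheck = DICT.size() != sizeBefore; // false: cache survives
    final boolean invalidateByFlag = true;                           // true: cache destroyed

    System.out.println(invalidateBySizeCheck + " vs " + invalidateByFlag);
  }
}
```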
Re: [PR] Only create container in `AzureStorage` for write operations (druid)
georgew5656 commented on code in PR #16558: URL: https://github.com/apache/druid/pull/16558#discussion_r1628455989 ## extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureStorage.java: ## @@ -56,255 +54,391 @@ */ public class AzureStorage Review Comment: Would it be possible to log a warning somewhere if we're using Azure as deep storage and the container doesn't exist? Where "using Azure as deep storage" = `getAzureStorageContainer` is called (I don't think `getAzureStorageContainer` would be called if we were just loading the Azure extension and using it to ingest data from Azure). Or maybe even try to create the container at this point instead?
Re: [PR] Only create container in `AzureStorage` for write operations (druid)
georgew5656 commented on PR #16558: URL: https://github.com/apache/druid/pull/16558#issuecomment-2150976002 I think the reason the Azure extension supports creating its own container is that the concept of a container does not have a 1:1 analogue in S3 or Google storage. For example, the extension requires that the storage account exist (where a storage account is a direct analogue to an S3 bucket).
Re: [PR] Only create container in `AzureStorage` for write operations (druid)
georgew5656 commented on code in PR #16558: URL: https://github.com/apache/druid/pull/16558#discussion_r1628442038 ## extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureStorage.java: ## @@ -56,255 +54,391 @@ */ public class AzureStorage { - // Default value from Azure library private static final int DELTA_BACKOFF_MS = 30_000; - // https://learn.microsoft.com/en-us/rest/api/storageservices/blob-batch?tabs=microsoft-entra-id + // https://learn.microsoft.com/en-us/rest/api/storageservices/blob-batch#request-body private static final Integer MAX_MULTI_OBJECT_DELETE_SIZE = 256; - - private static final Logger log = new Logger(AzureStorage.class); + private static final Logger LOG = new Logger(AzureStorage.class); private final AzureClientFactory azureClientFactory; private final String defaultStorageAccount; public AzureStorage( - AzureClientFactory azureClientFactory, - @Nullable String defaultStorageAccount + final AzureClientFactory azureClientFactory, + final String defaultStorageAccount ) { this.azureClientFactory = azureClientFactory; this.defaultStorageAccount = defaultStorageAccount; } - public List<String> emptyCloudBlobDirectory(final String containerName, final String virtualDirPath) + /** + * See {@link AzureStorage#emptyCloudBlobDirectory(String, String, Integer)} for details. + */ + public List<String> emptyCloudBlobDirectory(final String containerName, @Nullable final String prefix) Review Comment: I feel like it might be helpful to have two helper functions here (`getContainerClient`, `getOrCreateContainerClient`) and comment on them what they are used for (read-only or r/w), for anyone who's implementing a new API call in this class to figure out which one to use. Alternatively, maybe just a comment at the top of the class?
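A sketch of the helper pair the comment proposes; the method names follow the review, and the bodies are assumptions about how `AzureStorage` might route read versus write calls using the SDK methods discussed in this thread:

```java
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClient;

class ContainerClients
{
  private final BlobServiceClient blobServiceClient;

  ContainerClients(BlobServiceClient blobServiceClient)
  {
    this.blobServiceClient = blobServiceClient;
  }

  /** For read-only operations: never issues a create call, so read/list tokens suffice. */
  BlobContainerClient getContainerClient(String containerName)
  {
    return blobServiceClient.getBlobContainerClient(containerName);
  }

  /** For write operations (segment push, task logs): creates the container on first use. */
  BlobContainerClient getOrCreateContainerClient(String containerName)
  {
    return blobServiceClient.createBlobContainerIfNotExists(containerName);
  }
}
```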
Re: [PR] Only create container in `AzureStorage` for write operations (druid)
georgew5656 commented on code in PR #16558: URL: https://github.com/apache/druid/pull/16558#discussion_r1628439982 ## extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureStorage.java: ## @@ -56,255 +54,391 @@ */ public class AzureStorage Review Comment: Thinking about it a bit more, the first write operation is always triggered by a task, either an MSQ task pushing intermediate segments, or the middleManager/overlord pushing task logs. This seems OK to me; the only class of things I can think of failing is, e.g., the task logs cleaner. But if that throws an exception it would be fine, since no tasks would have been run.
Re: [PR] Only create container in `AzureStorage` for write operations (druid)
georgew5656 commented on code in PR #16558: URL: https://github.com/apache/druid/pull/16558#discussion_r1628430378 ## extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureStorage.java: ## @@ -56,255 +54,391 @@ */ public class AzureStorage Review Comment: To be specific, what does `getBlobContainerClient` do if the container doesn't exist?
Re: [PR] Only create container in `AzureStorage` for write operations (druid)
georgew5656 commented on code in PR #16558: URL: https://github.com/apache/druid/pull/16558#discussion_r1628428060 ## extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureStorage.java: ## @@ -56,255 +54,391 @@ */ public class AzureStorage Review Comment: Is it possible for a cluster using Azure as deep storage to first make a read-only call to Azure (therefore not creating the container) and for the function to throw an exception instead of creating the container and returning an empty list or something? E.g., if I start a new cluster in Azure with no container, what's the first API call that gets made, and what happens if it's a read-only call? I think I would be more concerned with error handling; if we return a good exception that says "the container doesn't exist" or something like this, then the operator could just create the container at that point.
[PR] Disable event hubs when kafka extensions isn't loaded (druid)
amaechler opened a new pull request, #16559: URL: https://github.com/apache/druid/pull/16559 Fixes #10342. ### Description This PR disables the `Azure Event Hubs` tile in the web-console if the kafka extension isn't loaded. - [x] been self-reviewed.
Re: [PR] Only create container in `AzureStorage` for write operations (druid)
georgew5656 commented on code in PR #16558: URL: https://github.com/apache/druid/pull/16558#discussion_r1628400526 ## extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureDataSegmentKiller.java: ## @@ -43,7 +43,6 @@ public class AzureDataSegmentKiller implements DataSegmentKiller { private static final Logger log = new Logger(AzureDataSegmentKiller.class); - private static final Integer MAX_MULTI_OBJECT_DELETE_SIZE = 256; // https://learn.microsoft.com/en-us/rest/api/storageservices/blob-batch?tabs=microsoft-entra-id Review Comment: It looks like this logic was moved to AzureStorage and this was never cleaned up.
Re: [PR] Only create container in `AzureStorage` for write operations (druid)
amaechler commented on code in PR #16558: URL: https://github.com/apache/druid/pull/16558#discussion_r1628375979 ## extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/output/AzureStorageConnector.java: ## @@ -196,7 +196,7 @@ public Iterator<String> listDir(String dirName) throws IOException final String prefixBasePath = objectPath(dirName); List<String> paths; try { - paths = azureStorage.listDir(config.getContainer(), prefixBasePath, config.getMaxRetry()); + paths = azureStorage.listBlobs(config.getContainer(), prefixBasePath, null, config.getMaxRetry()); Review Comment: Renamed `listDir` to use more specific Azure terminology inside the `Storage` wrapper. ## extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureUtils.java: ## @@ -37,8 +37,6 @@ */ public class AzureUtils { - - public static final String DEFAULT_AZURE_ENDPOINT_SUFFIX = "core.windows.net"; Review Comment: Unused. ## extensions-core/azure-extensions/src/main/java/org/apache/druid/storage/azure/AzureDataSegmentKiller.java: ## @@ -43,7 +43,6 @@ public class AzureDataSegmentKiller implements DataSegmentKiller { private static final Logger log = new Logger(AzureDataSegmentKiller.class); - private static final Integer MAX_MULTI_OBJECT_DELETE_SIZE = 256; // https://learn.microsoft.com/en-us/rest/api/storageservices/blob-batch?tabs=microsoft-entra-id Review Comment: This was not used.
[PR] Only create container in `AzureStorage` for write operations (druid)
amaechler opened a new pull request, #16558: URL: https://github.com/apache/druid/pull/16558 - **Remove unused constants** - **Refactor getBlockBlobLength** - **Better link** - **Upper-case log** - **Mark defaultStorageAccount nullable** - **Do not always create a new container if it doesn't exist** - **Add lots of comments, group methods** - **Revert "Mark defaultStorageAccount nullable"** ### Description When ingesting data from Azure blob storage, you can currently only do so if your SAS token has `write` permissions, even though these are not required to list and read blob objects. The source of this bug lies in the fact that all calls in `AzureStorage` get a container client by calling `createBlobContainerIfNotExists`. This SDK method actually performs an API call that needs `write` permissions. Bugfix To fix this issue, I updated the calls getting container clients in `AzureStorage` to only use `createBlobContainerIfNotExists` for write operations, where it makes sense. Specifically, `getBlockBlobOutputStream` and `uploadBlockBlob` will continue to create a container if it doesn't exist. This is useful when using Azure blob storage for deep storage and/or task logs. > That said, this behaviour is somewhat unique to the Azure extension. In S3 for example, I believe the bucket needs to exist. Maybe it would make sense to deprecate this behaviour in the future to be more consistent across Druid, and require the container to exist. All the other calls (read blob storage, check if blob exists, list blobs in the container, delete blobs) will require the container to exist. I think this makes sense given the nature of these calls. Other changes I also cleaned up the `AzureStorage` class so it can be more easily maintained in the future. Specifically, I - used Azure terminology throughout the class (e.g. `prefix` vs `virtualDirPath`) - added `@Nullable` annotations where parameters are optional - added a whole bunch of Javadoc # Key changed/added classes in this PR * `AzureStorage` This PR has: - [ ] been self-reviewed. - [x] added documentation for new or modified features or behaviors. - [ ] a release note entry in the PR description. - [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links. - [x] been tested in a test Druid cluster.
Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)
kfaraz commented on code in PR #16481: URL: https://github.com/apache/druid/pull/16481#discussion_r1628226834 ## extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/S3UploadManager.java: ## @@ -0,0 +1,154 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.druid.storage.s3.output; + +import com.amazonaws.services.s3.model.UploadPartRequest; +import com.amazonaws.services.s3.model.UploadPartResult; +import com.google.common.annotations.VisibleForTesting; +import com.google.inject.Inject; +import org.apache.druid.guice.ManageLifecycle; +import org.apache.druid.java.util.common.RetryUtils; +import org.apache.druid.java.util.common.concurrent.Execs; +import org.apache.druid.java.util.common.lifecycle.LifecycleStart; +import org.apache.druid.java.util.common.lifecycle.LifecycleStop; +import org.apache.druid.java.util.common.logger.Logger; +import org.apache.druid.storage.s3.S3Utils; +import org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3; +import org.apache.druid.utils.RuntimeInfo; + +import java.io.File; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Future; + +/** + * This class manages uploading files to S3 in chunks, while ensuring that the + * number of chunks currently present on local disk does not exceed a specific limit. + */ +@ManageLifecycle +public class S3UploadManager +{ + private final ExecutorService uploadExecutor; + + private static final Logger log = new Logger(S3UploadManager.class); + + @Inject + public S3UploadManager(S3OutputConfig s3OutputConfig, S3ExportConfig s3ExportConfig, RuntimeInfo runtimeInfo) + { +int poolSize = Math.max(4, runtimeInfo.getAvailableProcessors()); +int maxNumConcurrentChunks = computeMaxNumConcurrentChunks(s3OutputConfig, s3ExportConfig); +this.uploadExecutor = createExecutorService(poolSize, maxNumConcurrentChunks); +log.info("Initialized executor service for S3 multipart upload with pool size [%d] and work queue capacity [%d]", + poolSize, maxNumConcurrentChunks); + } + + /** + * Computes the maximum number of concurrent chunks for an S3 multipart upload. + * We want to determine the maximum number of concurrent chunks on disk based on the maximum value of chunkSize + * between the 2 configs: S3OutputConfig and S3ExportConfig. + * + * @param s3OutputConfig The S3 output configuration, which may specify a custom chunk size. + * @param s3ExportConfig The S3 export configuration, which may also specify a custom chunk size. + * @return The maximum number of concurrent chunks. 
+ */ + @VisibleForTesting + int computeMaxNumConcurrentChunks(S3OutputConfig s3OutputConfig, S3ExportConfig s3ExportConfig) + { +long chunkSize = S3OutputConfig.S3_MULTIPART_UPLOAD_MIN_PART_SIZE_BYTES; +if (s3OutputConfig != null && s3OutputConfig.getChunkSize() != null) { + chunkSize = Math.max(chunkSize, s3OutputConfig.getChunkSize()); +} +if (s3ExportConfig != null && s3ExportConfig.getChunkSize() != null) { + chunkSize = Math.max(chunkSize, s3ExportConfig.getChunkSize().getBytes()); +} + +return (int) (S3OutputConfig.S3_MULTIPART_UPLOAD_MAX_PART_SIZE_BYTES / chunkSize); Review Comment: 5GB doesn't seem like a bad default but limiting it to 5GB even when we might have much more disk available seems weird. Also, if 5GB is the limit we want to use, we should define a separate constant rather than reuse the constant for max size. > we should use the overall disk size available for this calculation Just to clarify, I have not suggested using the entire disk space for storing chunks, rather the portion actually allocated for this purpose. That allocation may either be a hard-coded value, a fixed percentage of the overall disk space or even a config. In any case, that amount of disk space should be passed as an argument to the `computeMaxNumChunksOnDisk` method.
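As a worked example of the computation being debated (the 5 GiB constant is assumed to mirror S3's maximum multipart part size; the 100 MiB chunk size is hypothetical):

```java
public class ChunkBudgetDemo
{
  public static void main(String[] args)
  {
    final long maxPartSizeBytes = 5L * 1024 * 1024 * 1024; // 5 GiB, the constant under discussion
    final long chunkSizeBytes = 100L * 1024 * 1024;        // hypothetical configured chunk size

    // Mirrors computeMaxNumConcurrentChunks: how many chunks may sit on disk at once.
    final int maxNumConcurrentChunks = (int) (maxPartSizeBytes / chunkSizeBytes);
    System.out.println(maxNumConcurrentChunks);            // prints 51
  }
}
```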
Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)
kfaraz commented on code in PR #16481: URL: https://github.com/apache/druid/pull/16481#discussion_r1628218203 ## extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/RetryableS3OutputStream.java: ## @@ -269,52 +213,74 @@ public void close() throws IOException // Closeables are closed in LIFO order closer.register(() -> { // This should be emitted as a metric + long totalChunkSize = (currentChunk.id - 1) * chunkSize + currentChunk.length(); Review Comment: It's fine for now. We can worry about it when we really need to do something with the chunk. For now, let's just stick with the list.
Re: [PR] Docs: Remove circular link (druid)
317brian merged PR #16553: URL: https://github.com/apache/druid/pull/16553
Re: [PR] Add interface method for returning canonical lookup name (druid)
Akshat-Jain commented on code in PR #16557: URL: https://github.com/apache/druid/pull/16557#discussion_r1628194891 ## processing/src/main/java/org/apache/druid/query/lookup/LookupExtractorFactoryContainerProvider.java: ## @@ -40,4 +40,9 @@ public interface LookupExtractorFactoryContainerProvider * Returns a lookup container for the provided lookupName, if it exists. */ Optional<LookupExtractorFactoryContainer> get(String lookupName); + + /** + * Returns the canonical lookup name from a lookup name. + */ + String getCanonicalLookupName(String lookupName); Review Comment: @LakshSingla I had a discussion with @abhishekrb19 about the naming of this method. We settled on `getCanonicalLookupName()`. Regarding the other comments: 1. I don't see the other methods marked as `PublicApi` either, so this seems like an issue unrelated to this PR. Thoughts? 2. This would be used in a custom extension. I would raise a future PR once this merges, where it would be clearer. But in general, if in the future we add lookup names with special syntax, this would be needed.
Re: [PR] Add interface method for returning canonical lookup name (druid)
LakshSingla commented on code in PR #16557: URL: https://github.com/apache/druid/pull/16557#discussion_r1628189672 ## processing/src/main/java/org/apache/druid/query/lookup/LookupExtractorFactoryContainerProvider.java: ## @@ -40,4 +40,9 @@ public interface LookupExtractorFactoryContainerProvider * Returns a lookup container for the provided lookupName, if it exists. */ Optional<LookupExtractorFactoryContainer> get(String lookupName); + + /** + * Returns the canonical lookup name from a lookup name. + */ + String getCanonicalLookupName(String lookupName); Review Comment: Naming nit: 1. This could be `getLookupName(String name)`. Canonical is confusing, and isn't defined in the javadoc here. See point (2) below as well. Other comments: 1. This method isn't marked as `PublicApi`; however, custom extensions can extend it for their own lookup implementations. Please add a default implementation for the same. 2. When should this method be called? As of now, this is a super specialized method that is called in `QueryLookupOperatorConversion`. Perhaps it should be called out and named as such. As a developer, it is unclear to me when I must call it and when not. For example, why am I not calling it on https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/schema/LookupSchema.java#L61?
Re: [PR] Add interface method for returning canonical lookup name (druid)
abhishekrb19 commented on code in PR #16557: URL: https://github.com/apache/druid/pull/16557#discussion_r1628157297 ## processing/src/main/java/org/apache/druid/query/lookup/LookupExtractorFactoryContainerProvider.java: ## @@ -40,4 +40,9 @@ public interface LookupExtractorFactoryContainerProvider * Returns a lookup container for the provided lookupName, if it exists. */ Optional<LookupExtractorFactoryContainer> get(String lookupName); + + /** + * Returns the canonical lookup name from a lookup name. Review Comment: ```suggestion * Returns the canonical lookup name from a given lookup name, if special syntax exists. ```
[PR] Add interface method for returning canonical lookup name (druid)
Akshat-Jain opened a new pull request, #16557: URL: https://github.com/apache/druid/pull/16557 ### Description This PR adds support for returning the canonical lookup name in the `LookupExtractorFactoryContainerProvider` interface. This would allow implementations to return a different canonical name for the lookup if needed. This PR has: - [x] been self-reviewed. - [ ] added documentation for new or modified features or behaviors. - [ ] a release note entry in the PR description. - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links. - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md) - [ ] added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader. - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met. - [ ] added integration tests. - [ ] been tested in a test Druid cluster.
Re: [PR] Do not pass centralized schema config to sub-tasks of index_parallel to ensure backwards compatibility (druid)
kfaraz commented on code in PR #16556: URL: https://github.com/apache/druid/pull/16556#discussion_r1628120211 ## indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/PartialGenericSegmentMergeTask.java: ## @@ -67,7 +67,7 @@ public PartialGenericSegmentMergeTask( @JsonProperty("numAttempts") final int numAttempts, // zero-based counting @JsonProperty("spec") final PartialSegmentMergeIngestionSpec ingestionSchema, @JsonProperty("context") final Map context, - @JsonProperty("centralizedDatasourceSchemaConfig") CentralizedDatasourceSchemaConfig centralizedDatasourceSchemaConfig, + @JsonProperty("centralizedDatasourceSchemaConfig") @Nullable CentralizedDatasourceSchemaConfig centralizedDatasourceSchemaConfig, Review Comment: All other usages of `CentralizedDatasourceSchemaConfig` are injected ones which are bound using the config path `druid.centralizedDatasourceSchema`. So they should all be non-null. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
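A sketch of the defensive null handling implied by this discussion. The assumption here (mine, not stated in the PR) is that only the JSON-deserialized constructor path can see a null, for example when an older supervisor serializes a task spec without the field during a rolling upgrade; `SchemaConfig` and its `disabled()` factory are hypothetical stand-ins for `CentralizedDatasourceSchemaConfig`:

```java
// Hypothetical stand-in for CentralizedDatasourceSchemaConfig; the real class
// and its factory methods may differ.
final class SchemaConfig
{
  private final boolean enabled;

  private SchemaConfig(boolean enabled)
  {
    this.enabled = enabled;
  }

  static SchemaConfig disabled()
  {
    return new SchemaConfig(false);
  }

  boolean isEnabled()
  {
    return enabled;
  }
}

class MergeTaskSketch
{
  private final SchemaConfig schemaConfig;

  // The parameter is @Nullable in the real constructor: Guice-injected
  // instances are bound from druid.centralizedDatasourceSchema and are never
  // null, but deserializing a task spec written by an older process can pass
  // null, so default it to a disabled config.
  MergeTaskSketch(SchemaConfig schemaConfig)
  {
    this.schemaConfig = schemaConfig == null ? SchemaConfig.disabled() : schemaConfig;
  }

  boolean isSchemaPublishEnabled()
  {
    return schemaConfig.isEnabled();
  }
}
```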
Re: [PR] Handle null values of centralized schema config in PartialMergeTask for backwards compatibility (druid)
cryptoe commented on code in PR #16556: URL: https://github.com/apache/druid/pull/16556#discussion_r1628076745 ## indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/PartialGenericSegmentMergeTask.java: ## @@ -67,7 +67,7 @@ public PartialGenericSegmentMergeTask( @JsonProperty("numAttempts") final int numAttempts, // zero-based counting @JsonProperty("spec") final PartialSegmentMergeIngestionSpec ingestionSchema, @JsonProperty("context") final Map context, - @JsonProperty("centralizedDatasourceSchemaConfig") CentralizedDatasourceSchemaConfig centralizedDatasourceSchemaConfig, + @JsonProperty("centralizedDatasourceSchemaConfig") @Nullable CentralizedDatasourceSchemaConfig centralizedDatasourceSchemaConfig, Review Comment: Can we do a quick grep for CentralizedDatasourceSchemaConfig and see if we have nullables everywhere? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
Re: [I] [DRAFT] 30.0.0 release notes (druid)
frankgrimes97 commented on issue #16505: URL: https://github.com/apache/druid/issues/16505#issuecomment-2150427461 @adarshsanjeev Could the following in-progress bugfix https://github.com/apache/druid/pull/16550 be considered for `30.0.0`? Not sure where you guys are in terms of release timeline/cadence, but we'd ideally like to test/verify it on perhaps `30.0.0-rc3` instead of waiting for it to land in `31.0.0` in August/September. We'd also be open to having it considered for a possible `29.0.2` release. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
[PR] Handle null values of centralized schema config in PartialMergeTask for backwards compatibility (druid)
kfaraz opened a new pull request, #16556: URL: https://github.com/apache/druid/pull/16556 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
Re: [I] Issue with PostAggregator arrayOfDoublesSketchConstant in latest Druid 29.0.1 (druid)
frankgrimes97 commented on issue #16539: URL: https://github.com/apache/druid/issues/16539#issuecomment-2150405429 @cryptoe Any chance a `29.0.2` or even a `30.0.0-rc3` release could be cut with this fix (once approved/CI tested/merged to `master` on PR-16550)? Choosing between waiting until a September `31.0.0` release and forking/building our own distribution image for a bugfix feels a bit painful. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
[PR] Fix incorrect comparator usage in FinalizingFieldAccessPostAggregator (druid)
kgyrtkirk opened a new pull request, #16555: URL: https://github.com/apache/druid/pull/16555 * Fixes #16554 by using the `TypeStrategy` instead of the `Aggregator`'s comparator. * removed the fallback case when the column was not found - I don't think that should be allowed to happen... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
[I] Incorrect comparator usage in FinalizingPostAggregators (druid)
kgyrtkirk opened a new issue, #16554: URL: https://github.com/apache/druid/issues/16554 Queries like:
```
[example query elided]
```
For aggregators which have a non-trivial finalization, i.e. `AggregatorFactory#getResultType() != AggregatorFactory#getIntermediateType()` (so essentially those which will most likely need some work to be done), the comparator is derived from the underlying `aggregator` for the `postAggregator` [here](https://github.com/apache/druid/blob/30c59042e0cde3c76392afa3ba73d0830b11b0a5/processing/src/main/java/org/apache/druid/query/aggregation/post/FinalizingFieldAccessPostAggregator.java#L124), so a comparison of finalized values might be attempted with the aggregator's comparator. This could cause an exception in case the postAggregator's output is being used to order the rows by it. This causes a ClassCastException:
```
java.lang.ClassCastException: class java.lang.Float cannot be cast to class org.apache.druid.collections.SerializablePair (java.lang.Float is in module java.base of loader 'bootstrap'; org.apache.druid.collections.SerializablePair is in unnamed module of loader 'app')
  at java.base/java.util.Collections$ReverseComparator2.compare(Collections.java:5278)
  at java.base/java.util.Comparators$NullComparator.compare(Comparators.java:83)
  at org.apache.druid.query.topn.TopNNumericResultBuilder.lambda$0(TopNNumericResultBuilder.java:75)
  at java.base/java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:675)
  at java.base/java.util.PriorityQueue.siftUp(PriorityQueue.java:652)
  at java.base/java.util.PriorityQueue.offer(PriorityQueue.java:345)
  at java.base/java.util.PriorityQueue.add(PriorityQueue.java:326)
  at org.apache.druid.query.topn.TopNNumericResultBuilder.addEntry(TopNNumericResultBuilder.java:157)
  at org.apache.druid.query.topn.TopNNumericResultBuilder.addEntry(TopNNumericResultBuilder.java:1)
  at org.apache.druid.query.topn.PooledTopNAlgorithm.updateResults(PooledTopNAlgorithm.java:765)
  at org.apache.druid.query.topn.PooledTopNAlgorithm.updateResults(PooledTopNAlgorithm.java:1)
  at org.apache.druid.query.topn.BaseTopNAlgorithm.runWithCardinalityKnown(BaseTopNAlgorithm.java:121)
  at org.apache.druid.query.topn.BaseTopNAlgorithm.run(BaseTopNAlgorithm.java:82)
  at org.apache.druid.query.topn.TopNMapFn.apply(TopNMapFn.java:70)
  at org.apache.druid.query.topn.TopNQueryEngine.lambda$0(TopNQueryEngine.java:100)
  at org.apache.druid.java.util.common.guava.MappingAccumulator.accumulate(MappingAccumulator.java:40)
  at org.apache.druid.java.util.common.guava.FilteringAccumulator.accumulate(FilteringAccumulator.java:41)
  at org.apache.druid.java.util.common.guava.MappingAccumulator.accumulate(MappingAccumulator.java:40)
  at org.apache.druid.java.util.common.guava.BaseSequence.accumulate(BaseSequence.java:44)
  at org.apache.druid.java.util.common.guava.MappedSequence.accumulate(MappedSequence.java:43)
  at org.apache.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50)
  at org.apache.druid.java.util.common.guava.SequenceWrapper.wrap(SequenceWrapper.java:55)
  at org.apache.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45)
  at org.apache.druid.java.util.common.guava.FilteredSequence.accumulate(FilteredSequence.java:45)
  at org.apache.druid.java.util.common.guava.MappedSequence.accumulate(MappedSequence.java:43)
  at org.apache.druid.java.util.common.guava.FilteredSequence.accumulate(FilteredSequence.java:45)
  at org.apache.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50)
  at org.apache.druid.java.util.common.guava.SequenceWrapper.wrap(SequenceWrapper.java:55)
  at org.apache.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45)
  at org.apache.druid.query.spec.SpecificSegmentQueryRunner$1.accumulate(SpecificSegmentQueryRunner.java:98)
  at org.apache.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50)
  at org.apache.druid.query.spec.SpecificSegmentQueryRunner.doNamed(SpecificSegmentQueryRunner.java:185)
  at org.apache.druid.query.spec.SpecificSegmentQueryRunner.access$1(SpecificSegmentQueryRunner.java:181)
  at org.apache.druid.query.spec.SpecificSegmentQueryRunner$2.wrap(SpecificSegmentQueryRunner.java:165)
  at org.apache.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45)
  at org.apache.druid.java.util.common.guava.Sequence.toList(Sequence.java:87)
  at org.apache.druid.query.ChainedExecutionQueryRunner$1$3.call(ChainedExecutionQueryRunner.java:112)
  at org.apache.druid.query.ChainedExecutionQueryRunner$1$3.call(ChainedExecutionQueryRunner.java:1)
  at
```
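For illustration, a self-contained sketch of the failure mode described above (types simplified; this is not Druid code): a comparator written against the aggregator's intermediate pair type is handed finalized values, which are plain Floats, and the cast throws the same shape of exception as in the trace.

```java
import java.util.Comparator;

public class FinalizedComparatorMismatch
{
  // Simplified stand-in for org.apache.druid.collections.SerializablePair.
  static final class Pair
  {
    final long lhs;
    final float rhs;

    Pair(long lhs, float rhs)
    {
      this.lhs = lhs;
      this.rhs = rhs;
    }
  }

  public static void main(String[] args)
  {
    // The aggregator's comparator: valid only for intermediate (Pair) values.
    final Comparator<Object> intermediateComparator =
        (a, b) -> Float.compare(((Pair) a).rhs, ((Pair) b).rhs);

    // After finalization the values are plain Floats, not Pairs.
    final Object finalizedA = 0.25f;
    final Object finalizedB = 0.75f;

    // Throws java.lang.ClassCastException: class java.lang.Float cannot be
    // cast to class Pair, mirroring the failure reported in the issue.
    intermediateComparator.compare(finalizedA, finalizedB);
  }
}
```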
Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)
Akshat-Jain commented on code in PR #16481: URL: https://github.com/apache/druid/pull/16481#discussion_r1627923604 ## extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/RetryableS3OutputStream.java: ## @@ -269,52 +213,74 @@ public void close() throws IOException // Closeables are closed in LIFO order closer.register(() -> { // This should be emitted as a metric + long totalChunkSize = (currentChunk.id - 1) * chunkSize + currentChunk.length(); Review Comment: > To me, having the map seems cleaner as we know which future belongs to which chunk @kfaraz The returned future (if successful) contains an `UploadPartResult`, which includes the `partNumber`, i.e. the chunk ID, if that helps. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
[PR] Docs: Remove circular link (druid)
ektravel opened a new pull request, #16553: URL: https://github.com/apache/druid/pull/16553 Remove circular link. This PR has: - [x] been self-reviewed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)
kfaraz commented on code in PR #16481: URL: https://github.com/apache/druid/pull/16481#discussion_r1627919294 ## extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/RetryableS3OutputStream.java: ## @@ -269,52 +213,74 @@ public void close() throws IOException // Closeables are closed in LIFO order closer.register(() -> { // This should be emitted as a metric + long totalChunkSize = (currentChunk.id - 1) * chunkSize + currentChunk.length(); Review Comment: > It allows us to not have a Map unnecessarily To me, having the map seems cleaner as we know which future belongs to which chunk. I don't see how it would affect the rest of the code as we just need to put stuff in the map and then iterate over it in `completeMultipartUpload()` and then while determining the size (which could even just be returned by the `completeMultipartUpload()` method). That said, I don't really mind the current approach either. I guess the map is more of a personal preference in this case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
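A sketch of the map-based bookkeeping described above, with illustrative names (not the PR's actual fields): each chunk id maps to its upload future and its actual length, and the exact total size is derived while draining the map, with no arithmetic on a fixed chunkSize.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

class ChunkUploadTracker
{
  // chunk id -> future completing when that chunk's part upload finishes
  private final Map<Integer, Future<?>> uploadFutures = new ConcurrentHashMap<>();
  // chunk id -> actual bytes written for that chunk (the last chunk may be short)
  private final Map<Integer, Long> chunkLengths = new ConcurrentHashMap<>();

  void track(int chunkId, long chunkLength, Future<?> uploadFuture)
  {
    chunkLengths.put(chunkId, chunkLength);
    uploadFutures.put(chunkId, uploadFuture);
  }

  // Called from the equivalent of completeMultipartUpload(): waits for every
  // chunk and returns the exact total size, failing fast on any chunk error.
  long awaitAllAndComputeTotalSize() throws ExecutionException, InterruptedException
  {
    long totalBytes = 0;
    for (Map.Entry<Integer, Future<?>> entry : uploadFutures.entrySet()) {
      entry.getValue().get();
      totalBytes += chunkLengths.get(entry.getKey());
    }
    return totalBytes;
  }
}
```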
Re: [PR] Optimise indexer resource usage based on CPU profile (druid)
kgyrtkirk commented on code in PR #16517: URL: https://github.com/apache/druid/pull/16517#discussion_r1627535702 ## processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java: ## @@ -132,12 +142,14 @@ public EncodedKeyComponent processRowValsToUnsortedEncodedKeyComponent(@N } else if (dimValues instanceof byte[]) { encodedDimensionValues = new int[]{dimLookup.add(emptyToNullIfNeeded(StringUtils.encodeBase64String((byte[]) dimValues)))}; + dictionaryChanged = true; } else { encodedDimensionValues = new int[]{dimLookup.add(emptyToNullIfNeeded(dimValues))}; + dictionaryChanged = true; } // If dictionary size has changed, the sorted lookup is no longer valid. -if (oldDictSize != dimLookup.size()) { +if (dictionaryChanged) { Review Comment: I believe the main purpose of the `dimLookup` field is to provide a common place where the different strings get an assigned number, so being able to recall an earlier value is actually very important. This change will destroy the `sortedLookup` cache even in the case where all calls to `add` were hits. I think you should either remove the changes in this file or find a different way to address this; one idea could be to connect the `StringDimensionIndexer` class as a modification listener of `dimLookup`, so that when a real add happens the `sortedLookup` is automatically invalidated: * move the sorted dictionary caching to `StringDimensionDictionary` * extract a protected `realAdd` from the current `add` in `DimensionDictionary` * implement `realAdd` to clear the cached `sorted` stuff in `StringDimensionDictionary` Again, let me invite @clintropolis to see if he has a different idea :) ## processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java: ## @@ -93,14 +95,26 @@ public EncodedKeyComponent processRowValsToUnsortedEncodedKeyComponent(@N if (dimValues == null) { final int nullId = dimLookup.getId(null); Review Comment: you don't need the `if` if you use `dimLookup.add(null)` instead of `getId` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
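A compilable sketch of the refactor floated above, with simplified types (Druid's real `DimensionDictionary` is more involved): `add()` delegates to a protected `realAdd()` only on a true miss, and the string subclass invalidates its cached sorted lookup there, so pure cache hits never destroy it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

abstract class DictionarySketch<T>
{
  private final Map<T, Integer> idLookup = new HashMap<>();
  private final List<T> values = new ArrayList<>();

  // Hot path: a hit never reaches realAdd(), so derived caches survive.
  public final int add(T value)
  {
    final Integer existing = idLookup.get(value);
    if (existing != null) {
      return existing;
    }
    return realAdd(value);
  }

  // Extracted insertion path; subclasses hook it to invalidate derived state.
  protected int realAdd(T value)
  {
    final int id = values.size();
    values.add(value);
    idLookup.put(value, id);
    return id;
  }
}

class StringDictionarySketch extends DictionarySketch<String>
{
  private int[] cachedSortedLookup; // rebuilt lazily when null

  @Override
  protected int realAdd(String value)
  {
    cachedSortedLookup = null; // stale only when a genuinely new value arrives
    return super.realAdd(value);
  }
}
```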
Re: [PR] Fix a deadlock that can happen when multiple stages execute parallelly in a worker (druid)
asdf2014 commented on code in PR #16524: URL: https://github.com/apache/druid/pull/16524#discussion_r1627360761 ## processing/src/main/java/org/apache/druid/frame/processor/Bouncer.java: ## @@ -32,18 +32,45 @@ /** * Limiter for access to some resource. - * - * Used by {@link FrameProcessorExecutor#runAllFully} to limit the number of outstanding processors. + * + * The class performs the work of a "bouncer", limiting the number of threads that can concurrently access the guarded + * critical sections by the bouncer. The bouncer is initialized with the max number of threads that can enter the + * gaurded section(s) at a time. Review Comment: ```suggestion * guarded section(s) at a time. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
[I] Result level cache key collisions from utf8 encoding (druid)
gianm opened a new issue, #16552: URL: https://github.com/apache/druid/issues/16552 The `ResultLevelCachingQueryRunner` does this when it computes the cache key for a query: ``` final String cacheKeyStr = StringUtils.fromUtf8(strategy.computeResultLevelCacheKey(query)); ``` This is bad because not all byte strings can be represented as Java Strings. For example, the byte arrays `[63, -48, 0, 0, 0, 0, 0, 0]` and `[63, -24, 0, 0, 0, 0, 0, 0]` have the same representation when converted to Strings. Both have an invalid second byte, which gets mapped to the replacement character. These byte arrays are the representations of the doubles `0.25` and `0.75` respectively, which is how this was originally noticed (two queries had a cache key collision when they differed only in that one used `0.25` and one used `0.75` as a quantile parameter). To fix this we need to always use `byte[]`, never `String`, when dealing with cache keys. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
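The collision is easy to reproduce with plain Java (this demo uses standard java.nio decoding rather than Druid's StringUtils, but the effect is the same): both 8-byte keys decode to the identical string because the invalid second byte is replaced by U+FFFD.

```java
import java.nio.charset.StandardCharsets;

public class CacheKeyCollisionDemo
{
  public static void main(String[] args)
  {
    // Big-endian IEEE-754 bit patterns of the doubles 0.25 and 0.75.
    final byte[] keyA = {63, -48, 0, 0, 0, 0, 0, 0};
    final byte[] keyB = {63, -24, 0, 0, 0, 0, 0, 0};

    // 0xD0 and 0xE8 open multi-byte UTF-8 sequences that the following 0x00
    // bytes cannot continue, so both decode to '?', U+FFFD, then six NULs.
    final String strA = new String(keyA, StandardCharsets.UTF_8);
    final String strB = new String(keyB, StandardCharsets.UTF_8);

    System.out.println(strA.equals(strB)); // true: distinct keys, same String
  }
}
```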
Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)
Akshat-Jain commented on code in PR #16481: URL: https://github.com/apache/druid/pull/16481#discussion_r1627296496 ## extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/RetryableS3OutputStream.java: ## @@ -269,52 +213,74 @@ public void close() throws IOException // Closeables are closed in LIFO order closer.register(() -> { // This should be emitted as a metric + long totalChunkSize = (currentChunk.id - 1) * chunkSize + currentChunk.length(); Review Comment: It allows us to not have a Map unnecessarily and helps the rest of the code be cleaner and simpler, which I really liked. Can you please elaborate more on why you'd prefer adding them individually? Thanks! PS: Can the chunkSize change in the middle of the operation? (in case of node failures or rolling upgrades where we update the chunk size or something along those lines?) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)
Akshat-Jain commented on code in PR #16481: URL: https://github.com/apache/druid/pull/16481#discussion_r1627284471 ## extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/S3UploadManager.java: ## @@ -0,0 +1,154 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.druid.storage.s3.output; + +import com.amazonaws.services.s3.model.UploadPartRequest; +import com.amazonaws.services.s3.model.UploadPartResult; +import com.google.common.annotations.VisibleForTesting; +import com.google.inject.Inject; +import org.apache.druid.guice.ManageLifecycle; +import org.apache.druid.java.util.common.RetryUtils; +import org.apache.druid.java.util.common.concurrent.Execs; +import org.apache.druid.java.util.common.lifecycle.LifecycleStart; +import org.apache.druid.java.util.common.lifecycle.LifecycleStop; +import org.apache.druid.java.util.common.logger.Logger; +import org.apache.druid.storage.s3.S3Utils; +import org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3; +import org.apache.druid.utils.RuntimeInfo; + +import java.io.File; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Future; + +/** + * This class manages uploading files to S3 in chunks, while ensuring that the + * number of chunks currently present on local disk does not exceed a specific limit. + */ +@ManageLifecycle +public class S3UploadManager +{ + private final ExecutorService uploadExecutor; + + private static final Logger log = new Logger(S3UploadManager.class); + + @Inject + public S3UploadManager(S3OutputConfig s3OutputConfig, S3ExportConfig s3ExportConfig, RuntimeInfo runtimeInfo) + { +int poolSize = Math.max(4, runtimeInfo.getAvailableProcessors()); +int maxNumConcurrentChunks = computeMaxNumConcurrentChunks(s3OutputConfig, s3ExportConfig); +this.uploadExecutor = createExecutorService(poolSize, maxNumConcurrentChunks); +log.info("Initialized executor service for S3 multipart upload with pool size [%d] and work queue capacity [%d]", + poolSize, maxNumConcurrentChunks); + } + + /** + * Computes the maximum number of concurrent chunks for an S3 multipart upload. + * We want to determine the maximum number of concurrent chunks on disk based on the maximum value of chunkSize + * between the 2 configs: S3OutputConfig and S3ExportConfig. + * + * @param s3OutputConfig The S3 output configuration, which may specify a custom chunk size. + * @param s3ExportConfig The S3 export configuration, which may also specify a custom chunk size. + * @return The maximum number of concurrent chunks. 
+ */ + @VisibleForTesting + int computeMaxNumConcurrentChunks(S3OutputConfig s3OutputConfig, S3ExportConfig s3ExportConfig) + { +long chunkSize = S3OutputConfig.S3_MULTIPART_UPLOAD_MIN_PART_SIZE_BYTES; +if (s3OutputConfig != null && s3OutputConfig.getChunkSize() != null) { + chunkSize = Math.max(chunkSize, s3OutputConfig.getChunkSize()); +} +if (s3ExportConfig != null && s3ExportConfig.getChunkSize() != null) { + chunkSize = Math.max(chunkSize, s3ExportConfig.getChunkSize().getBytes()); +} + +return (int) (S3OutputConfig.S3_MULTIPART_UPLOAD_MAX_PART_SIZE_BYTES / chunkSize); Review Comment: @kfaraz I was trying to do it similar to how it's done in `WorkerStorageParameters`. IIUC, in `WorkerStorageParameters`, we have a hard-coded size of 1 GB (`MINIMUM_SUPER_SORTER_TMP_STORAGE_BYTES = 1_000_000_000L;`) which is used instead of computing the overall available disk space. Also, I don't think we should use the overall disk size available for this calculation, and should rather have a hard-limit? I chose 5 GB as that's the maximum allowed chunk size by S3. In the default case, chunk size is 100 MB, so this means around 50 chunks allowed on disk at once, which seems fine? Thoughts? cc: @cryptoe -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at:
Re: [PR] Fix serde for ArrayOfDoublesSketchConstantPostAggregator. (druid)
gianm commented on code in PR #16550: URL: https://github.com/apache/druid/pull/16550#discussion_r1627259662 ## extensions-core/datasketches/src/test/java/org/apache/druid/query/aggregation/datasketches/tuple/ArrayOfDoublesSketchConstantPostAggregatorTest.java: ## @@ -66,6 +67,24 @@ public void testComparator() Assert.assertEquals(Comparators.alwaysEqual(), postAgg.getComparator()); } + @Test + public void testSerde() throws JsonProcessingException + { +final PostAggregator there = new ArrayOfDoublesSketchConstantPostAggregator( +"p", + "AQEJAwgBzJP/fwIAzT6NGdX0aWUOJvS5EIhpLwA=" +); +DefaultObjectMapper mapper = new DefaultObjectMapper(); +mapper.registerModules(new ArrayOfDoublesSketchModule().getJacksonModules()); +PostAggregator andBackAgain = mapper.readValue( +mapper.writeValueAsString(there), +PostAggregator.class +); Review Comment: Perhaps we could use some approach like `EqualsVerifier`. Whatever magic it does to verify `equals` and `hashCode` could possibly be used to verify serde as well? I never looked inside it to see what it's doing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
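One lightweight shape such a utility could take (a sketch, not an existing Druid or EqualsVerifier API): a generic round-trip assertion that any module's serde test can reuse.

```java
import com.fasterxml.jackson.databind.ObjectMapper;

final class SerdeTestUtil
{
  private SerdeTestUtil() {}

  /**
   * Serializes 'value', reads it back as 'asType' (typically the interface,
   * e.g. PostAggregator.class, so polymorphic type info is exercised), and
   * asserts equality, which also leans on a correct equals() implementation.
   */
  static <T> void assertRoundTrip(ObjectMapper mapper, T value, Class<? super T> asType)
      throws Exception
  {
    final Object back = mapper.readValue(mapper.writeValueAsString(value), asType);
    if (!value.equals(back)) {
      throw new AssertionError("Serde round trip changed value: " + value + " -> " + back);
    }
  }
}
```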
Re: [PR] Fix serde for ArrayOfDoublesSketchConstantPostAggregator. (druid)
gianm commented on PR #16550: URL: https://github.com/apache/druid/pull/16550#issuecomment-2149208873 Fixed the excessive imports. Thanks for reviewing! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
Re: [I] middleManager container is unable to connect a zookeeper container under bridge mode network (druid)
vladComan0 commented on issue #9522: URL: https://github.com/apache/druid/issues/9522#issuecomment-2149195180 @akamensky Any luck finding the answer? I got that same error out of nowhere. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)
kfaraz commented on code in PR #16481: URL: https://github.com/apache/druid/pull/16481#discussion_r1626257336 ## extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/S3UploadManager.java: ## @@ -0,0 +1,154 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.druid.storage.s3.output; + +import com.amazonaws.services.s3.model.UploadPartRequest; +import com.amazonaws.services.s3.model.UploadPartResult; +import com.google.common.annotations.VisibleForTesting; +import com.google.inject.Inject; +import org.apache.druid.guice.ManageLifecycle; +import org.apache.druid.java.util.common.RetryUtils; +import org.apache.druid.java.util.common.concurrent.Execs; +import org.apache.druid.java.util.common.lifecycle.LifecycleStart; +import org.apache.druid.java.util.common.lifecycle.LifecycleStop; +import org.apache.druid.java.util.common.logger.Logger; +import org.apache.druid.storage.s3.S3Utils; +import org.apache.druid.storage.s3.ServerSideEncryptingAmazonS3; +import org.apache.druid.utils.RuntimeInfo; + +import java.io.File; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Future; + +/** + * This class manages uploading files to S3 in chunks, while ensuring that the + * number of chunks currently present on local disk does not exceed a specific limit. + */ +@ManageLifecycle +public class S3UploadManager +{ + private final ExecutorService uploadExecutor; + + private static final Logger log = new Logger(S3UploadManager.class); + + @Inject + public S3UploadManager(S3OutputConfig s3OutputConfig, S3ExportConfig s3ExportConfig, RuntimeInfo runtimeInfo) + { +int poolSize = Math.max(4, runtimeInfo.getAvailableProcessors()); +int maxNumConcurrentChunks = computeMaxNumConcurrentChunks(s3OutputConfig, s3ExportConfig); +this.uploadExecutor = createExecutorService(poolSize, maxNumConcurrentChunks); +log.info("Initialized executor service for S3 multipart upload with pool size [%d] and work queue capacity [%d]", + poolSize, maxNumConcurrentChunks); + } + + /** + * Computes the maximum number of concurrent chunks for an S3 multipart upload. + * We want to determine the maximum number of concurrent chunks on disk based on the maximum value of chunkSize + * between the 2 configs: S3OutputConfig and S3ExportConfig. + * + * @param s3OutputConfig The S3 output configuration, which may specify a custom chunk size. + * @param s3ExportConfig The S3 export configuration, which may also specify a custom chunk size. + * @return The maximum number of concurrent chunks. 
+ */ + @VisibleForTesting + int computeMaxNumConcurrentChunks(S3OutputConfig s3OutputConfig, S3ExportConfig s3ExportConfig) + { +long chunkSize = S3OutputConfig.S3_MULTIPART_UPLOAD_MIN_PART_SIZE_BYTES; +if (s3OutputConfig != null && s3OutputConfig.getChunkSize() != null) { + chunkSize = Math.max(chunkSize, s3OutputConfig.getChunkSize()); +} +if (s3ExportConfig != null && s3ExportConfig.getChunkSize() != null) { + chunkSize = Math.max(chunkSize, s3ExportConfig.getChunkSize().getBytes()); +} + +return (int) (S3OutputConfig.S3_MULTIPART_UPLOAD_MAX_PART_SIZE_BYTES / chunkSize); Review Comment: Shouldn't the numerator be the overall disk size available for keeping chunks? The value used here `S3OutputConfig.S3_MULTIPART_UPLOAD_MAX_PART_SIZE_BYTES` is 5 GB, which just seems to be the maximum part size limit imposed by S3. ## extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/S3UploadManager.java: ## @@ -0,0 +1,154 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software
Re: [PR] Fallback vectorization for FunctionExpr and BaseMacroFunctionExpr. (druid)
clintropolis commented on code in PR #16366: URL: https://github.com/apache/druid/pull/16366#discussion_r1627216880 ## processing/src/main/java/org/apache/druid/query/filter/vector/StringObjectVectorValueMatcher.java: ## @@ -58,7 +58,7 @@ public ReadableVectorMatch match(final ReadableVectorMatch mask, boolean include for (int i = 0; i < mask.getSelectionSize(); i++) { final int rowNum = mask.getSelection()[i]; - if ((value == null && includeUnknown) || Objects.equals(value, vector[rowNum])) { + if ((vector[rowNum] == null && includeUnknown) || Objects.equals(value, vector[rowNum])) { Review Comment: nice :+1: ## processing/src/main/java/org/apache/druid/math/expr/vector/FallbackVectorProcessor.java: ## @@ -0,0 +1,422 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.druid.math.expr.vector; + +import org.apache.druid.error.DruidException; +import org.apache.druid.math.expr.Expr; +import org.apache.druid.math.expr.ExprEval; +import org.apache.druid.math.expr.ExprMacroTable; +import org.apache.druid.math.expr.ExprType; +import org.apache.druid.math.expr.ExpressionType; +import org.apache.druid.math.expr.Function; + +import javax.annotation.Nullable; +import java.util.ArrayList; +import java.util.List; +import java.util.Objects; +import java.util.function.Supplier; + +/** + * Implementation of {@link ExprVectorProcessor} that adapts non-vectorized {@link Expr#eval(Expr.ObjectBinding)}. + * This allows non-vectorized expressions to participate in vectorized queries. + */ +public abstract class FallbackVectorProcessor implements ExprVectorProcessor +{ + final Supplier> fn; + final List adaptedArgs; + + private final ExpressionType outputType; + + private FallbackVectorProcessor( + final Supplier> fn, + final List adaptedArgs, + final ExpressionType outputType + ) + { +this.fn = fn; +this.adaptedArgs = adaptedArgs; +this.outputType = outputType; + } + + /** + * Create a processor for a non-vectorizable {@link Function}. + */ + public static FallbackVectorProcessor create( + final Function function, + final List args, + final Expr.VectorInputBindingInspector inspector + ) + { +final List adaptedArgs = makeAdaptedArgs(args, inspector); +return makeFallbackProcessor( +() -> function.apply(adaptedArgs, UnusedBinding.INSTANCE), +adaptedArgs, +function.getOutputType(inspector, args), +inspector +); + } + + /** + * Create a processor for a non-vectorizable {@link ExprMacroTable.ExprMacro}. 
+ */ + public static FallbackVectorProcessor create( + final ExprMacroTable.ExprMacro macro, + final List args, + final Expr.VectorInputBindingInspector inspector + ) + { +final List adaptedArgs = makeAdaptedArgs(args, inspector); +final Expr adaptedExpr = macro.apply(adaptedArgs); +return makeFallbackProcessor( +() -> adaptedExpr.eval(UnusedBinding.INSTANCE), +adaptedArgs, +adaptedExpr.getOutputType(inspector), +inspector +); + } + + /** + * Helper for the two {@link #create} methods. Makes {@link AdaptedExpr} that can replace the original args to + * the {@link Expr} that requires fallback. + * + * @param args args to the original expr + * @param inspector binding inspector + * + * @return list of {@link AdaptedExpr} + */ + private static List makeAdaptedArgs( + final List args, + final Expr.VectorInputBindingInspector inspector + ) + { +final List adaptedArgs = new ArrayList<>(args.size()); + +for (final Expr arg : args) { + adaptedArgs.add(new AdaptedExpr(arg.asVectorProcessor(inspector), arg)); +} + +return adaptedArgs; + } + + /** + * Helper for the two {@link #create} methods. + * + * @param fn eval function that uses the "adaptedArgs" as inputs + * @param adaptedArgs adapted args from {@link #makeAdaptedArgs(List, Expr.VectorInputBindingInspector)} + * @param outputType output type of the eval from "fn" + * @param inspector binding inspector + */ + @SuppressWarnings({"unchecked", "rawtypes"}) +
Re: [PR] rework cursor creation (druid)
clintropolis commented on code in PR #16533: URL: https://github.com/apache/druid/pull/16533#discussion_r1627197171 ## processing/src/main/java/org/apache/druid/segment/CursorFactory.java: ## @@ -35,12 +38,55 @@ */ public interface CursorFactory { + default CursorMaker asCursorMaker(CursorBuildSpec spec) + { + +return new CursorMaker() +{ + @Override + public boolean canVectorize() + { +return CursorFactory.this.canVectorize(spec.getFilter(), spec.getVirtualColumns(), spec.isDescending()); + } + + @Override + public Sequence makeCursors() + { +return CursorFactory.this.makeCursors( +spec.getFilter(), +spec.getInterval(), +spec.getVirtualColumns(), +spec.getGranularity(), +spec.isDescending(), +spec.getQueryMetrics() +); Review Comment: this is intended so that existing `CursorFactory`/`StorageAdapter` implementations can keep working with query engines without implementing `asCursorMaker` ## processing/src/main/java/org/apache/druid/segment/CursorFactory.java: ## @@ -35,12 +38,55 @@ */ public interface CursorFactory { + default CursorMaker asCursorMaker(CursorBuildSpec spec) + { + +return new CursorMaker() +{ + @Override + public boolean canVectorize() + { +return CursorFactory.this.canVectorize(spec.getFilter(), spec.getVirtualColumns(), spec.isDescending()); + } + + @Override + public Sequence makeCursors() + { +return CursorFactory.this.makeCursors( +spec.getFilter(), +spec.getInterval(), +spec.getVirtualColumns(), +spec.getGranularity(), +spec.isDescending(), +spec.getQueryMetrics() +); + } + + @Override + public VectorCursor makeVectorCursor() + { +return CursorFactory.this.makeVectorCursor( +spec.getFilter(), +spec.getInterval(), +spec.getVirtualColumns(), +spec.isDescending(), +spec.getQueryContext().getVectorSize(), +spec.getQueryMetrics() +); Review Comment: this is intended so that existing `CursorFactory`/`StorageAdapter` implementations can keep working with query engines without implementing `asCursorMaker` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
Re: [PR] rework cursor creation (druid)
clintropolis commented on code in PR #16533: URL: https://github.com/apache/druid/pull/16533#discussion_r1627196668 ## processing/src/main/java/org/apache/druid/segment/CursorFactory.java: ## @@ -35,12 +38,55 @@ */ public interface CursorFactory { + default CursorMaker asCursorMaker(CursorBuildSpec spec) + { + +return new CursorMaker() +{ + @Override + public boolean canVectorize() + { +return CursorFactory.this.canVectorize(spec.getFilter(), spec.getVirtualColumns(), spec.isDescending()); Review Comment: this is intended so that existing `CursorFactory`/`StorageAdapter` implementations can keep working with query engines without implementing `asCursorMaker` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
Re: [PR] Optimise indexer resource usage based on CPU profile (druid)
kgyrtkirk commented on code in PR #16517: URL: https://github.com/apache/druid/pull/16517#discussion_r1627164179 ## processing/src/main/java/org/apache/druid/segment/DictionaryEncodedColumnMerger.java: ## @@ -322,7 +322,7 @@ public void processMergedRow(ColumnValueSelector selector) throws IOException if (encodedValueSerializer instanceof ColumnarMultiIntsSerializer) { ((ColumnarMultiIntsSerializer) encodedValueSerializer).addValues(row); } else { - int value = row.size() == 0 ? 0 : row.get(0); + int value = rowSize == 0 ? 0 : row.get(0); Review Comment: oh - that's true; and it's being used around here as well! But how it works now is that it will call `getRow` for every `size` and `get(i)` invocation as well [see here](https://github.com/apache/druid/blob/30c59042e0cde3c76392afa3ba73d0830b11b0a5/processing/src/main/java/org/apache/druid/segment/DictionaryEncodedColumnIndexer.java#L161). The costly process is producing the `row` itself, currently built by [StringDimensionIndexer$IndexerDimensionSelector#getRow](https://github.com/apache/druid/blob/30c59042e0cde3c76392afa3ba73d0830b11b0a5/processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java#L304); if that could be cached somehow, that would yield the same benefits. There is better access to internals there; I believe if the "same" entry is being used (`currEntry.get()`), then it might be OK to return the previously built row. Although [IncrementalIndexRow](https://github.com/apache/druid/blob/30c59042e0cde3c76392afa3ba73d0830b11b0a5/processing/src/main/java/org/apache/druid/segment/incremental/IncrementalIndexRow.java#L110) looks partially mutable, since the `timestamp` can't be set I think it's not intended to be reused, so I think you could put caching logic there based on whether `currEntry.get()` is the same. To make sure I don't send you on a wild goose chase, I'll ask @clintropolis to possibly double-check the above thinking :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org
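An illustrative sketch of the caching idea above (field and method names are made up; the real change would live in StringDimensionIndexer's IndexerDimensionSelector): the decoded row is rebuilt only when the selector observes a different entry.

```java
final class RowCachingSelectorSketch
{
  private Object lastEntrySeen; // identity of the last IncrementalIndexRow-like entry
  private int[] cachedRow;      // decoded dimension values for that entry

  int[] getRow(Object currentEntry)
  {
    // Identity comparison is deliberate: per the discussion above, entries are
    // not reused, so a new entry object implies the cached row is stale.
    if (currentEntry != lastEntrySeen) {
      cachedRow = decodeRow(currentEntry);
      lastEntrySeen = currentEntry;
    }
    return cachedRow;
  }

  private int[] decodeRow(Object entry)
  {
    // Placeholder for the expensive decoding done by the real getRow().
    return new int[0];
  }
}
```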