xudong963 commented on code in PR #22024:
URL: https://github.com/apache/datafusion/pull/22024#discussion_r3224058038


##########
datafusion/datasource-parquet/src/sampling.rs:
##########
@@ -0,0 +1,540 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Random sampling primitives for parquet scans.
+//!
+//! [`ParquetSampling`] holds the sampling configuration carried on
+//! [`crate::source::ParquetSource`]. The two `apply_*_sampling`
+//! methods mutate a [`ParquetAccessPlan`] in place — they are invoked
+//! by the parquet [`crate::opener`] once the file footer is loaded.
+//!
+//! Selection within a row group is deterministic-but-random per
+//! `(file_index, row_group_index, fraction, cluster_size)`: the methods
+//! seed a `SmallRng` from a hash of those inputs so re-runs match.
+//! The caller supplies a stable `file_index` (for the parquet opener,
+//! that is the execution `partition_index`) so sampling is independent
+//! of the on-disk path string and is reproducible across environments.
+
+use crate::access_plan::ParquetAccessPlan;
+use parquet::arrow::arrow_reader::{RowSelection, RowSelector};
+use rand::Rng;
+
+/// Hierarchical sampling config for parquet scans.
+///
+/// All fractions are in `(0.0, 1.0]`. `None` (or `1.0`) means "no
+/// sampling".
+///
+///   * `row_group_fraction` — within each scanned file, keep this
+///     fraction of row groups. Decision made inside the opener after
+///     the footer is loaded so we sample by actual row-group index.
+///   * `row_fraction` — within each kept row group, keep this fraction
+///     of rows by translating to a `RowSelection` of K small contiguous
+///     windows spread across the row group. The parquet reader uses
+///     the page index to read only the data pages covering the
+///     selected rows, so this gives "page-level" IO savings without
+///     requiring per-column page alignment. Falls back to scanning
+///     whole pages if the page index is missing.
+///   * `row_cluster_size` — controls how the per-row-group target is
+///     split into contiguous windows. Smaller = more diversity, more
+///     page-index lookups; larger = cheaper, fewer regions covered.
+///
+/// **Why this lives here, not as a one-shot `ParquetAccessPlan`:** the
+/// natural entry-point for "I want a sample" is at config time, before
+/// any metadata IO has happened. The actual *which row groups* /
+/// *which rows* selection still needs to be deferred until the opener
+/// has the footer — that's why these fractions get carried through and
+/// applied lazily.
+///
+/// **Why no file-level fraction:** [`crate::source::ParquetSource`]
+/// doesn't own the file list — that lives on `FileScanConfig.file_groups`.
+/// Callers that want to drop files should rebuild the `FileScanConfig`
+/// with a reduced `file_groups`. Adding a file-fraction setter here
+/// would have been a no-op and confusing.
+///
+/// Selection within a row group is deterministic-but-random per
+/// `(file_index, row_group_index, fraction, cluster_size)`: we seed
+/// a `SmallRng` from a hash of those inputs so re-runs match exactly.
+/// The caller-supplied `file_index` is a stable per-file identifier
+/// (the parquet opener uses the execution `partition_index`), keeping
+/// sampling reproducible across environments without the keying
+/// depending on object-store paths.
+#[derive(Debug, Clone)]
+pub struct ParquetSampling {

Review Comment:
   Does `ParquetSampling` need to be `pub`?
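
   To make the visibility question concrete, here is a minimal sketch of the crate-private alternative. It is not the PR's code. The field types (`Option<f64>`), the `with_*` setters, the getters, and the simplified `ParquetSource` are all assumptions for illustration; only the names `ParquetSampling`, `ParquetSource`, `row_group_fraction`, and `row_fraction` come from the diff above.

```rust
mod sampling {
    /// Stand-in for the struct under review, crate-visible instead of `pub`.
    /// Field types are assumed from the doc comment ("`None` means no sampling").
    #[derive(Debug, Clone, Default)]
    pub(crate) struct ParquetSampling {
        pub(crate) row_group_fraction: Option<f64>,
        pub(crate) row_fraction: Option<f64>,
    }
}

mod source {
    use super::sampling::ParquetSampling;

    /// Simplified stand-in for the real source type: the sampling config is a
    /// private field, so the config struct never appears in the public API.
    #[derive(Debug, Clone, Default)]
    pub struct ParquetSource {
        sampling: ParquetSampling,
    }

    impl ParquetSource {
        /// Hypothetical builder-style setter exposed to users.
        pub fn with_row_group_fraction(mut self, fraction: f64) -> Self {
            self.sampling.row_group_fraction = Some(fraction);
            self
        }

        /// Hypothetical builder-style setter exposed to users.
        pub fn with_row_fraction(mut self, fraction: f64) -> Self {
            self.sampling.row_fraction = Some(fraction);
            self
        }

        /// Hypothetical getter returning a plain value, not the config struct.
        pub fn row_group_fraction(&self) -> Option<f64> {
            self.sampling.row_group_fraction
        }

        /// Hypothetical getter returning a plain value, not the config struct.
        pub fn row_fraction(&self) -> Option<f64> {
            self.sampling.row_fraction
        }
    }
}
```

   With this shape the struct can change freely without being a public API commitment. The trade-off is that callers cannot build or inspect a `ParquetSampling` directly, so `pub` is only needed if a public method takes or returns the struct itself.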


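For readers following the doc comment in the diff, the "deterministic-but-random" row selection can be sketched as below. This is an illustration under stated assumptions, not the PR's implementation: the diff says a `SmallRng` is seeded from a hash of `(file_index, row_group_index, fraction, cluster_size)`, while this sketch substitutes std's `DefaultHasher` and a SplitMix64 step so it needs no external crate. The function names and the window placement strategy (spreading the unselected rows evenly between windows) are likewise hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a seed from the sampling key. `f64` is not `Hash`, so the fraction
/// is hashed through its bit pattern. `DefaultHasher::new()` uses fixed keys,
/// but its algorithm is not guaranteed stable across Rust releases.
fn sampling_seed(file_index: usize, row_group_index: usize, fraction: f64, cluster_size: usize) -> u64 {
    let mut hasher = DefaultHasher::new();
    file_index.hash(&mut hasher);
    row_group_index.hash(&mut hasher);
    fraction.to_bits().hash(&mut hasher);
    cluster_size.hash(&mut hasher);
    hasher.finish()
}

/// One SplitMix64 step: a small stand-in PRNG (assumption; the PR uses `SmallRng`).
fn next_u64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Return `(start, len)` windows covering about `fraction` of `num_rows`,
/// in ascending order and without overlap.
fn sample_windows(
    file_index: usize,
    row_group_index: usize,
    num_rows: usize,
    fraction: f64,
    cluster_size: usize,
) -> Vec<(usize, usize)> {
    if num_rows == 0 {
        return Vec::new();
    }
    // In this sketch, anything outside (0.0, 1.0) means "no sampling".
    if !(fraction > 0.0 && fraction < 1.0) || cluster_size == 0 {
        return vec![(0, num_rows)];
    }
    let target = (((num_rows as f64) * fraction).ceil() as usize).clamp(1, num_rows);
    let window = cluster_size.min(target);
    let k = (target + window - 1) / window; // number of windows, rounded up
    let slack = num_rows - target; // rows that will be skipped in total
    let mut state = sampling_seed(file_index, row_group_index, fraction, cluster_size);
    let mut windows = Vec::with_capacity(k);
    let mut segment_start = 0;
    let mut remaining = target;
    for i in 0..k {
        let len = window.min(remaining);
        // Each segment holds one window plus an even share of the skipped rows.
        let seg_slack = slack / k + usize::from(i < slack % k);
        let offset = (next_u64(&mut state) % (seg_slack as u64 + 1)) as usize;
        windows.push((segment_start + offset, len));
        segment_start += len + seg_slack;
        remaining -= len;
    }
    windows
}
```

Because the seed depends only on the key, calling `sample_windows` twice with the same arguments yields the same windows. The gaps and windows could then be translated into alternating `RowSelector::skip` and `RowSelector::select` entries to build a `RowSelection`.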

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

