laffra commented on issue #2525:
URL: 
https://github.com/apache/iceberg-python/issues/2525#issuecomment-3339072546

   The problem with this benchmark is that it is a microbenchmark, so we are
   just profiling module load times. OpenAI also noticed that. It would be
   interesting to see a benchmark with larger files or more files to focus on
   specific scenarios.
   
   On Fri, 26 Sept 2025, 15:45 bradcnovi, ***@***.***> wrote:
   
   > *bradcnovi* left a comment (apache/iceberg-python#2525)
   > <https://github.com/apache/iceberg-python/issues/2525#issuecomment-3338801386>
   > image.png (view on web)
   > <https://github.com/user-attachments/assets/99b2325d-c97b-4c08-b568-e49e9a8397e5>
   > image.png (view on web)
   > <https://github.com/user-attachments/assets/d9a418f6-dead-4792-a0c7-6aec44ba33cd>
   >
   > Using the Analysis feature to ask OpenAI what improvements could be made:
   > ------------------------------
   >
   > Thanks for the trace. With only the import-time profile, I’ll focus on
   > startup/architecture and dependency choices that affect both design and
   > performance. If you can share the repo or a high-level module layout, I can
   > tighten this with concrete diffs.
   >
   > What the trace tells us
   >
   >    - tests/benchmark.test_benchmark import takes ~1.3s, dominated by
   >    imports.
   >    - pyarrow.parquet costs ~0.6s.
   >    - A separate parquet.core import costs another ~0.6s. This looks like
   >    a different package named parquet, not pyarrow.parquet. If unintentional,
   >    this is likely duplicate/competing Parquet engines.
   >    - fs import costs ~0.4s (likely PyFilesystem2). If you intended
   >    fsspec, you may be pulling in both ecosystems.
   >    - numpy import is ~0.2s (normal).
   >    - importlib bootstrap external overhead shows multiple discrete loads,
   >    indicating many top-level imports.
   >
   > High-impact improvements
   >
   >    1. Make the package import “cheap”
   >
   >
   >    - Goal: package import < 100–200 ms; defer heavy deps to call sites.
   >    - Move heavy imports (pyarrow, parquet, fs/fsspec, numpy) inside
   >    functions/methods that use them.
   >    - Use typing.TYPE_CHECKING for type-only imports to avoid runtime
   >    imports.
   >    - Keep __init__.py minimal: export symbols, avoid any I/O, logging
   >    setup, or configuration parsing at import time.
   >    - If you expose a CLI, keep the entry point very thin and import heavy
   >    submodules only after parsing arguments (so --help is instant).
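   >
   >    A rough sketch of the call-site-import pattern (module and symbol names
   >    here are placeholders, not taken from your repo):
   >
   >        # mypkg/io.py -- hypothetical module; heavy deps are deferred
   >        from __future__ import annotations
   >
   >        from typing import TYPE_CHECKING
   >
   >        if TYPE_CHECKING:
   >            # Type-only import: never executed at runtime, so pyarrow stays
   >            # out of the startup path.
   >            import pyarrow as pa
   >
   >        def read_table(path: str) -> "pa.Table":
   >            # The heavy dependency is imported at the call site, not at
   >            # module import time.
   >            import pyarrow.parquet as pq
   >            return pq.read_table(path)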
   >
   >
   >    2. Resolve Parquet engine duplication
   >
   >
   >    - Your trace shows both pyarrow.parquet and parquet.core importing.
   >    Unless you intentionally support multiple engines, this is a red flag.
   >    - Pick one default engine (pyarrow or fastparquet). If supporting
   >    multiple, implement a plugin/strategy pattern:
   >       - Provide an IO facade (e.g., io.read_parquet / io.write_parquet).
   >       - Dynamically select engine via parameter, config, or environment.
   >       - Only import the selected engine inside the function body.
   >       - Make other engines optional extras (pip install
   >       brad_pydata2025[fastparquet]).
   >    - Add a test that fails if both are imported during a simple import of
   >    your top-level package.
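   >
   >    A minimal sketch of such a facade (assuming pyarrow and fastparquet as
   >    the two engines; all names are illustrative):
   >
   >        # mypkg/io.py -- hypothetical facade; only the selected engine is imported
   >        from typing import Any, Callable
   >
   >        _ENGINES: dict[str, Callable[[str], Any]] = {}
   >
   >        def _load_engine(name: str) -> Callable[[str], Any]:
   >            # The import happens inside the function body, per engine.
   >            if name == "pyarrow":
   >                import pyarrow.parquet as pq
   >                return lambda path: pq.read_table(path)
   >            if name == "fastparquet":
   >                import fastparquet
   >                return lambda path: fastparquet.ParquetFile(path).to_pandas()
   >            raise ValueError(f"unknown parquet engine: {name!r}")
   >
   >        def read_parquet(path: str, engine: str = "pyarrow") -> Any:
   >            # Resolve the engine once and cache it for later calls.
   >            if engine not in _ENGINES:
   >                _ENGINES[engine] = _load_engine(engine)
   >            return _ENGINES[engine](path)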
   >
   >
   >    3. Filesystem abstraction hygiene
   >
   >
   >    - If you only need fsspec, avoid importing fs (PyFilesystem2). If you
   >    support both, wrap them behind a thin “FS adapter” interface and import
   >    on demand based on the URL scheme or config.
   >    - Avoid importing the cloud backends at module import. Let fsspec
   >    lazily resolve protocol handlers (e.g., s3fs/gcsfs) when encountering an
   >    s3:// or gs:// path.
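   >
   >    For example (a sketch; the helper name is made up):
   >
   >        # mypkg/fs.py -- open a URL without importing any backend up front
   >        def open_readable(url: str):
   >            # fsspec resolves the protocol handler (s3fs, gcsfs, ...) lazily
   >            # from the URL scheme, so cloud backends load only when used.
   >            import fsspec
   >            return fsspec.open(url, mode="rb")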
   >
   >
   >    4. Benchmark isolation
   >
   >
   >    - Ensure benchmarks don’t include import time in the inner timings.
   >    Use pytest-benchmark or asv; pre-import modules in a setup phase so
   >    measurements focus on the target function.
   >    - Provide warm-up iterations and cold/warm cache scenarios explicitly.
   >    - Store benchmark data and environment info so regressions are
   >    attributable.
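   >
   >    A pytest-benchmark sketch that keeps imports and setup out of the timed
   >    region (file layout and sizes are placeholders):
   >
   >        # tests/benchmark/test_read.py
   >        import pyarrow as pa
   >        import pyarrow.parquet as pq  # imported at collection time, not timed
   >
   >        def test_read_small_file(benchmark, tmp_path):
   >            path = tmp_path / "t.parquet"
   >            pq.write_table(pa.table({"x": list(range(1000))}), path)
   >
   >            # Only the call under test is timed; warmup rounds smooth out
   >            # cold-cache effects.
   >            result = benchmark.pedantic(
   >                pq.read_table, args=(path,), rounds=20, warmup_rounds=3
   >            )
   >            assert result.num_rows == 1000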
   >
   >
   >    5. Enforce an import-time budget in CI
   >
   >
   >    - Add a test using python -X importtime or importtime/import_profiler
   >    to assert the package import stays under a threshold on CI hardware.
   >    - Track import-time regressions over PRs.
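   >
   >    One way to enforce the budget (the package name and threshold are
   >    placeholders to tune for your project and CI hardware):
   >
   >        # tests/test_import_time.py
   >        import subprocess
   >        import sys
   >        import time
   >
   >        def test_cold_import_under_budget():
   >            # A fresh interpreter makes the import genuinely cold.
   >            start = time.perf_counter()
   >            subprocess.run([sys.executable, "-c", "import mypkg"], check=True)
   >            elapsed = time.perf_counter() - start
   >            assert elapsed < 1.0, f"import took {elapsed:.2f}s"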
   >
   >
   >    6. Lazy submodule loading
   >
   >
   >    - Consider lazy_loader (or a small custom __getattr__ in __init__.py) to
   >    expose submodules without importing them upfront. This works well for
   >    large optional areas like io, datasets, and plugins.
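   >
   >    The __getattr__ variant is only a few lines (submodule names are
   >    examples):
   >
   >        # mypkg/__init__.py -- PEP 562 lazy submodule loading
   >        import importlib
   >
   >        _SUBMODULES = {"io", "datasets", "plugins"}
   >
   >        def __getattr__(name):
   >            # Heavy submodules are imported only on first attribute access.
   >            if name in _SUBMODULES:
   >                return importlib.import_module(f"{__name__}.{name}")
   >            raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
   >
   >        def __dir__():
   >            return sorted(set(globals()) | _SUBMODULES)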
   >
   >
   >    7. Configuration and logging at runtime, not import time
   >
   >
   >    - Don’t parse env vars or read config files at import time. Provide an
   >    explicit load_config() call or initialize on first use.
   >    - Avoid logging.basicConfig at import; leave logging configuration to
   >    the application/CLI entry point.
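   >
   >    For example (the config keys and environment variable names are
   >    invented):
   >
   >        # mypkg/config.py -- nothing is read at import time
   >        import functools
   >        import os
   >
   >        @functools.lru_cache(maxsize=1)
   >        def load_config() -> dict:
   >            # The environment is read only when the first caller needs it.
   >            return {
   >                "parquet_engine": os.environ.get("MYPKG_PARQUET_ENGINE", "pyarrow"),
   >                "cache_dir": os.environ.get("MYPKG_CACHE_DIR", "~/.cache/mypkg"),
   >            }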
   >
   >
   >    8. Packaging and dependency footprint
   >
   >
   >    - Only declare hard requirements that are universally needed. Put IO
   >    and engine choices behind extras_require to reduce baseline import
   >    weight.
   >    - If you need a version string, use importlib.metadata.version("pkg")
   >    in a function, or inject a constant at build-time; avoid pkg_resources
   >    (slow).
   >    - Prefer small utility deps over “kitchen sink” helpers to keep
   >    transitive import cost low.
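   >
   >    For the version string, something like this (the package name is a
   >    placeholder):
   >
   >        # mypkg/_version.py -- avoids pkg_resources entirely
   >        import functools
   >
   >        @functools.lru_cache(maxsize=1)
   >        def get_version() -> str:
   >            # importlib.metadata is stdlib and far cheaper than pkg_resources.
   >            from importlib.metadata import version
   >            return version("mypkg")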
   >
   > PyArrow-specific guidance
   >
   >    - If you only need Parquet metadata quickly: use
   >    pyarrow.parquet.read_metadata or ParquetFile(path).metadata; this avoids
   >    constructing a dataset or scanning files.
   >    - Avoid pandas roundtrips. Operate on Arrow Tables/RecordBatches where
   >    possible to keep zero-copy and reduce overhead.
   >    - For large scans, use datasets with filtering/predicate pushdown, and
   >    set use_threads=True (default) if CPU parallelism helps. For many small
   >    files, use fsspec caching or file coalescing to reduce overhead.
   >    - If you sometimes only write Parquet, defer importing read-related
   >    modules to keep import lighter.
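   >
   >    Metadata-only access looks like this (the path is a placeholder):
   >
   >        import pyarrow.parquet as pq
   >
   >        # Reads only the Parquet footer; no dataset construction, no scan.
   >        meta = pq.read_metadata("data/part-0.parquet")
   >        print(meta.num_rows, meta.num_row_groups)
   >
   >        # Arrow schema without reading any row data.
   >        pf = pq.ParquetFile("data/part-0.parquet")
   >        print(pf.schema_arrow)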
   >
   > API design suggestions
   >
   >    - Define a stable core API that is independent of any specific engine
   >    or filesystem. Example layers:
   >       - core/: pure logic and data models; no heavy deps.
   >       - io/: adapters for parquet, csv, etc.; each engine in a submodule
   >       with on-demand import.
   >       - fs/: adapters for filesystems (fsspec, PyFilesystem2) with a small
   >       registry and on-demand import.
   >       - cli/: argument parsing and thin orchestration. No heavy import in
   >       module scope.
   >    - Use protocols or abstract base classes to define engine and
   >    filesystem contracts. This keeps the core testable without heavy deps.
   >    - Keep data models with dataclasses or pydantic v2 (if needed). If
   >    using pydantic v1, consider migrating to v2 for speed and import time
   >    benefits.
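   >
   >    The engine contract can be a small Protocol so core/ never imports an
   >    engine (names are illustrative):
   >
   >        # mypkg/core/engine.py -- no heavy dependencies here
   >        from typing import Any, Protocol, runtime_checkable
   >
   >        @runtime_checkable
   >        class ParquetEngine(Protocol):
   >            def read(self, path: str) -> Any: ...
   >            def write(self, table: Any, path: str) -> None: ...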
   >
   > Testing and quality gates
   >
   >    - tests/test_import_time.py: asserts cold import under a budget.
   >    - tests/test_optional_deps.py: verifies no ImportError on core package
   >    without extras installed.
   >    - tests/test_engine_selection.py: ensures only the chosen engine gets
   >    imported (inspect sys.modules).
   >    - Benchmark suite with fixed datasets and asv or pytest-benchmark.
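   >
   >    The engine-selection check can run in a subprocess so the test session's
   >    own imports don't pollute sys.modules (package name is a placeholder):
   >
   >        # tests/test_engine_selection.py
   >        import subprocess
   >        import sys
   >
   >        CHECK = (
   >            "import sys, mypkg; "
   >            "assert 'fastparquet' not in sys.modules, 'fastparquet leaked'"
   >        )
   >
   >        def test_plain_import_pulls_only_default_engine():
   >            subprocess.run([sys.executable, "-c", CHECK], check=True)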
   >
   > Concrete quick wins you can do today
   >
   >    - Move import pyarrow and import parquet (the non-pyarrow one) out of
   >    module top levels into the functions that use them.
   >    - Remove from __init__.py any logic besides symbol exports.
   >    - Decide on a single default Parquet engine and put the other behind
   >    an optional extra.
   >    - Replace from X import Y at top level across wide modules with local
   >    imports at use sites, especially for pyarrow, parquet, fs/fsspec,
   >    pandas, and numpy.
   >    - Add a small import-time CI test and a benchmark that pre-imports the
   >    module in setup.
   >
   > If you share your repo structure (package layout and a few key modules),
   > I’ll propose a concrete module graph, show where to move imports, and
   > outline a minimal plugin registry for Parquet engines and filesystems with
   > code snippets.
   >
   

