skrawcz commented on code in PR #1376:
URL: https://github.com/apache/hamilton/pull/1376#discussion_r2328349888
##########
hamilton-core/README.md:
##########
@@ -0,0 +1,41 @@
+# Read carefully
+
+> Use at your own risk
+
+This directory contains code for the package `sf-hamilton-core`. It is a drop-in replacement of `sf-hamilton`, with two changes:
+- disable plugin autoloading
+- make `pandas` and `numpy` optional dependencies; and remove `networkx` dependency (currently unused).
+
+This makes the Hamilton package a much lighter install and solves long library loading time.
+
+## As a user
+If you want to try `sf-hamilton-core`, you need to:
+1. Remove your current Hamilton installation: `pip uninstall sf-hamilton`
+2. Install Hamilton core `pip install sf-hamilton-core`
+3. Check installation `pip list` should only include `sf-hamilton-core`.
+
+This will install a different Python package with the name `hamilton` with the smaller dependencies and plugin autoloading disabled.
+
+It should be a drop-in replacement and your existing Hamilton code should just work. Though, if you're relying on plugins (e.g., parquet materializers, dataframe result builders), you will need to manually load them.
+
+
+## How does it work
+
+
+## Why is another package `sf-hamilton` necessary
+This exists to prevent backwards incompatible changes for people who `pip install sf-hamilton` and use it in production. It is a temporary solution until a major release `sf-hamilton==2.0.0` could allow breaking changes and a more robust solution.
+
+### Disable plugin autoloading
+Hamilton has generous number of plugins (`pandas`, `polars`, `mlflow`, `spark`). To give a good user experience, Hamilton autoloads plugins based on the available Python libraries in the current Python environment. For example, `to.mlflow()` becomes available if `mlflow` is installed. Autoloaded features notably include materializers like `from_.parquet` and `to.parquet` and data validators (pydantic, pandera, etc.)
+
+The issue with this approach is that Python environment with a lot of dependencies, common in data science, can be very slow to start because of all the imports. Currently, Hamilton allows to disable autoloading via a user config or Python code. This require manual setups and is not the best default for some users.
+
+### `pandas` and `numpy` dependencies
+Hamilton was initially created for workflows that used `pandas` and `numpy` heavily. For this reason, `numpy` and `pandas` are imported at the top-level of module `hamilton.base`. Because of the package structure, as a Hamilton user, you're importing `pandas` and `numpy` every time you import `hamilton`.
+
+A reasonable change would be to move `numpy` and `pandas` to a "lazy" location. Then, dependencies would only be imported when features requiring them are used and they could be removed from `pyproject.toml`. Unfortunately, plugin autoloading defaults make this solution a significant breaking change and insatisfactory.

Review Comment:
```suggestion
A reasonable change would be to move `numpy` and `pandas` to a "lazy" location. Then, dependencies would only be imported when features requiring them are used and they could be removed from `pyproject.toml`. Unfortunately, plugin autoloading defaults make this solution a significant breaking change and unsatisfactory.
```

##########
hamilton/base.py:
##########
@@ -20,21 +20,22 @@
 It cannot import hamilton.graph, or hamilton.driver.
 """
+from __future__ import annotations
+
 import abc
 import collections
 import logging
-from typing import Any, Dict, List, Optional, Tuple, Type, Union
-
-import numpy as np
-import pandas as pd
-from pandas.core.indexes import extension as pd_extension
+from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Type, Union
+from hamilton import htypes
 from hamilton.lifecycle import api as lifecycle_api
-try:
-    from . import htypes, node
-except ImportError:
-    import node
+if TYPE_CHECKING:

Review Comment:
comment as to importance of this
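
For context on the `if TYPE_CHECKING:` line under review, here is a minimal sketch of the general pattern; it is not the PR's actual code. `build_dataframe` is a hypothetical function, and the sketch quotes the return annotation rather than relying on `from __future__ import annotations` (which the diff adds for the same effect), so the snippet can be pasted anywhere in a file.

```python
from typing import TYPE_CHECKING, Any, Dict

if TYPE_CHECKING:
    # typing.TYPE_CHECKING is False at runtime, so this import is seen only
    # by static type checkers and is never executed when the module loads.
    import pandas as pd


def build_dataframe(outputs: Dict[str, Any]) -> "pd.DataFrame":
    """Hypothetical helper: pandas is imported on first call, not at module import."""
    import pandas as pd  # deferred import; raises ImportError if pandas is absent

    return pd.DataFrame(outputs)
```

Defining the function neither imports `pandas` nor requires it to be installed; only calling it does. That is presumably the point worth capturing in the requested comment.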
