Fokko commented on code in PR #7921:
URL: https://github.com/apache/iceberg/pull/7921#discussion_r1254632388
##########
python/pyiceberg/catalog/__init__.py:
##########
@@ -137,6 +145,10 @@ def infer_catalog_type(name: str, catalog_properties:
RecursiveDict) -> Optional
return CatalogType.REST
elif uri.startswith("thrift"):
return CatalogType.HIVE
+ elif uri.startswith("postgresql"):
+ return CatalogType.JDBC
+ elif uri.startswith("file"):
Review Comment:
I would also be a bit hesitant with sqlite, since PyIceberg can run multiple
processes that access the same database, and this is not supported.
##########
python/pyiceberg/catalog/jdbc.py:
##########
@@ -0,0 +1,452 @@
+import sqlite3
+from typing import (
+ Any,
+ Dict,
+ List,
+ Optional,
+ Set,
+ Union,
+)
+from urllib.parse import urlparse
+
+import psycopg2 as db
+from psycopg2.errors import UniqueViolation
+from psycopg2.extras import DictCursor
+
+from pyiceberg.catalog import (
+ METADATA_LOCATION,
+ PREVIOUS_METADATA_LOCATION,
+ Catalog,
+ Identifier,
+ Properties,
+ PropertiesUpdateSummary,
+)
+from pyiceberg.exceptions import (
+ NamespaceAlreadyExistsError,
+ NamespaceNotEmptyError,
+ NoSuchNamespaceError,
+ NoSuchPropertyException,
+ NoSuchTableError,
+ TableAlreadyExistsError,
+)
+from pyiceberg.io import load_file_io
+from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionSpec
+from pyiceberg.schema import Schema
+from pyiceberg.serializers import FromInputFile
+from pyiceberg.table import Table
+from pyiceberg.table.metadata import new_table_metadata
+from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder
+from pyiceberg.typedef import EMPTY_DICT
+
+JDBC_URI = "uri"
+
+# Catalog tables
+CATALOG_TABLE_NAME = "iceberg_tables"
+CATALOG_NAME = "catalog_name"
+TABLE_NAMESPACE = "table_namespace"
+TABLE_NAME = "table_name"
Review Comment:
I think we should embed them. If we want to make them configurable, then we
should do this through the config. I agree that it is better to embed them for
now.
##########
python/pyiceberg/catalog/__init__.py:
##########
@@ -75,6 +75,7 @@ class CatalogType(Enum):
HIVE = "hive"
GLUE = "glue"
DYNAMODB = "dynamodb"
+ JDBC = "jdbc"
Review Comment:
That's a good point. I would go for SQL. I think we can mention JDBC in the
docs to help users to understand that it is the same thing.
##########
python/pyproject.toml:
##########
@@ -61,6 +61,7 @@ thrift = { version = ">=0.13.0,<1.0.0", optional = true }
boto3 = { version = ">=1.17.106", optional = true }
s3fs = { version = ">=2021.08.0,<2024.1.0", optional = true } # Upper bound
set arbitrarily, to be reassessed in early 2024.
adlfs = { version = ">=2021.07.0,<2024.1.0", optional = true } # Upper bound
set arbitrarily, to be reassessed in early 2024.
+psycopg2-binary = { version = ">=2.9.6", optional = true }
Review Comment:
Yes, so there are multiple things here. The `DB-API` spec should be able to
abstract the different databases, while SQLAlchemy is a full ORM layer.
Personally, I'm not a huge fan of ORM layers in general because it takes the
common denominator, and you won't be able to optimize (SQLAlchemy allows you to
do this to some extent). Also, with Airflow we experienced some memory
pressure, but I won't expect PyIceberg to have many open and concurrent
connections.
I think @cccs-eric also noticed that while libraries implement the `DB_API`
specification, there are still subtle differences:
- [Postgres prepared statements](https://www.psycopg.org/docs/usage.html):
`INSERT INTO some_table (an_int, a_date, a_string) VALUES (%s, %s, %s);`.
- [SQLite prepared
statements](https://docs.python.org/3/library/sqlite3.html): `INSERT INTO movie
VALUES(?, ?, ?)`.
I don't think not-using prepared statements is an option due to security
considerations. Therefore I would lean to switching to `SQLAlchemy`. Another
nice feature is that it also supports [generating the `CREATE TABLE`
statements](https://www.tutorialspoint.com/sqlalchemy/sqlalchemy_core_creating_table.htm).
This also makes it easier to add engines, and run tests against them.
##########
python/pyiceberg/catalog/__init__.py:
##########
@@ -137,6 +145,10 @@ def infer_catalog_type(name: str, catalog_properties:
RecursiveDict) -> Optional
return CatalogType.REST
elif uri.startswith("thrift"):
return CatalogType.HIVE
+ elif uri.startswith("postgresql"):
+ return CatalogType.JDBC
+ elif uri.startswith("file"):
Review Comment:
Looking at the [URLs that SQL
Alchemy](https://docs.sqlalchemy.org/en/20/core/engines.html#database-urls) we
can limit it to Postgres for now.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]