[GitHub] [iceberg] Fokko commented on a diff in pull request #7921: Initial code for pyiceberg JDBC Catalog

via GitHub Thu, 06 Jul 2023 09:52:59 -0700


Fokko commented on code in PR #7921:
URL: https://github.com/apache/iceberg/pull/7921#discussion_r1254632388



##########
python/pyiceberg/catalog/__init__.py:
##########
@@ -137,6 +145,10 @@ def infer_catalog_type(name: str, catalog_properties: 
RecursiveDict) -> Optional
                 return CatalogType.REST
             elif uri.startswith("thrift"):
                 return CatalogType.HIVE
+            elif uri.startswith("postgresql"):
+                return CatalogType.JDBC
+            elif uri.startswith("file"):

Review Comment:
   I would also be a bit hesitant with sqlite, since PyIceberg can run multiple 
processes that access the same database, and this is not supported.



##########
python/pyiceberg/catalog/jdbc.py:
##########
@@ -0,0 +1,452 @@
+import sqlite3
+from typing import (
+    Any,
+    Dict,
+    List,
+    Optional,
+    Set,
+    Union,
+)
+from urllib.parse import urlparse
+
+import psycopg2 as db
+from psycopg2.errors import UniqueViolation
+from psycopg2.extras import DictCursor
+
+from pyiceberg.catalog import (
+    METADATA_LOCATION,
+    PREVIOUS_METADATA_LOCATION,
+    Catalog,
+    Identifier,
+    Properties,
+    PropertiesUpdateSummary,
+)
+from pyiceberg.exceptions import (
+    NamespaceAlreadyExistsError,
+    NamespaceNotEmptyError,
+    NoSuchNamespaceError,
+    NoSuchPropertyException,
+    NoSuchTableError,
+    TableAlreadyExistsError,
+)
+from pyiceberg.io import load_file_io
+from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionSpec
+from pyiceberg.schema import Schema
+from pyiceberg.serializers import FromInputFile
+from pyiceberg.table import Table
+from pyiceberg.table.metadata import new_table_metadata
+from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder
+from pyiceberg.typedef import EMPTY_DICT
+
+JDBC_URI = "uri"
+
+# Catalog tables
+CATALOG_TABLE_NAME = "iceberg_tables"
+CATALOG_NAME = "catalog_name"
+TABLE_NAMESPACE = "table_namespace"
+TABLE_NAME = "table_name"

Review Comment:
   I think we should embed them. If we want to make them configurable, then we 
should do this through the config. I agree that it is better to embed them for 
now.



##########
python/pyiceberg/catalog/__init__.py:
##########
@@ -75,6 +75,7 @@ class CatalogType(Enum):
     HIVE = "hive"
     GLUE = "glue"
     DYNAMODB = "dynamodb"
+    JDBC = "jdbc"

Review Comment:
   That's a good point. I would go for SQL. I think we can mention JDBC in the 
docs to help users to understand that it is the same thing.



##########
python/pyproject.toml:
##########
@@ -61,6 +61,7 @@ thrift = { version = ">=0.13.0,<1.0.0", optional = true }
 boto3 = { version = ">=1.17.106", optional = true }
 s3fs = { version = ">=2021.08.0,<2024.1.0", optional = true } # Upper bound 
set arbitrarily, to be reassessed in early 2024.
 adlfs = { version = ">=2021.07.0,<2024.1.0", optional = true } # Upper bound 
set arbitrarily, to be reassessed in early 2024.
+psycopg2-binary = { version = ">=2.9.6", optional = true }

Review Comment:
   Yes, so there are multiple things here. The `DB-API` spec should be able to 
abstract the different databases, while SQLAlchemy is a full ORM layer.
   
   Personally, I'm not a huge fan of ORM layers in general because it takes the 
common denominator, and you won't be able to optimize (SQLAlchemy allows you to 
do this to some extent). Also, with Airflow we experienced some memory 
pressure, but I won't expect PyIceberg to have many open and concurrent 
connections.
   
   I think @cccs-eric also noticed that while libraries implement the `DB_API` 
specification, there are still subtle differences:
   
   - [Postgres prepared statements](https://www.psycopg.org/docs/usage.html): 
`INSERT INTO some_table (an_int, a_date, a_string) VALUES (%s, %s, %s);`.
   - [SQLite prepared 
statements](https://docs.python.org/3/library/sqlite3.html): `INSERT INTO movie 
VALUES(?, ?, ?)`. 
   
   I don't think not-using prepared statements is an option due to security 
considerations. Therefore I would lean to switching to `SQLAlchemy`. Another 
nice feature is that it also supports [generating the `CREATE TABLE` 
statements](https://www.tutorialspoint.com/sqlalchemy/sqlalchemy_core_creating_table.htm).
 This also makes it easier to add engines, and run tests against them.



##########
python/pyiceberg/catalog/__init__.py:
##########
@@ -137,6 +145,10 @@ def infer_catalog_type(name: str, catalog_properties: 
RecursiveDict) -> Optional
                 return CatalogType.REST
             elif uri.startswith("thrift"):
                 return CatalogType.HIVE
+            elif uri.startswith("postgresql"):
+                return CatalogType.JDBC
+            elif uri.startswith("file"):

Review Comment:
   Looking at the [URLs that SQL 
Alchemy](https://docs.sqlalchemy.org/en/20/core/engines.html#database-urls) we 
can limit it to Postgres for now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] Fokko commented on a diff in pull request #7921: Initial code for pyiceberg JDBC Catalog

Reply via email to