paleolimbot commented on code in PR #110:
URL: https://github.com/apache/datafusion-java/pull/110#discussion_r3418012168


##########
native-ffi/include/datafusion_scan.h:
##########
@@ -0,0 +1,116 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Plain-C scan ABI over the Arrow C Data / C Stream interface.
+//
+// The only "rich" types crossing this boundary are the standard Arrow C
+// structs `ArrowSchema` and `ArrowArrayStream` (from Arrow's abi.h), which any
+// Arrow implementation can produce/consume. Everything else is C primitives
+// and borrowed (ptr, len) views. No JVM/JNI types appear here, by design.
+
+#ifndef DATAFUSION_SCAN_H
+#define DATAFUSION_SCAN_H
+
+#include <stddef.h>
+#include <stdint.h>
+
+#include "arrow/c/abi.h"  // struct ArrowSchema, struct ArrowArrayStream
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+// --- Status codes ----------------------------------------------------------
+// 0 on success; nonzero classifies the failure. On error the call also writes
+// a malloc'd, NUL-terminated message to *out_err (free with df_error_free).
+typedef enum {
+  DF_OK = 0,
+  DF_INVALID_ARGUMENT = 1,
+  DF_UNKNOWN_PROVIDER = 2,
+  DF_PROVIDER_BUILD = 3,
+  DF_PLANNING = 4,
+  DF_EXECUTION = 5,
+  DF_PANIC = 6,
+  DF_INTERNAL = 7
+} DfStatus;
+
+// --- Borrowed input views (caller owns the memory) -------------------------
+typedef struct {
+  const uint8_t* ptr;  // UTF-8, not NUL-terminated; may be null if len == 0
+  size_t len;
+} DfStr;
+
+typedef struct {
+  const uint8_t* ptr;  // may be null if len == 0
+  size_t len;
+} DfBytes;
+
+typedef struct {
+  DfStr key;
+  DfStr value;
+} DfKeyValue;
+
+// Opaque planned-scan handle.
+typedef struct DfScanHandle DfScanHandle;
+
+// --- Lifecycle / versioning ------------------------------------------------
+
+// ABI major version; compare before any other call.
+uint64_t df_scan_abi_version(void);
+
+// Free a message previously written to an out_err argument (null-safe).
+void df_error_free(char* err);
+
+// --- Scan API --------------------------------------------------------------
+
+// Probe a provider's output schema into the caller-allocated out_schema.
+int32_t df_scan_schema(DfStr provider, DfBytes options, DfBytes partition,
+                       struct ArrowSchema* out_schema, char** out_err);
+
+// Plan a scan. On success writes an owned handle to *out_handle (release with
+// df_scan_close). projection is an array of column-name DfStr (empty = all);
+// filters is an array of serialized datafusion.LogicalExprNode DfBytes;
+// target_partitions / batch_size <= 0 keep DataFusion defaults; limit < 0 means
+// no row limit.
+int32_t df_scan_create(DfStr provider, DfBytes options, DfBytes partition,
+                       int32_t target_partitions, int32_t batch_size, int64_t limit,
+                       const DfKeyValue* config_overrides, size_t config_overrides_len,
+                       const DfStr* projection, size_t projection_len,
+                       const DfBytes* filters, size_t filters_len,
+                       DfScanHandle** out_handle, char** out_err);

Review Comment:
   I think you can do this without the C symbols (just structs, like the Arrow C Data/Stream interfaces). That's nice because symbol uniqueness can be hard to guarantee (what if two Rust crates that each want to export a table provider are statically linked into the same pile of crates?). This is roughly how datafusion-python works with the generated FFI from datafusion-ffi.

   ```c
   struct DfScanHandle {
     int32_t (*partition_count)(const struct DfScanHandle* self);

     // partition ID -1 requests a coalesced stream
     int32_t (*execute)(const struct DfScanHandle* self, int32_t partition,
                        struct ArrowArrayStream* out_stream);

     const char* (*get_last_error)(struct DfScanHandle* self);

     void* private_data;
     void (*release)(struct DfScanHandle* self);
   };

   struct DfSimpleTableProvider {
     int32_t (*schema)(struct DfSimpleTableProvider* self, struct ArrowSchema* out_schema);

     int32_t (*scan_create)(DfStr provider, DfBytes options, DfBytes partition,
                            int32_t target_partitions, int32_t batch_size, int64_t limit,
                            const DfKeyValue* config_overrides, size_t config_overrides_len,
                            const DfStr* projection, size_t projection_len,
                            const DfBytes* filters, size_t filters_len,
                            DfScanHandle** out_handle);

     const char* (*get_last_error)(struct DfSimpleTableProvider* self);

     void* private_data;
     void (*release)(struct DfSimpleTableProvider* self);
   };
   ```

   For Spark, a TableProvider exporter could define an entrypoint and build a cdylib. The entrypoint would need a common signature (similar to how `AdbcInitFunc` works).
   ```c
   int32_t MyProviderInit(void* table_provider, const char* name, int32_t df_abi_version);
   ```

   ...but Python packages that export one of these could just work with capsules.


##########
native-jni/src/lib.rs:
##########
@@ -0,0 +1,238 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Thin JNI shim over the plain-C scan core (`datafusion-scan-ffi`).
+//!
+//! This is the JVM's path to the scan ABI. It is deliberately minimal: it
+//! marshals Java arguments (a `String` provider name and two `byte[]` blobs)
+//! into the in-process scan core, hands back an opaque handle as a `jlong`,
+//! and -- for the data plane -- writes a standard `FFI_ArrowArrayStream` (or
+//! `FFI_ArrowSchema`) into the address arrow-java allocated. **No Arrow data
+//! crosses the JNI boundary**: batches flow through the Arrow C Stream
+//! interface, which arrow-java imports with `Data.importArrayStream`.
+//!

Review Comment:
   Mostly just highlighting the key pieces here for future me: this is the JNI piece. I'm biased because I don't know JNI but I do know Arrow, so the fact that this version has only a few hundred lines of JNI appeals to me.


-- 
This is an automated message from the Apache Git Service.
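The struct-of-callbacks shape suggested in the first review comment can be sketched in a few lines of C. This is only a minimal illustration under stated assumptions, not the PR's ABI: `DemoProvider`, `DemoState`, and `DemoProviderInit` are hypothetical names, and the struct is trimmed to a single callback. It shows an entrypoint filling a caller-allocated struct and a `release` callback that marks the struct as released by nulling itself, the same convention the Arrow C Data interface uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical, trimmed-down provider struct (not the PR's actual ABI).
struct DemoProvider {
  int32_t (*partition_count)(const struct DemoProvider* self);
  void* private_data;                          // owned by the producer
  void (*release)(struct DemoProvider* self);  // NULL once released
};

// Producer-side state hidden behind private_data.
struct DemoState {
  int32_t partitions;
};

static int32_t demo_partition_count(const struct DemoProvider* self) {
  assert(self->release != NULL);  // calling into a released struct is a bug
  return ((const struct DemoState*)self->private_data)->partitions;
}

static void demo_release(struct DemoProvider* self) {
  free(self->private_data);
  self->private_data = NULL;
  self->release = NULL;  // "released" marker, as in the Arrow C Data interface
}

// Hypothetical entrypoint: fills the caller-allocated struct.
// Returns 0 on success, nonzero on failure (out is left untouched).
int32_t DemoProviderInit(struct DemoProvider* out, int32_t partitions) {
  if (out == NULL || partitions <= 0) return 1;
  struct DemoState* state = malloc(sizeof *state);
  if (state == NULL) return 1;
  state->partitions = partitions;
  out->partition_count = demo_partition_count;
  out->private_data = state;
  out->release = demo_release;
  return 0;
}
```

A consumer would call `DemoProviderInit(&p, 4)`, then `p.partition_count(&p)`, and finally `p.release(&p)`. Only the entrypoint name is a linker symbol here; everything else travels through function pointers, which is what lets several statically linked providers coexist without clashing.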
