Kimahriman commented on issue #5638:
URL: https://github.com/apache/arrow-rs/issues/5638#issuecomment-2061002935

   > 2. Second approach is to write native rust hdfs library and I believe 
@Kimahriman https://github.com/Kimahriman/hdfs-native is on the right track. I 
haven't use the library and cant tell how performant it is but IMHO it looks 
he's on the right track.
   
   Thanks for the call out! I agree there's no need to have HDFS support 
directly in this repo since the trait is public and it's a tricky thing to 
support. I actually have an object_store implementation on top of my library 
already 
https://github.com/Kimahriman/hdfs-native/tree/master/crates/hdfs-native-object-store.
 
   
   I've gotten pretty far with it at this point. I have some benchmarks that 
show reading/writing is at least on par with the libhdfs-based client, and RPC 
calls are even faster. I suspect the comparison would look even better in real 
scenarios, since the JVM client makes heavy use of multi-threading, which 
gives it an edge in single-task benchmarks over my async setup.
   
   The only major feature I'm tracking that is not supported right now is file 
encryption support via KMS. I'm not sure how widely that is used. The other 
limitations right now are:
   - It dynamically links to `libgssapi_krb5` native lib (via the `libgssapi` 
crate), which makes cross compiling tricky/impossible with Kerberos support. I 
know there are other libs (like compression libraries) that I think use their 
native implementation, so I'd be curious how those work for cross compiling 
(compiled and statically linked instead of dynamically linked?).
   - Reading and writing data isn't quite as resilient to failures as the Java 
client right now. Reading was a bit of an oversight that I'm trying to fix now; 
writing is more complicated, so it's currently just a "retry the whole thing if 
it fails" setup.
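   
   To illustrate the "retry the whole thing if it fails" approach to writes, 
here is a minimal sketch using only the Rust standard library. It is not code 
from `hdfs-native`: the `write_with_retry` helper, its `max_attempts` 
parameter, and the simulated failure are all hypothetical names chosen for 
illustration. The closure stands in for "write the entire file from the start".

```rust
use std::io;

/// Hypothetical helper (not part of hdfs-native): run a whole-file write,
/// restarting it from scratch on any error, up to `max_attempts` times.
/// Returns the attempt number that succeeded, or the last error seen.
fn write_with_retry<F>(max_attempts: u32, mut write_whole_file: F) -> io::Result<u32>
where
    F: FnMut() -> io::Result<()>,
{
    let mut last_err = None;
    for attempt in 1..=max_attempts {
        match write_whole_file() {
            Ok(()) => return Ok(attempt),
            // No partial recovery: remember the error and redo the full write.
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err.unwrap_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "max_attempts must be at least 1")
    }))
}

fn main() {
    // Simulate a write that fails twice (e.g. a lost datanode) and then succeeds.
    let mut calls = 0;
    let result = write_with_retry(3, || {
        calls += 1;
        if calls < 3 {
            Err(io::Error::new(io::ErrorKind::BrokenPipe, "simulated write failure"))
        } else {
            Ok(())
        }
    });
    println!("{:?}", result); // prints "Ok(3)"
}
```

   The trade-off in a scheme like this is simplicity over efficiency: every 
failure costs a full rewrite, whereas a more resilient client would try to 
recover the write in place.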
   
   It's also not super heavily battle tested in various HDFS setups, but I 
haven't heard much yet of things not working for the few people who might be 
using it.
   
   I've been meaning to try to get it integrated with `delta-rs` but haven't 
gotten around to it. Ideally I want it included in the Python wheels, and 
the libgssapi thing has had me stuck for a while.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
