[ https://issues.apache.org/jira/browse/HDFS-9758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15260423#comment-15260423 ]

James Clampffer commented on HDFS-9758:
---------------------------------------

Here are my thoughts on this; let me know what you think.  I haven't done much 
in Python 3.x, so some of my assumptions may not hold true there.

My thinking was to focus on supporting CPython via CTypes, at least initially.  
I have a patch where I hacked together a demo of how this could be done, which 
I'll dig up and post later today (it doesn't support iterable files or 
readline() and isn't optimized, but otherwise works well enough).  My overall 
opinion is that we should make it as easy as possible to access HDFS through 
Python, so less configuration and fewer dependencies are really important for 
getting people to use it.  Naturally, if some minor amount of configuration 
leads to a huge performance boost, then it's worth considering.
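To make the CTypes approach concrete, here's a minimal sketch of the pattern.  
Since the libhdfs++ C shim isn't pinned down yet, this uses libc's strlen as a 
stand-in; an actual binding would load the libhdfs++ shared library and declare 
its functions the same way:

```python
import ctypes
import ctypes.util

# Load a shared library by name.  For libhdfs++ this would be something
# like ctypes.CDLL("libhdfspp.so") once the C shim exists; libc is just
# a stand-in that's available everywhere.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature so ctypes marshals arguments and the return
# value correctly instead of guessing.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

def c_strlen(s: str) -> int:
    """Thin Python wrapper over the raw C call."""
    return libc.strlen(s.encode("utf-8"))

print(c_strlen("hdfs"))  # 4
```

The nice part is that this needs nothing beyond a stock CPython install; the 
whole binding is just a table of argtypes/restype declarations plus thin 
Python wrappers.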

I think CPython is the best place to focus simply because of its ubiquity.  
PyPy is a cool project but, as far as I know, doesn't come installed by default 
on many Linux distributions.  CPython ships with CTypes, so that's one less 
dependency to bring in (unless CFFI is also included as a default library), but 
as you said, you're pretty much stuck writing C wrapper functions for 
everything.  I don't think that's a dealbreaker, since forcing a C API walls 
off exceptions and other things that shouldn't be getting into the interpreter 
anyway.  Does Cython get you a whole lot of benefit over something like 
CTypes?  I don't have experience with it.
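On the "walling off exceptions" point, the idea is that errors cross the 
boundary as plain C error codes and get re-raised as Python exceptions on the 
Python side.  A rough sketch of the pattern, again using a POSIX call (access) 
as a stand-in for a hypothetical libhdfs++ C shim function:

```python
import ctypes
import ctypes.util
import errno
import os

# use_errno=True makes ctypes capture errno around each call so we can
# read it safely afterwards.
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.access.argtypes = [ctypes.c_char_p, ctypes.c_int]
libc.access.restype = ctypes.c_int

def checked_access(path: str) -> None:
    """Raise a Python exception if the C call reports failure.

    A libhdfs++ shim call would be checked the same way: the C layer
    catches any C++ exceptions and returns an error code, and only this
    wrapper turns it into a Python exception.  No C++ exceptions or ABI
    details ever reach the interpreter.
    """
    if libc.access(path.encode("utf-8"), os.F_OK) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)

try:
    checked_access("/no/such/path")
except OSError as e:
    print(e.errno == errno.ENOENT)  # True
```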

Boost.Python or a pure Python extension would most likely be the cleanest and 
most performant way of doing this sort of thing, at the expense of extra 
complexity.  I've also heard that Hadoop and Boost generally don't mix, but 
we've already made an exception for boost::asio (maybe that's different because 
it's header-only?).  The only concern I'd have with both is that they tie the 
module to the libhdfs++ C++ ABI, so we'd have to be careful about 
compatibility.  I could see writing a module being a big benefit, because then 
we could hook into the GC to properly support garbage-collected async 
operations.

I think it's important to make sure at least some of this work can help 
implement bindings for other languages, but I think most approaches would do 
that in one way or another.  I'm partial to building language-specific wrappers 
over the C API, just because most scripting languages have a way of calling C 
functions.

> libhdfs++: Implement Python bindings
> ------------------------------------
>
>                 Key: HDFS-9758
>                 URL: https://issues.apache.org/jira/browse/HDFS-9758
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>
> It'd be really useful to have bindings for various scripting languages.  
> Python would be a good start because of its popularity and how easy it is to 
> interact with shared libraries using the ctypes module.  I think bindings for 
> the V8 engine that nodeJS uses would be a close second in terms of expanding 
> the potential user base.
> Probably worth starting with just adding a synchronous API and building from 
> there to avoid interactions with Python's garbage collector until the 
> bindings prove to be solid.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
