I also agree with keeping it simple, particularly at this stage. We can
always go back and do the more efficient vectorized mapping at a later
point. Todd, to your point about it being simple: in my experience, Python
developers take many forms, and you will often find those who really
don't like to code and are just there for the capabilities of tools like
pandas. I think helper functions like this are good.
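As a rough illustration of what such a helper might look like, here is a
minimal sketch built on Greg's snippet. The name rows_to_dataframe is
hypothetical, and the sample rows stand in for the output of
scanner.read_all_tuples() so the sketch runs without a Kudu cluster.

```python
import pandas as pd


def rows_to_dataframe(rows, column_names, primary_keys):
    """Build a DataFrame from scanned rows, indexed by the primary key columns.

    rows          list of tuples, e.g. scanner.read_all_tuples()
    column_names  column names, e.g. table.schema.names
    primary_keys  key column names, e.g. table.schema.primary_keys()

    Hypothetical helper for illustration; not part of the Kudu Python client.
    """
    df = pd.DataFrame(rows, columns=list(column_names))
    keys = list(primary_keys)
    # Only set an index when key columns are given.
    return df.set_index(keys) if keys else df


# Stand-in data (assumption): two rows from a table keyed on "id".
rows = [(1, "a", 1.5), (2, "b", 2.5)]
df = rows_to_dataframe(rows, ["id", "name", "score"], ["id"])
print(df)
```

With a real table, the three arguments would come from the scanner and
schema calls in Greg's original snippet.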
Back to the approach, I think this is fair for now; PySpark does the same
"less efficient" mapping.
Greg, did you also intend to provide a mapping from pandas -> Kudu? Also, I
would take a look at implementing this at the scanner level too; I
think it could be useful for folks using the Scan Token API.
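For the pandas -> Kudu direction, a row-at-a-time sketch could look like
the following. It assumes the Python client's table.new_insert(dict),
session.apply(op) and session.flush() calls; write_dataframe, FakeTable
and FakeSession are hypothetical names, and the two fake classes exist
only so the sketch runs without a cluster.

```python
import pandas as pd


def write_dataframe(df, table, session):
    """Insert each DataFrame row into a Kudu table and return the row count.

    Key columns must be ordinary columns, so call df.reset_index() first
    if they live in the index. Hypothetical helper for illustration.
    """
    count = 0
    for record in df.to_dict(orient="records"):
        # Convert NumPy scalars to plain Python values where possible.
        row = {k: (v.item() if hasattr(v, "item") else v)
               for k, v in record.items()}
        session.apply(table.new_insert(row))
        count += 1
    session.flush()
    return count


class FakeTable:
    """Stand-in for a Kudu table (assumption, for the demo only)."""

    def new_insert(self, row):
        return ("insert", row)


class FakeSession:
    """Stand-in for a Kudu session that records applied operations."""

    def __init__(self):
        self.applied = []
        self.flushed = False

    def apply(self, op):
        self.applied.append(op)

    def flush(self):
        self.flushed = True


df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
session = FakeSession()
written = write_dataframe(df, FakeTable(), session)
print(written, session.applied)
```

This is the same "less efficient" per-row style as the read path; a
vectorized version could replace the loop later without changing the
helper's signature.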
On Fri, Oct 14, 2016 at 2:12 AM Todd Lipcon <t...@cloudera.com> wrote:
> On Thu, Oct 13, 2016 at 11:01 PM, Greg Kocunik <g.kocu...@gmail.com>
> > Hello,
> > I would like to contribute pandas support in the Python API.
> > There is a JIRA ticket <https://issues.apache.org/jira/browse/KUDU-1276>
> > regarding this; however, the level is quite technical and beyond my current
> > abilities.
> > I would like to get consensus on whether you are open to simpler solutions
> > in the interim.
> > To give you an idea, I was looking at doing something along the lines of:
> > import pandas as pd
> > scanner = table.scanner()
> > scanner.open()
> > data = scanner.read_all_tuples()
> > pd.DataFrame(data, columns=table.schema.names).set_index(table.schema.primary_keys())
> > Please let me know if such solutions are welcome.
> I'm always in favor of simple, but one question: if it's that simple then
> what's the purpose of having the explicit support, versus asking people to
> write the simple snippet?
> Justin Birdsell probably has a good opinion here since he's way more active
> than I am on Python.
> Todd Lipcon
> Software Engineer, Cloudera