As far as Spark/Spark SQL is concerned: there is now a very nice
high-level API, currently in the process of being open sourced, called
"distributed dataframe" (DDF): http://ddf.io/. It's Java-based but also
has R and Python interfaces. You could wrap it from Julia via JavaCall or
via PyCall.
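
For the simple in-memory case discussed further down the thread, the kind of wrapper John describes could look roughly like the sketch below. It is written in Python only to keep it short, and it is an illustration under assumptions, not anything DataFrames.jl provides: `IndexedTable`, `add_index`, `find` and `row` are hypothetical names, the table is a plain dict of columns, and the index is a hash index (equality lookups only) rather than the tree index mentioned in the original question.

```python
from collections import defaultdict


class IndexedTable:
    """Hypothetical wrapper: a column-oriented table plus per-column hash indices."""

    def __init__(self, columns):
        # columns: dict mapping column name -> list of values (all the same length)
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = columns
        self.indices = {}  # column name -> {value -> [row numbers]}

    def add_index(self, name):
        # Build a value -> row-numbers map for one column.
        # Any number of columns can be indexed independently.
        index = defaultdict(list)
        for row, value in enumerate(self.columns[name]):
            index[value].append(row)
        self.indices[name] = dict(index)

    def find(self, name, value):
        # Use the index when one exists, otherwise fall back to a linear scan.
        if name in self.indices:
            return self.indices[name].get(value, [])
        return [row for row, v in enumerate(self.columns[name]) if v == value]

    def row(self, i):
        # Materialise row i as a dict of column name -> value.
        return {name: values[i] for name, values in self.columns.items()}


table = IndexedTable({
    "id": [1, 2, 3, 4],
    "city": ["Gent", "Paris", "Gent", "Oslo"],
})
table.add_index("id")
table.add_index("city")
matches = table.find("city", "Gent")
```

Range queries would need a sorted or tree-based structure instead of the dict, and the indices here are not updated if the underlying columns change after `add_index` is called.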

On Monday, September 8, 2014 1:32:28 AM UTC+2, Steven Sagaert wrote:
>
>
>
> On Sunday, September 7, 2014 7:28:18 PM UTC+2, Harlan Harris wrote:
>>
>> This was a feature that sorta existed for a while (see 
>> https://github.com/JuliaStats/DataFrames.jl/issues/24 ), but nobody was 
>> very happy with it, and I think John ripped it out as part of one of his 
>> simplification passes. It's tricky to think about how best to implement 
>> this sort of feature when you aspirationally want to support memory-mapped 
>> and distributed structures too,
>>
> I was thinking more along the lines of a simple in-memory DB. If you want 
> out-of-memory and distributed, it's probably best to interface with systems 
> like Spark SQL or SciDB rather than develop that yourselves from scratch. 
> Maybe write something in the spirit of Blaze (blaze.pydata.org)? Right now 
> Blaze supports Spark, but I was just discussing SciDB with them and 
> they are also looking into that.
>  
>
>> and where you want a semantics that's explicitly set-like, cf Pandas or 
>> R's data.tables. 
>>
> R's data.table is nice but unfortunately supports only one index. 
>
>>
>> Also worth thinking about this in the context of John's just-announced 
>> goals: https://gist.github.com/johnmyleswhite/ad5305ecaa9de01e317e
>>
>>
>>
>> On Sun, Sep 7, 2014 at 12:54 PM, John Myles White <johnmyl...@gmail.com> 
>> wrote:
>>
>>> No, DataFrames are not indexed. For now, you’d need to build a wrapper 
>>> that indexes a DataFrame to get that kind of functionality.
>>>
>>>  — John
>>>
>>> On Sep 7, 2014, at 9:53 AM, Steven Sagaert <steven....@gmail.com> wrote:
>>>
>>> > Hi,
>>> > I was wondering if searching in a DataFrame is indexed (in the DB 
>>> > sense, not the array sense, e.g. a tree index structure) or not? If so, 
>>> > can you have multiple indices (on multiple columns) or not?
>>>
>>>
>>
