Re: [Rust] [DISCUSS] Donate DataFusion to Arrow project

2019-01-06 Thread Ted Dunning
Cool! On Sun, Jan 6, 2019 at 1:52 PM Andy Grove wrote: > I'm starting a new thread for this discussion (this was previously > discussed in the Rust Roadmap thread). > > The reason I got involved with Arrow is that I have been working on > DataFusion[1] which is currently an in-process SQL quer

Re: Algorithmic explorations of bitmaps vs. sentinel values

2018-10-17 Thread Ted Dunning
Memory binding can be viewed as an opportunity for melding multiple aggregators. For instance, any additional aggregation comes nearly for free. Sum and count (non-zero) together will cost about the same as either alone. Or sum and sum of squares. On Wed, Oct 17, 2018, 06:21 Francois Saint-Jacques < fsaintjacq..
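A minimal sketch of the melding idea, assuming a plain Python list stands in for a column of values. The function name `fused_aggregates` is hypothetical and not from the thread. The point is only that each value is loaded once and feeds every accumulator, so the extra aggregates add arithmetic but no extra passes over memory.

```python
def fused_aggregates(values):
    """Compute sum, non-zero count, and sum of squares in one pass."""
    total = 0
    nonzero = 0
    squares = 0
    for v in values:
        # Each value is read once and reused by all three accumulators.
        total += v
        nonzero += (v != 0)   # bool counts as 0 or 1
        squares += v * v
    return total, nonzero, squares


total, nonzero, squares = fused_aggregates([3, 0, 4, 0, 5])
```

In an interpreted loop like this the memory-bound argument is only illustrative; the saving the message describes would show up in compiled, vectorized kernels.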

Re: Algorithmic explorations of bitmaps vs. sentinel values

2018-10-16 Thread Ted Dunning
It should be possible to unroll the sentinel version in many cases. For instance, sum += (data[i] != SENTINEL) * data[i] This doesn't work with NaN as a sentinel because 0 * NaN => NaN, but it can work with other values. On Tue, Oct 16, 2018 at 9:38 AM Antoine Pitrou wrote: > > Hi Wes, >
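The multiply-by-mask trick can be sketched as follows, with the mask equal to 1 for ordinary values and 0 for the sentinel so that sentinel entries contribute nothing. The sentinel value `-1` and the function names are assumptions chosen for illustration. The second function shows why the same trick breaks down when NaN is the sentinel.

```python
import math

SENTINEL = -1  # hypothetical sentinel value, chosen only for this sketch


def masked_sum(data, sentinel=SENTINEL):
    """Branch-free sum: sentinel entries are multiplied by 0."""
    total = 0
    for x in data:
        total += (x != sentinel) * x
    return total


def masked_sum_nan(data):
    """Same trick with NaN as the sentinel; the NaN still leaks through."""
    total = 0.0
    for x in data:
        # The mask is 0 for NaN, but 0 * NaN is NaN, so the sum is poisoned.
        total += (not math.isnan(x)) * x
    return total


clean = masked_sum([1, SENTINEL, 2, SENTINEL, 3])
poisoned = masked_sum_nan([1.0, float("nan"), 2.0])
```

Here `clean` is the sum of the non-sentinel values, while `poisoned` comes out as NaN even though the mask was zero for the NaN entry.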

Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Ted Dunning
The effect of rename can be had by handling a small inventory file that is updated atomically. Having real file semantics is sooo much nicer, though. On Wed, Sep 19, 2018 at 1:51 PM Bill Glennon wrote: > Also, may want to take a look at https://aws.amazon.com/athena/. > > Thanks, > Bill > > O
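One way to picture the inventory-file approach is the sketch below. It is written against a local filesystem, where `os.replace` supplies the atomic update; on an object store the atomic step would have to be whatever whole-object write that store provides, which is an assumption here and not something the thread spells out. The file names and the JSON layout are hypothetical.

```python
import json
import os
import tempfile


def read_inventory(path):
    """Return the list of published data files, or [] if none yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return []


def publish(path, new_files):
    """Add data files to the inventory with one atomic replace.

    Readers that only trust the inventory never see a half-written
    listing: they get either the old inventory or the new one.
    """
    files = read_inventory(path) + list(new_files)
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new inventory next to the old one, then swap it in.
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(files, f)
    os.replace(tmp, path)
    return files
```

Data files are written first under their final names and become visible to readers only once `publish` lists them, which gives the effect of a rename without renaming anything.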

Re: [DISCUSS] Dropping support for CentOS 5 / RHEL5 in Python packages

2018-09-04 Thread Ted Dunning
Just as a point of reference, I don't think that we get any pushback at MapR for not supporting RHEL 5 and that has been our policy for a few years now. That experience should be pretty similar for Arrow, except that I would expect that new adoptions might be even more canted towards current versions

Re: Increasing transparency of corporate support for Apache Arrow development

2018-08-16 Thread Ted Dunning
Yes, there are several such examples. And it turned into a monstrous mess with companies bragging over lines of code changed. Oddly, the guys who did lots of reformatting did really well. There is also the problem of the very strong Apache tradition that it is individuals who contribute to project

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-30 Thread Ted Dunning
On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney wrote: > > > The community will be less willing to accept large > > changes that require multiple rounds of patches for stability and API > > convergence. Our contributions to Libhdfs++ in the HDFS community took a > > significantly long time for the v

Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-24 Thread Ted Dunning
Simba, nice summary. I think that there may be some issues with your tests. In particular, you are storing essentially uniform random values. That might be a viable test in some situations, but there are many where there is considerably less entropy in the data being stored. For instance, if you store
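The entropy point can be checked with a small experiment. This sketch uses `zlib` from the Python standard library as a stand-in, not Snappy, Brotli, or Blosc from the thread, and the data sizes, seed, and four-symbol alphabet are arbitrary assumptions. It compares the compression ratio of uniform random bytes with that of bytes drawn from a much smaller set of values.

```python
import random
import zlib


def compression_ratio(payload):
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(payload)) / len(payload)


rng = random.Random(0)  # fixed seed so the experiment is repeatable
n = 100_000

# Uniform random bytes: close to 8 bits of entropy per byte.
high_entropy = bytes(rng.getrandbits(8) for _ in range(n))

# Only four distinct byte values: about 2 bits of entropy per byte.
low_entropy = bytes(rng.choice(b"\x00\x01\x02\x03") for _ in range(n))

ratio_random = compression_ratio(high_entropy)
ratio_skewed = compression_ratio(low_entropy)
```

Uniform random input leaves a compressor almost nothing to work with, so a benchmark built only on such data says little about codecs on real columns with repeated or narrowly distributed values.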

Re: Say no to zero length batches...

2017-04-14 Thread Ted Dunning
Speaking as a relative outsider, having the boundary cases for a transfer protocol be MORE restrictive than the senders and receivers is asking for boundary bugs. In this case, both the senders and receivers think that the boundary is 0 (empty lists, empty data frames, 0 results from a database). H

Re: Java-C++ integration tests -- on the home stretch

2016-11-21 Thread Ted Dunning
Wes, This is awesome. Does it, however, imply that to run the tests a C programmer will need a working Java environment and a Java programmer will need a C environment? Is there any way around that? Possibly by storing golden bits for the in-memory images somewhere? On Mon, Nov 21, 201

Re: Can someone help me how should I start using Arrow Java Jars ?

2016-08-01 Thread Ted Dunning
Cloning the git repository gives you the full source code. On Mon, Aug 1, 2016 at 8:11 AM, Sanjay Rao wrote: > Hi Kiril, > Thanks a lot for your reply, Can I have the full source code ? It would > help me, also could you help me with Java doc link if any as such. > Thanks again,Sanjay > > > Fr

Re: Code review tools for Arrow patches

2016-04-24 Thread Ted Dunning
Just for the record, Apex had some issues getting Gerrit reviews reflected in a coherent fashion into the Apache record. I presume that you guys will have that handled or will check with the Apex devs to learn their resolution. On Sun, Apr 24, 2016 at 5:30 PM, Wes McKinney wrote: > Todd, woul

Re: Understanding "shared" memory implications

2016-03-18 Thread Ted Dunning
On Tue, Mar 15, 2016 at 5:54 PM, Jacques Nadeau wrote: > How do others feel of my redefinition of IPC to mean the same memory space > communication (either via shared memory or rdma) versus RPC as socket based > communication? > IPC already has a strong definition which is close to what you wan

Re: Code reviews / commit-then-review?

2016-03-03 Thread Ted Dunning
It's not like you are going to break an existing release. On Thu, Mar 3, 2016 at 3:11 PM, Julien Le Dem wrote: > sounds good. > > On Thu, Mar 3, 2016 at 1:17 PM, Jason Altekruse > wrote: > > > +1 > > > > On Thu, Mar 3, 2016 at 12:58 PM, Jacques Nadeau > > wrote: > > > > > +1. Sounds good to

Re: pandas and Apache Arrow in context

2016-02-22 Thread Ted Dunning
Put that answer on the front page of the web site. Well said. On Mon, Feb 22, 2016 at 2:05 PM, Wes McKinney wrote: > hi Stuart, > > Currently pandas and NumPy only support flat, non-nested data. Nested > data includes column value types including arrays, structs, maps, and > unions. This enab