Re: [Discuss] Compatibility Guarantees and Versioning Post "1.0.0"

2019-07-13 Thread Micah Kornfield
SGTM, could you or another PMC member start one?

Thanks,
Micah

On Saturday, July 13, 2019, Wes McKinney  wrote:

> Micah -- I would suggest that -- absent more opinions -- we vote about
> adopting the versioning scheme you described here (Format Version and
> Library Version)
>
> On Wed, Jul 10, 2019 at 8:46 AM Wes McKinney  wrote:
> >
> > On Wed, Jul 10, 2019 at 12:43 AM Micah Kornfield 
> wrote:
> > >
> > > Hi Eric,
> > > Short answer: I think your understanding matches what I was
> proposing.  Longer answer below.
> > >
> > >> So, for example, we release library v1.0.0 in a few months and then
> library v2.0.0 a few months after that.  In v2.0.0, C++, Python, and Java
> didn't make any breaking API changes from 1.0.0. But C# made 3 API breaking
> changes. This would be acceptable?
> > >
> > > Yes.  I think all language bindings are undergoing rapid enough
> iteration that we are making at least a few small breaking API changes on
> each release even though we try to avoid it.  I think it will be worth
> having further discussions on the release process once at least a few
> languages get to a more stable point.
> > >
> >
> > I agree with this. I think we are a pretty long ways away from making
> > API stability _guarantees_ in any of the implementations, though we
> > certainly should try to be courteous about the changes we do make, to
> > allow for graceful transitions over a period of 1-2 releases if
> > possible.
> >
> > > Thanks,
> > > Micah
> > >
> > > On Tue, Jul 9, 2019 at 2:26 PM Eric Erhardt <
> eric.erha...@microsoft.com> wrote:
> > >>
> > >> Just to be sure I fully understand the proposal:
> > >>
> > >> For the Library Version, we are going to increment the MAJOR version
> on every normal release, and increment the MINOR version if we need to
> release a patch/bug fix type of release.
> > >>
> > >> Since SemVer allows for API breaking changes on MAJOR versions, this
> basically means, each library (C++, Python, C#, Java, etc) _can_ introduce
> API breaking changes on every normal release (like we have been with the
> 0.x.0 releases).
> > >>
> > >> So, for example, we release library v1.0.0 in a few months and then
> library v2.0.0 a few months after that.  In v2.0.0, C++, Python, and Java
> didn't make any breaking API changes from 1.0.0. But C# made 3 API breaking
> changes. This would be acceptable?
> > >>
> > >> If my understanding above is correct, then I think this is a good
> plan. Initially I was concerned that the C# library wouldn't be free to
> make API breaking changes once the version is `1.0.0`. The C# library
> is still pretty inadequate, and I have a feeling there are a few things
> that will need to change about it in the future. But with the above plan,
> this concern won't be a problem.
> > >>
> > >> Eric
> > >>
> > >> -Original Message-
> > >> From: Micah Kornfield 
> > >> Sent: Monday, July 1, 2019 10:02 PM
> > >> To: Wes McKinney 
> > >> Cc: dev@arrow.apache.org
> > >> Subject: Re: [Discuss] Compatibility Guarantees and Versioning Post
> "1.0.0"
> > >>
> > >> Hi Wes,
> > >> Thanks for your response.  In regards to the protocol negotiation
> your description of feature reporting (snipped below) is along the lines of
> what I was thinking.  It might not be necessary for 1.0.0, but at some
> point might become useful.
> > >>
> > >>
> > >> >  Note that we don't really have a mechanism for clients and servers
> to
> > >> > report to each other what features they support, so this could help
> > >> > with that for applications where it might matter.
> > >>
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >>
> > >> On Mon, Jul 1, 2019 at 12:54 PM Wes McKinney 
> wrote:
> > >>
> > >> > hi Micah,
> > >> >
> > >> > Sorry for the delay in feedback. I looked at the document and it
> seems
> > >> > like a reasonable perspective about forward- and
> > >> > backward-compatibility.
> > >> >
> > >> > It seems like the main thing you are proposing is to apply Semantic
> > >> > Versioning to Format and Library versions separately. That's an
> > >> > interesting idea, my thought had been to have a version number that
> is
> > >> > FORMAT_VERSION.LIBRARY_VERSION.PATCH_VERSION. But your proposal is
> > >> > more flexible in some ways, so let me clarify for others reading
> > >> >
> > >> > In what you are proposing, the next release would be:
> > >> >
> > >> > Format version: 1.0.0
> > >> > Library version: 1.0.0
> > >> >
> > >> > Suppose that 20 major versions down the road we stand at
> > >> >
> > >> > Format version: 1.5.0
> > >> > Library version: 20.0.0
> > >> >
> > >> > The minor version of the Format would indicate that there are
> > >> > additions, like new elements in the Type union, but otherwise
> backward
> > >> > and forward compatible. So the Minor version means "new things, but
> > >> > old clients will not be disrupted if those new things are not used".
> > >> > We've already been doing this since the V4 Format iteration but we
> > >> > have not had a way to 

[jira] [Created] (ARROW-5946) [Rust] [DataFusion] Projection push down with aggregate producing incorrect results

2019-07-13 Thread Andy Grove (JIRA)
Andy Grove created ARROW-5946:
-

 Summary: [Rust] [DataFusion] Projection push down with aggregate 
producing incorrect results
 Key: ARROW-5946
 URL: https://issues.apache.org/jira/browse/ARROW-5946
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Affects Versions: 0.14.0
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 1.0.0


I was testing some queries with the 0.14 release and noticed that the projected 
schema for a table scan is completely wrong (however the results of the query 
are not necessarily wrong).

 
{code:java}
// schema for nyctaxi csv files
let schema = Schema::new(vec![
    Field::new("VendorID", DataType::Utf8, true),
    Field::new("tpep_pickup_datetime", DataType::Utf8, true),
    Field::new("tpep_dropoff_datetime", DataType::Utf8, true),
    Field::new("passenger_count", DataType::Utf8, true),
    Field::new("trip_distance", DataType::Float64, true),
    Field::new("RatecodeID", DataType::Utf8, true),
    Field::new("store_and_fwd_flag", DataType::Utf8, true),
    Field::new("PULocationID", DataType::Utf8, true),
    Field::new("DOLocationID", DataType::Utf8, true),
    Field::new("payment_type", DataType::Utf8, true),
    Field::new("fare_amount", DataType::Float64, true),
    Field::new("extra", DataType::Float64, true),
    Field::new("mta_tax", DataType::Float64, true),
    Field::new("tip_amount", DataType::Float64, true),
    Field::new("tolls_amount", DataType::Float64, true),
    Field::new("improvement_surcharge", DataType::Float64, true),
    Field::new("total_amount", DataType::Float64, true),
]);

let mut ctx = ExecutionContext::new();
ctx.register_csv("tripdata", "file.csv", &schema, true);

let optimized_plan = ctx.create_logical_plan(
    "SELECT passenger_count, MIN(fare_amount), MAX(fare_amount) \
     FROM tripdata GROUP BY passenger_count").unwrap();{code}
 





Re: [Discuss] Compatibility Guarantees and Versioning Post "1.0.0"

2019-07-13 Thread Wes McKinney
Micah -- I would suggest that -- absent more opinions -- we vote about
adopting the versioning scheme you described here (Format Version and
Library Version)

On Wed, Jul 10, 2019 at 8:46 AM Wes McKinney  wrote:
>
> On Wed, Jul 10, 2019 at 12:43 AM Micah Kornfield  
> wrote:
> >
> > Hi Eric,
> > Short answer: I think your understanding matches what I was proposing.  
> > Longer answer below.
> >
> >> So, for example, we release library v1.0.0 in a few months and then 
> >> library v2.0.0 a few months after that.  In v2.0.0, C++, Python, and Java 
> >> didn't make any breaking API changes from 1.0.0. But C# made 3 API 
> >> breaking changes. This would be acceptable?
> >
> > Yes.  I think all language bindings are undergoing rapid enough iteration 
> > that we are making at least a few small breaking API changes on each 
> > release even though we try to avoid it.  I think it will be worth having 
> > further discussions on the release process once at least a few languages 
> > get to a more stable point.
> >
>
> I agree with this. I think we are a pretty long ways away from making
> API stability _guarantees_ in any of the implementations, though we
> certainly should try to be courteous about the changes we do make, to
> allow for graceful transitions over a period of 1-2 releases if
> possible.
>
> > Thanks,
> > Micah
> >
> > On Tue, Jul 9, 2019 at 2:26 PM Eric Erhardt  
> > wrote:
> >>
> >> Just to be sure I fully understand the proposal:
> >>
> >> For the Library Version, we are going to increment the MAJOR version on 
> >> every normal release, and increment the MINOR version if we need to 
> >> release a patch/bug fix type of release.
> >>
> >> Since SemVer allows for API breaking changes on MAJOR versions, this 
> >> basically means, each library (C++, Python, C#, Java, etc) _can_ introduce 
> >> API breaking changes on every normal release (like we have been with the 
> >> 0.x.0 releases).
> >>
> >> So, for example, we release library v1.0.0 in a few months and then 
> >> library v2.0.0 a few months after that.  In v2.0.0, C++, Python, and Java 
> >> didn't make any breaking API changes from 1.0.0. But C# made 3 API 
> >> breaking changes. This would be acceptable?
> >>
> >> If my understanding above is correct, then I think this is a good plan. 
> >> Initially I was concerned that the C# library wouldn't be free to make API 
> >> breaking changes once the version is `1.0.0`. The C# library is still 
> >> pretty inadequate, and I have a feeling there are a few things that will 
> >> need to change about it in the future. But with the above plan, this 
> >> concern won't be a problem.
> >>
> >> Eric
> >>
> >> -Original Message-
> >> From: Micah Kornfield 
> >> Sent: Monday, July 1, 2019 10:02 PM
> >> To: Wes McKinney 
> >> Cc: dev@arrow.apache.org
> >> Subject: Re: [Discuss] Compatibility Guarantees and Versioning Post "1.0.0"
> >>
> >> Hi Wes,
> >> Thanks for your response.  In regards to the protocol negotiation your 
> >> description of feature reporting (snipped below) is along the lines of 
> >> what I was thinking.  It might not be necessary for 1.0.0, but at some 
> >> point might become useful.
> >>
> >>
> >> >  Note that we don't really have a mechanism for clients and servers to
> >> > report to each other what features they support, so this could help
> >> > with that for applications where it might matter.
> >>
> >>
> >> Thanks,
> >> Micah
> >>
> >>
> >> On Mon, Jul 1, 2019 at 12:54 PM Wes McKinney  wrote:
> >>
> >> > hi Micah,
> >> >
> >> > Sorry for the delay in feedback. I looked at the document and it seems
> >> > like a reasonable perspective about forward- and
> >> > backward-compatibility.
> >> >
> >> > It seems like the main thing you are proposing is to apply Semantic
> >> > Versioning to Format and Library versions separately. That's an
> >> > interesting idea, my thought had been to have a version number that is
> >> > FORMAT_VERSION.LIBRARY_VERSION.PATCH_VERSION. But your proposal is
> >> > more flexible in some ways, so let me clarify for others reading
> >> >
> >> > In what you are proposing, the next release would be:
> >> >
> >> > Format version: 1.0.0
> >> > Library version: 1.0.0
> >> >
> >> > Suppose that 20 major versions down the road we stand at
> >> >
> >> > Format version: 1.5.0
> >> > Library version: 20.0.0
> >> >
> >> > The minor version of the Format would indicate that there are
> >> > additions, like new elements in the Type union, but otherwise backward
> >> > and forward compatible. So the Minor version means "new things, but
> >> > old clients will not be disrupted if those new things are not used".
> >> > We've already been doing this since the V4 Format iteration but we
> >> > have not had a way to signal that there may be new features. As a
> >> > corollary to this, I wonder if we should create a dual version in the
> >> > metadata
> >> >
> >> > PROTOCOL VERSION: (what is currently MetadataVersion, V2) FEATURE
> >> > VERSION: not 

Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-13 Thread Wes McKinney
On Sat, Jul 13, 2019 at 12:57 PM Wes McKinney  wrote:
>
> OK, that's been merged and updated. Here's a Crossbow build
>

https://github.com/ursa-labs/crossbow/branches/all?utf8=%E2%9C%93=build-665

I'll keep an eye on CI. If there's anything else I can do to help get an RC out,
please let me know.


Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-13 Thread Wes McKinney
Sorry, spoke too soon -- https://github.com/apache/arrow/pull/4856 is
the last patch to go in; I'm reviewing that now

On Sat, Jul 13, 2019 at 12:06 PM Wes McKinney  wrote:
>
> Thanks Kou.
>
> I've updated the patch release script [1], pushed the maint-0.14.x
> branch [2], and just submitted a Crossbow packaging run [3]
>
> If all looks good, I think this branch can be used to create an RC
>
> [1]: https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014
> [2]: https://github.com/apache/arrow/tree/maint-0.14.x
> [3]: 
> https://github.com/ursa-labs/crossbow/branches/all?utf8=%E2%9C%93=build-664
>
> On Fri, Jul 12, 2019 at 5:22 PM Sutou Kouhei  wrote:
> >
> > Hi,
> >
> > I've created pull requests that were used to release 0.14.0:
> >
> > ARROW-5937: [Release] Stop parallel binary upload
> > https://github.com/apache/arrow/pull/4868
> >
> > ARROW-5938: [Release] Create branch for adding release note automatically
> > https://github.com/apache/arrow/pull/4869
> >
> > ARROW-5939: [Release] Add support for generating vote email template 
> > separately
> > https://github.com/apache/arrow/pull/4870
> >
> > ARROW-5940: [Release] Add support for re-uploading sign/checksum for binary 
> > artifacts
> > https://github.com/apache/arrow/pull/4871
> >
> > ARROW-5941: [Release] Avoid re-uploading already uploaded binary artifacts
> > https://github.com/apache/arrow/pull/4872
> > (This will conflict with https://github.com/apache/arrow/pull/4868 .)
> >
> >
> > They will be useful to release 0.14.1.
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, 
> > Parquet forward compatibility problems" on Fri, 12 Jul 2019 13:27:41 -0500,
> >   Wes McKinney  wrote:
> >
> > > I updated https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014
> > > to include all the cited patches, as well as the Parquet forward
> > > compatibility fix.
> > >
> > > I'm waiting on CI to be able to pass ARROW-5921 (fuzzing-discovered
> > > IPC crash) and the ARROW-5889 (Parquet backwards compatibility with
> > > 0.13) needs to be rebased
> > >
> > > https://github.com/apache/arrow/pull/4856
> > >
> > > I think those are the last 2 patches that should go into the branch
> > > unless something else comes up. Once those land I'll update the
> > > commands and then push up the patch release branch (hopefully
> > > everything will cherry pick cleanly)
> > >
> > > On Fri, Jul 12, 2019 at 12:34 PM Francois Saint-Jacques
> > >  wrote:
> > >>
> > >> There's also ARROW-5921 (I tagged it 0.14.1) if it passes travis. This
> > >> one fixes a segfault found via fuzzing.
> > >>
> > >> François
> > >>
> > >> On Fri, Jul 12, 2019 at 6:54 AM Krisztián Szűcs
> > >>  wrote:
> > >> >
> > >> > PRs touching the wheel packaging scripts:
> > >> > - https://github.com/apache/arrow/pull/4828 (lz4)
> > >> > - https://github.com/apache/arrow/pull/4833 (uriparser - only if
> > >> > https://github.com/apache/arrow/commit/88fcb096c4f24861bc7f8181cba1ad8be0e4048a
> > >> > is cherry picked as well)
> > >> > - https://github.com/apache/arrow/pull/4834 (zlib)
> > >> >
> > >> > On Fri, Jul 12, 2019 at 11:49 AM Hatem Helal  
> > >> > wrote:
> > >> >
> > >> > > Thanks François, I closed PARQUET-1623 this morning.  It would be 
> > >> > > nice to
> > >> > > include the PR in the patch release:
> > >> > >
> > >> > > https://github.com/apache/arrow/pull/4857
> > >> > >
> > >> > > This bug has been around for a few releases but I think it should be 
> > >> > > a low
> > >> > > risk change to include.
> > >> > >
> > >> > > Hatem
> > >> > >
> > >> > >
> > >> > > On 7/12/19, 2:27 AM, "Francois Saint-Jacques" 
> > >> > > 
> > >> > > wrote:
> > >> > >
> > >> > > I just merged PARQUET-1623, I think it's worth inserting since it
> > >> > > fixes an invalid memory write. Note that I couldn't 
> > >> > > resolve/close the
> > >> > > parquet issue, do I have to be a contributor to the project?
> > >> > >
> > >> > > François
> > >> > >
> > >> > > On Thu, Jul 11, 2019 at 6:10 PM Wes McKinney 
> > >> > > 
> > >> > > wrote:
> > >> > > >
> > >> > > > I just merged Eric's 2nd patch ARROW-5908 and I went through 
> > >> > > all the
> > >> > > > patches since the release commit and have come up with the 
> > >> > > following
> > >> > > > list of 32 fix-only patches to pick into a maintenance branch:
> > >> > > >
> > >> > > > https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014
> > >> > > >
> > >> > > > Note there's still unresolved Parquet forward/backward 
> > >> > > compatibility
> > >> > > > issues in C++ that we haven't merged patches for yet, so that 
> > >> > > is
> > >> > > > pending.
> > >> > > >
> > >> > > > Are there any other patches / JIRA issues people would like to 
> > >> > > see
> > >> > > > resolved in a patch release?
> > >> > > >
> > >> > > > Thanks
> > >> > > >
> > >> > > > On Thu, Jul 11, 2019 at 3:03 PM Wes 

Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-13 Thread Wes McKinney
Thanks Kou.

I've updated the patch release script [1], pushed the maint-0.14.x
branch [2], and just submitted a Crossbow packaging run [3]

If all looks good, I think this branch can be used to create an RC

[1]: https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014
[2]: https://github.com/apache/arrow/tree/maint-0.14.x
[3]: 
https://github.com/ursa-labs/crossbow/branches/all?utf8=%E2%9C%93=build-664

On Fri, Jul 12, 2019 at 5:22 PM Sutou Kouhei  wrote:
>
> Hi,
>
> I've created pull requests that were used to release 0.14.0:
>
> ARROW-5937: [Release] Stop parallel binary upload
> https://github.com/apache/arrow/pull/4868
>
> ARROW-5938: [Release] Create branch for adding release note automatically
> https://github.com/apache/arrow/pull/4869
>
> ARROW-5939: [Release] Add support for generating vote email template 
> separately
> https://github.com/apache/arrow/pull/4870
>
> ARROW-5940: [Release] Add support for re-uploading sign/checksum for binary 
> artifacts
> https://github.com/apache/arrow/pull/4871
>
> ARROW-5941: [Release] Avoid re-uploading already uploaded binary artifacts
> https://github.com/apache/arrow/pull/4872
> (This will conflict with https://github.com/apache/arrow/pull/4868 .)
>
>
> They will be useful to release 0.14.1.
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, 
> Parquet forward compatibility problems" on Fri, 12 Jul 2019 13:27:41 -0500,
>   Wes McKinney  wrote:
>
> > I updated https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014
> > to include all the cited patches, as well as the Parquet forward
> > compatibility fix.
> >
> > I'm waiting on CI to be able to pass ARROW-5921 (fuzzing-discovered
> > IPC crash) and the ARROW-5889 (Parquet backwards compatibility with
> > 0.13) needs to be rebased
> >
> > https://github.com/apache/arrow/pull/4856
> >
> > I think those are the last 2 patches that should go into the branch
> > unless something else comes up. Once those land I'll update the
> > commands and then push up the patch release branch (hopefully
> > everything will cherry pick cleanly)
> >
> > On Fri, Jul 12, 2019 at 12:34 PM Francois Saint-Jacques
> >  wrote:
> >>
> >> There's also ARROW-5921 (I tagged it 0.14.1) if it passes travis. This
> >> one fixes a segfault found via fuzzing.
> >>
> >> François
> >>
> >> On Fri, Jul 12, 2019 at 6:54 AM Krisztián Szűcs
> >>  wrote:
> >> >
> >> > PRs touching the wheel packaging scripts:
> >> > - https://github.com/apache/arrow/pull/4828 (lz4)
> >> > - https://github.com/apache/arrow/pull/4833 (uriparser - only if
> >> > https://github.com/apache/arrow/commit/88fcb096c4f24861bc7f8181cba1ad8be0e4048a
> >> > is cherry picked as well)
> >> > - https://github.com/apache/arrow/pull/4834 (zlib)
> >> >
> >> > On Fri, Jul 12, 2019 at 11:49 AM Hatem Helal  
> >> > wrote:
> >> >
> >> > > Thanks François, I closed PARQUET-1623 this morning.  It would be nice 
> >> > > to
> >> > > include the PR in the patch release:
> >> > >
> >> > > https://github.com/apache/arrow/pull/4857
> >> > >
> >> > > This bug has been around for a few releases but I think it should be a 
> >> > > low
> >> > > risk change to include.
> >> > >
> >> > > Hatem
> >> > >
> >> > >
> >> > > On 7/12/19, 2:27 AM, "Francois Saint-Jacques" 
> >> > > 
> >> > > wrote:
> >> > >
> >> > > I just merged PARQUET-1623, I think it's worth inserting since it
> >> > > fixes an invalid memory write. Note that I couldn't resolve/close 
> >> > > the
> >> > > parquet issue, do I have to be a contributor to the project?
> >> > >
> >> > > François
> >> > >
> >> > > On Thu, Jul 11, 2019 at 6:10 PM Wes McKinney 
> >> > > wrote:
> >> > > >
> >> > > > I just merged Eric's 2nd patch ARROW-5908 and I went through all 
> >> > > the
> >> > > > patches since the release commit and have come up with the 
> >> > > following
> >> > > > list of 32 fix-only patches to pick into a maintenance branch:
> >> > > >
> >> > > > https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014
> >> > > >
> >> > > > Note there's still unresolved Parquet forward/backward 
> >> > > compatibility
> >> > > > issues in C++ that we haven't merged patches for yet, so that is
> >> > > > pending.
> >> > > >
> >> > > > Are there any other patches / JIRA issues people would like to 
> >> > > see
> >> > > > resolved in a patch release?
> >> > > >
> >> > > > Thanks
> >> > > >
> >> > > > On Thu, Jul 11, 2019 at 3:03 PM Wes McKinney 
> >> > > 
> >> > > wrote:
> >> > > > >
> >> > > > > Eric -- you are free to set the Fix Version prior to the patch
> >> > > being merged
> >> > > > >
> >> > > > > On Thu, Jul 11, 2019 at 3:01 PM Eric Erhardt
> >> > > > >  wrote:
> >> > > > > >
> >> > > > > > The two C# fixes I'd like in the 0.14.1 release are:
> >> > > > > >
> >> > > > > > https://issues.apache.org/jira/browse/ARROW-5887 - already
> >> > > 

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-13 Thread Wes McKinney
On Sat, Jul 13, 2019 at 11:23 AM Antoine Pitrou  wrote:
>
> On Fri, 12 Jul 2019 20:37:15 -0700
> Micah Kornfield  wrote:
> >
> > If the latter, I wonder why Parquet cannot simply be used instead of
> > > reinventing something similar but different.
> >
> > This is a reasonable point.  However there is a continuum here between file
> > size and read and write times.  Parquet will likely always be the smallest
> > with the largest times to convert to and from Arrow.  An uncompressed
> > Feather/Arrow file will likely always take the most space but will have much
> > faster conversion times.
>
> I'm curious whether the Parquet conversion times are inherent to the
> Parquet format or due to inefficiencies in the implementation.
>

Parquet is fundamentally more complex to decode. Consider several
layers of logic that must happen for values to end up in the right
place:

* Data pages are usually compressed, and a column consists of many
data pages each having a Thrift header that must be deserialized
* Values are usually dictionary-encoded, dictionary indices are
encoded using hybrid bit-packed / RLE scheme
* Null/not-null is encoded in definition levels
* Only non-null values are stored, so when decoding to Arrow, values
have to be "moved into place" (see the sketch below)
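
To make that last step concrete, here is a small Rust sketch of scattering the
densely stored non-null values into an Arrow-style values/validity pair, driven
by definition levels. It assumes a flat optional column (maximum definition
level 1) and illustrative names; it is not the actual Parquet C++ or parquet-rs
decode path.

fn scatter_by_def_levels(def_levels: &[i16], non_null: &[f64]) -> (Vec<f64>, Vec<bool>) {
    let mut values = Vec::with_capacity(def_levels.len());
    let mut validity = Vec::with_capacity(def_levels.len());
    let mut next = 0; // index into the densely packed non-null values
    for &level in def_levels {
        if level == 1 {
            // Defined slot: take the next stored value.
            values.push(non_null[next]);
            validity.push(true);
            next += 1;
        } else {
            // Null slot: Parquet stored nothing, so insert a placeholder.
            values.push(0.0);
            validity.push(false);
        }
    }
    (values, validity)
}

fn main() {
    // Column logically [1.5, NULL, 2.5]: Parquet stores the values [1.5, 2.5]
    // plus definition levels [1, 0, 1].
    let (values, validity) = scatter_by_def_levels(&[1, 0, 1], &[1.5, 2.5]);
    assert_eq!(values, vec![1.5, 0.0, 2.5]);
    assert_eq!(validity, vec![true, false, true]);
}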

The current C++ implementation could certainly be made faster. One
consideration with Parquet is that the files are much smaller, so when
you are reading them over the network the effective end-to-end time
including IO and deserialization will frequently win.

> Regards
>
> Antoine.
>
>


Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-13 Thread Antoine Pitrou
On Fri, 12 Jul 2019 20:37:15 -0700
Micah Kornfield  wrote:
> 
> If the latter, I wonder why Parquet cannot simply be used instead of
> > reinventing something similar but different.  
> 
> This is a reasonable point.  However there is a continuum here between file
> size and read and write times.  Parquet will likely always be the smallest
> with the largest times to convert to and from Arrow.  An uncompressed
> Feather/Arrow file will likely always take the most space but will have much
> faster conversion times.

I'm curious whether the Parquet conversion times are inherent to the
Parquet format or due to inefficiencies in the implementation.

Regards

Antoine.




Re: [DISCUSS] Release cadence and release vote conventions

2019-07-13 Thread Andy Grove
I would like to volunteer to help with Java and Rust release process work,
especially nightly releases.

Although I'm not that familiar with the Java implementation of Arrow, I
have been using Java and Maven for a very long time.

Do we envisage a single nightly release process that releases all languages
simultaneously? Or do we want a separate process per language, with different
maintainers?



On Wed, Jul 10, 2019 at 8:18 AM Wes McKinney  wrote:

> On Sun, Jul 7, 2019 at 7:40 PM Sutou Kouhei  wrote:
> >
> > Hi,
> >
> > > in future releases we should
> > > institute a minimum 24-hour "quiet period" after any community
> > > feedback on a release candidate to allow issues to be examined
> > > further.
> >
> > I agree with this. I'll do so when I do a release manager in
> > the future.
> >
> > > To be able to release more often, two things have to happen:
> > >
> > > * More PMC members must engage with the release management role,
> > > process, and tools
> > > * Continued improvements to release tooling to make the process less
> > > painful for the release manager. For example, it seems we may want to
> > > find a different place than Bintray to host binary artifacts
> > > temporarily during release votes
> >
> > My opinion is that we need to build a nightly release system.
> >
> > It uses dev/release/NN-*.sh to build .tar.gz and binary
> > artifacts from the .tar.gz.
> > It also uses dev/release/verify-release-candidate.* to
> > verify the built .tar.gz and binary artifacts.
> > It also uses dev/release/post-NN-*.sh to do post release
> > tasks. (Some tasks such as uploading a package to packaging
> > system will be dry-run.)
> >
>
> I agree that having a turn-key release system that's capable of
> producing nightly packages is the way to go. That way any problems
> that would block a release will come up as they happen rather than
> piling up until the very end like they are now.
>
> > I needed 10 or more changes for dev/release/ to create
> > 0.14.0 RC0. (Some of them are still in my local stashes. I
> > don't have time to create pull requests for them
> > yet because I postponed some tasks of my main
> > business. I'll create pull requests after I finish the
> > postponed tasks of my main business.)
> >
>
> Thanks. I'll follow up on the 0.14.1/0.15.0 thread -- since we need to
> release again soon because of problems with 0.14.0 please let us know
> what patches will be needed to make another release.
>
> > If we fix problems related to dev/release/ in our normal
> > development process, the release process will be less painful.
> >
> > The biggest problem for 0.14.0 RC0 is java/pom.xml related:
> >   https://github.com/apache/arrow/pull/4717
> >
> > It was difficult for me because I don't have Java
> > knowledge. The release manager needs help from many developers
> > because the release manager may not have knowledge of all
> > supported languages. Apache Arrow supports over 10
> > languages.
> >
> >
> > As for the Bintray API limit problem, we'll be able to resolve it.
> > I was added to https://bintray.com/apache/ members:
> >
> >   https://issues.apache.org/jira/browse/INFRA-18698
> >
> > I'll be able to use the Bintray API without limitation in the
> > future. Release managers should also request the same thing.
> >
>
> This is good, I will add myself. Other PMC members should also add
> themselves.
>
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "[DISCUSS] Release cadence and release vote conventions" on Sat, 6 Jul
> 2019 16:28:50 -0500,
> >   Wes McKinney  wrote:
> >
> > > hi folks,
> > >
> > > As a reminder, particularly since we have many new community members
> > > (some of whom have never been involved with an ASF project before),
> > > releases are approved exclusively by the PMC and in general releases
> > > cannot be vetoed. In spite of that, we strive to make releases that
> > > have unanimous (either by explicit +1 or lazy consent) support of the
> > > PMC. So it is better to have unanimous 5 +1 votes than 6 +1 votes with
> > > a -1 dissenting vote.
> > >
> > > On the 0.14.0 vote, as with previous release votes, some issues with
> > > the release were raised by members of the community, whether build or
> > > test-related problems or other failures. Technically speaking, such
> > > issues have no _direct_ bearing on whether a release vote passes, only
> > > on whether PMC members vote +1, 0, or -1. A PMC member is allowed to
> > > change their vote based on new information -- for example, if I voted
> > > +1 on a release and then someone reported a serious licensing issue,
> > > then I would revise my vote to -1.
> > >
> > > On the RC0 vote thread, Jacques wrote [1]
> > >
> > > "A release vote should last until we arrive at consensus. When an
> > > issue is potentially identified, those that have voted should be given
> > > ample time to change their vote and others that may have been lazy
> > > consenters should be given time to chime in. There is no maximum
> > > amount of time a vote can be open. Allowing 

[jira] [Created] (ARROW-5945) [Rust] [DataFusion] Table trait should support building complete queries

2019-07-13 Thread Andy Grove (JIRA)
Andy Grove created ARROW-5945:
-

 Summary: [Rust] [DataFusion] Table trait should support building 
complete queries
 Key: ARROW-5945
 URL: https://issues.apache.org/jira/browse/ARROW-5945
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust, Rust - DataFusion
Affects Versions: 0.14.0
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 1.0.0


DataFusion 0.13 included a preview Table trait, which provides a DataFrame 
style method of building a logical query plan, but it was not usable for any 
real-world queries.

I would now like the trait to support building real queries, especially 
aggregate queries.
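
For discussion, here is a minimal Rust sketch of the kind of DataFrame-style 
aggregate building this could enable. The names below (Table, Expr, col, min, 
max, aggregate, MockTable) are placeholders for illustration, not the existing 
DataFusion API.

{code}
#[derive(Debug, Clone)]
enum Expr {
    Column(String),
    Min(Box<Expr>),
    Max(Box<Expr>),
}

fn col(name: &str) -> Expr { Expr::Column(name.to_string()) }
fn min(e: Expr) -> Expr { Expr::Min(Box::new(e)) }
fn max(e: Expr) -> Expr { Expr::Max(Box::new(e)) }

trait Table {
    /// Group by the given expressions and compute the given aggregates.
    fn aggregate(&self, group_expr: Vec<Expr>, aggr_expr: Vec<Expr>) -> Self
    where
        Self: Sized;
}

#[derive(Debug, Default)]
struct MockTable {
    group_expr: Vec<Expr>,
    aggr_expr: Vec<Expr>,
}

impl Table for MockTable {
    fn aggregate(&self, group_expr: Vec<Expr>, aggr_expr: Vec<Expr>) -> Self {
        MockTable { group_expr, aggr_expr }
    }
}

fn main() {
    // Equivalent of:
    //   SELECT passenger_count, MIN(fare_amount), MAX(fare_amount)
    //   FROM tripdata GROUP BY passenger_count
    let tripdata = MockTable::default();
    let plan = tripdata.aggregate(
        vec![col("passenger_count")],
        vec![min(col("fare_amount")), max(col("fare_amount"))],
    );
    println!("{:?}", plan);
}
{code}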





[jira] [Created] (ARROW-5944) Remove 'div' alias for 'divide'

2019-07-13 Thread Prudhvi Porandla (JIRA)
Prudhvi Porandla created ARROW-5944:
---

 Summary: Remove 'div' alias for 'divide' 
 Key: ARROW-5944
 URL: https://issues.apache.org/jira/browse/ARROW-5944
 Project: Apache Arrow
  Issue Type: Task
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla


div and divide are two different operators.


