Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-07 Thread Micah Kornfield
Indeed, the BoundsChecking and NullCheckingForGet flags can make a big
difference.  I didn't initially run the benchmarks with these turned on
(you can see the result from above with Float8Benchmarks).  Here are new
numbers including with the flags enabled.  It looks like using longs might
be a little bit slower; I'll see what I can do to mitigate this.

Ravindra also volunteered to try to benchmark the changes with Dremio's
code on today's sync call.

New

Benchmark                                        Mode  Cnt   Score   Error  Units
BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   4.176 ± 1.292  ns/op
BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  26.102 ± 0.700  ns/op
Float8Benchmarks.copyFromBenchmark               avgt    5   7.398 ± 0.084  us/op
Float8Benchmarks.readWriteBenchmark              avgt    5   2.711 ± 0.057  us/op

Old

Benchmark                                        Mode  Cnt   Score   Error  Units
BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.828 ± 0.030  ns/op
BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  20.611 ± 0.188  ns/op
Float8Benchmarks.copyFromBenchmark               avgt    5   6.597 ± 0.462  us/op
Float8Benchmarks.readWriteBenchmark              avgt    5   2.615 ± 0.027  us/op
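
For context, the sketch below is a minimal stand-in for the kind of loop a
read/write micro-benchmark like Float8Benchmarks exercises; it is not the actual
benchmark source, and the flag names mentioned in the comments (BoundsChecking,
NullCheckingForGet) are the usual Arrow Java switches assumed here.

{code}
// Minimal sketch, not the actual Float8Benchmarks source: fill a Float8Vector
// and read it back -- the hot path affected by the bounds/null-check flags.
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.Float8Vector;

public class Float8ReadWriteSketch {
  public static void main(String[] args) {
    final int length = 1024;
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         Float8Vector vector = new Float8Vector("vector", allocator)) {
      vector.allocateNew(length);
      for (int i = 0; i < length; i++) {
        vector.setSafe(i, i * 1.0);   // write path
      }
      vector.setValueCount(length);
      double sum = 0;
      for (int i = 0; i < length; i++) {
        sum += vector.get(i);         // read path; get() is where bounds/null checks kick in
      }
      System.out.println(sum);
    }
  }
}
{code}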

On Wed, Aug 7, 2019 at 7:13 PM Fan Liya  wrote:

> Hi Gonzalo,
>
> Thanks for sharing the performance results.
> I am wondering if you have turned off the flag
> BoundsChecking#BOUNDS_CHECKING_ENABLED.
> If not, the lower throughput should be expected.
>
> Best,
> Liya Fan
>
> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield 
> wrote:
>
>> Hi Gonzalo,
>> Thank you for the feedback.  I wasn't aware of the JIT implications.   At
>> least on the benchmark run they don't seem to have an impact.
>>
>> If there are other benchmarks that people have that can validate if this
>> change will be problematic I would appreciate trying to run them with the
>> PR.  I will try to run the ones for zeroing/popcnt tonight to see if there
>> is a change in those.
>>
>> -Micah
>>
>>
>>
>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
>> golthir...@gmail.com> wrote:
>>
>> > I would recommend taking care with this kind of change.
>> >
>> > I didn't try Arrow in more than one year, but by then the performance was
>> > quite bad in comparison with plain byte buffer access
>> > (see http://git.net/apache-arrow-development/msg02353.html *) and
>> > there are several optimizations that the JVM (specifically, C2) does not
>> > apply when dealing with longs instead of ints. Among the most common are
>> > loop unrolling and vectorization.
>> >
>> > * It doesn't seem the best way to reference an old email on the list,
>> but
>> > it is the only result shown by Google
>> >
>> > On Wed, Aug 7, 2019 at 11:42, Fan Liya () wrote:
>> >
>> >> Hi Micah,
>> >>
>> >> Thanks for your effort. The performance result looks good.
>> >>
>> >> As you indicated, ArrowBuf will take additional 12 bytes (4 bytes for
>> each
>> >> of length, write index, and read index).
>> >> Similar overheads also exist for vectors like BaseFixedWidthVector,
>> >> BaseVariableWidthVector, etc.
>> >>
>> >> IMO, such overheads are small enough to justify the change.
>> >> Let's check if there are other overheads.
>> >>
>> >> Best,
>> >> Liya Fan
>> >>
>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield 
>> >> wrote:
>> >>
>> >> > Hi Liya Fan,
>> >> > Based on the Float8Benchmark there does not seem to be any meaningful
>> >> > performance difference on my machine.  At least for me, the
>> benchmarks
>> >> are
>> >> > not stable enough to say one is faster than the other (I've pasted
>> >> results
>> >> > below).  That being said my machine isn't necessarily the most
>> reliable
>> >> for
>> >> > benchmarking.
>> >> >
>> >> > On an intuitive level, this makes sense to me; for the most part it
>> >> > seems like the change just moves casting from "int" to "long" further
>> >> > up the stack for "PlatformDependent" operations.  If there are other
>> >> > benchmarks that you think are worth running, let me know.
>> >> >
>> >> > One performance downside I think for this change is that it increases
>> >> > the size of ArrowBuf objects, which I suppose could influence cache
>> >> > misses at some level or increase the size of call stacks, but this
>> >> > doesn't seem to show up in the benchmark.
>> >> >
>> >> > Thanks,
>> >> > Micah
>> >> >
>> >> > Sample benchmark numbers:
>> >> >
>> >> > [New Code]
>> >> > Benchmark                            Mode  Cnt   Score   Error  Units
>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ± 0.469  us/op
>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ± 0.115  us/op
>> >> >
>> >> >
>> >> > [Old code]
>> >> > Benchmark                            Mode  Cnt   Score   Error  Units
>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ± 1.409  us/op
>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ± 0.084  us/op
>> >> >
>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya 
>> wrote:
>> >> >
>> >> >> Hi 

Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-07 Thread Fan Liya
Hi Gonzalo,

Thanks for sharing the performance results.
I am wondering if you have turned off the flag
BoundsChecking#BOUNDS_CHECKING_ENABLED.
If not, the lower throughput should be expected.

Best,
Liya Fan

On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield 
wrote:

> Hi Gonzalo,
> Thank you for the feedback.  I wasn't aware of the JIT implications.   At
> least on the benchmark run they don't seem to have an impact.
>
> If there are other benchmarks that people have that can validate if this
> change will be problematic I would appreciate trying to run them with the
> PR.  I will try to run the ones for zeroing/popcnt tonight to see if there
> is a change in those.
>
> -Micah
>
>
>
> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
> golthir...@gmail.com> wrote:
>
> > I would recommend taking care with this kind of change.
> >
> > I didn't try Arrow in more than one year, but by then the performance was
> > quite bad in comparison with plain byte buffer access
> > (see http://git.net/apache-arrow-development/msg02353.html *) and
> > there are several optimizations that the JVM (specifically, C2) does not
> > apply when dealing with longs instead of ints. Among the most common are
> > loop unrolling and vectorization.
> >
> > * It doesn't seem the best way to reference an old email on the list, but
> > it is the only result shown by Google
> >
> > On Wed, Aug 7, 2019 at 11:42, Fan Liya () wrote:
> >
> >> Hi Micah,
> >>
> >> Thanks for your effort. The performance result looks good.
> >>
> >> As you indicated, ArrowBuf will take additional 12 bytes (4 bytes for
> each
> >> of length, write index, and read index).
> >> Similar overheads also exist for vectors like BaseFixedWidthVector,
> >> BaseVariableWidthVector, etc.
> >>
> >> IMO, such overheads are small enough to justify the change.
> >> Let's check if there are other overheads.
> >>
> >> Best,
> >> Liya Fan
> >>
> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield 
> >> wrote:
> >>
> >> > Hi Liya Fan,
> >> > Based on the Float8Benchmark there does not seem to be any meaningful
> >> > performance difference on my machine.  At least for me, the benchmarks
> >> are
> >> > not stable enough to say one is faster than the other (I've pasted
> >> results
> >> > below).  That being said my machine isn't necessarily the most
> reliable
> >> for
> >> > benchmarking.
> >> >
> >> > On an intuitive level, this makes sense to me; for the most part it
> >> > seems like the change just moves casting from "int" to "long" further
> >> > up the stack for "PlatformDependent" operations.  If there are other
> >> > benchmarks that you think are worth running, let me know.
> >> >
> >> > One performance downside I think for this change is that it increases
> >> > the size of ArrowBuf objects, which I suppose could influence cache
> >> > misses at some level or increase the size of call stacks, but this
> >> > doesn't seem to show up in the benchmark.
> >> >
> >> > Thanks,
> >> > Micah
> >> >
> >> > Sample benchmark numbers:
> >> >
> >> > [New Code]
> >> > Benchmark                            Mode  Cnt   Score   Error  Units
> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ± 0.469  us/op
> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ± 0.115  us/op
> >> >
> >> >
> >> > [Old code]
> >> > Benchmark                            Mode  Cnt   Score   Error  Units
> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ± 1.409  us/op
> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ± 0.084  us/op
> >> >
> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya  wrote:
> >> >
> >> >> Hi Micah,
> >> >>
> >> >> Thanks a lot for doing this.
> >> >>
> >> >> I am a little concerned about whether there is any negative performance
> >> >> impact on the current 32-bit-length-based applications.
> >> >> Can we do some performance comparison on our existing benchmarks?
> >> >>
> >> >> Best,
> >> >> Liya Fan
> >> >>
> >> >>
> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
> emkornfi...@gmail.com>
> >> >> wrote:
> >> >>
> >> >>> There have been some previous discussions on the mailing list about
> >> >>> supporting 64-bit lengths for Java ValueVectors (this is what the IPC
> >> >>> specification and C++ support).  I created a PR [1] that changes all
> >> >>> APIs that I could find that take an index to take a "long" instead of
> >> >>> an "int" (and similarly change "size/rowcount" APIs).
> >> >>>
> >> >>> It is a big change, so I think it is worth discussing if it is
> >> something
> >> >>> we
> >> >>> still want to move forward with.  It would be nice to come to a
> >> >>> conclusion
> >> >>> quickly, ideally in the next few days, to avoid a lot of merge
> >> conflicts.
> >> >>>
> >> >>> The reason I did this work now is the C++ implementation has added
> >> >>> support
> >> >>> for LargeList, LargeBinary and LargeString arrays and based on prior
> >> >>> discussions we need to have similar 

[jira] [Created] (ARROW-6167) [R] macOS binary R packages on CRAN don't have arrow_available

2019-08-07 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-6167:
--

 Summary: [R] macOS binary R packages on CRAN don't have 
arrow_available
 Key: ARROW-6167
 URL: https://issues.apache.org/jira/browse/ARROW-6167
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.14.1
Reporter: Neal Richardson
Assignee: Neal Richardson


The {{configure}} script in the R package has some 
[magic|https://github.com/apache/arrow/blob/master/r/configure#L66-L86] that 
should ensure that on macOS, you're guaranteed a successful library 
installation even (especially) if you don't have libarrow installed on your 
system. This magic also is designed so that when CRAN builds a binary package 
for macOS, the C++ libraries are bundled and "just work" when a user installs 
it, no compilation required. 

However, the magic appeared to fail on CRAN this time, as the binaries linked 
on [https://cran.r-project.org/web/packages/arrow/index.html] were built 
without libarrow ({{arrow::arrow_available()}} returns {{FALSE}}). 

I've identified three vectors by which you can get an arrow package 
installation on macOS in this state:
 # The [check|https://github.com/apache/arrow/blob/master/r/configure#L71] to 
see if you've already installed {{apache-arrow}} via Homebrew always passes, so 
if you have Homebrew installed but haven't done {{brew install apache-arrow}}, 
the script won't do it for you as it appears to intend. (This is not 
suspected to be the problem on CRAN because they don't have Homebrew installed.)
 # If the 
"[autobrew|https://github.com/apache/arrow/blob/master/r/configure#L80-L81]" 
installation fails, then the [test on 
L102|https://github.com/apache/arrow/blob/master/r/configure#L102] will 
correctly fail. I managed to trigger this (by luck?) on the [R-hub testing 
service|https://builder.r-hub.io/status/arrow_0.14.1.tar.gz-da083126612b46e28854b95156b87b31#L533].
 This is possibly what happened on CRAN, though the only [build 
logs|https://www.r-project.org/nosvn/R.check/r-release-osx-x86_64/arrow-00check.html]
 we have from CRAN are terse because it believes the build was successful. 
 # Some idiosyncrasy in the compiler on the CRAN macOS system such that the 
autobrew script would successfully download the arrow libraries but the L102 
check would error. I've been unable to reproduce this using the [version of 
clang7 that CRAN provides|https://cran.r-project.org/bin/macosx/tools/].

I have a fix for the first one and will provide workaround documentation for 
the README and announcement blog post. Unfortunately, I don't know that there's 
anything we can do about the useless binaries on CRAN at this time, 
particularly since CRAN is going down for maintenance August 9-18.

cc [~jeroenooms] [~romainfrancois] [~wesmckinn]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6166) [Go] Slice of slice causes index out of range panic

2019-08-07 Thread Roshan Kumaraswamy (JIRA)
Roshan Kumaraswamy created ARROW-6166:
-

 Summary: [Go] Slice of slice causes index out of range panic
 Key: ARROW-6166
 URL: https://issues.apache.org/jira/browse/ARROW-6166
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Roshan Kumaraswamy


When slicing a slice, the offset of the underlying data will cause an index out 
of range panic if the offset is greater than the slice length. See 
[https://github.com/apache/arrow/issues/5033]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Options for running the integration tests

2019-08-07 Thread paddy horan
Hi All,

I have been away from Arrow for a while due to relocation of family and RSI.  
I'd like to start working toward getting Rust passing the integration tests.  
In the last few months a lot of work has been done to "dockerize" many of the 
build steps in the project, which I'm trying to figure out.

I started out using the 'arrow_integration_xenial_base' image and submitted a 
PR to allow it to be built from a Windows host, but I noticed that there is a 
page in the pyarrow docs related to integration testing 
(https://arrow.apache.org/docs/developers/integration.html) that uses 
docker-compose from the top level of the project.  It seems that the 
'arrow_integration_xenial_base' image is replaced by this solution?

Is there a way to run the integration tests (integration_test.py) in a 
reproducible way via docker at this time?  If not I plan to add the 
dependencies for go and java etc to 'arrow_integration_xenial_base' so that I 
can run integration_test.py in a docker container.

Thanks,
Paddy

Integration Testing — Apache Arrow v0.13.0 (arrow.apache.org):

# Build and run manually
docker-compose build cpp
docker-compose build python
docker-compose run python

# Using the makefile with proper image dependency resolution
make -f Makefile.docker python



[jira] [Created] (ARROW-6165) [Integration] Use multiprocessing to run integration tests on multiple CPU cores

2019-08-07 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6165:
---

 Summary: [Integration] Use multiprocessing to run integration 
tests on multiple CPU cores
 Key: ARROW-6165
 URL: https://issues.apache.org/jira/browse/ARROW-6165
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Integration
Reporter: Wes McKinney


The stdout/stderr will have to be captured appropriately so that the console 
output when run in parallel is still readable.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: Aw: Re: [Discuss] C++ filenames: hyphens or underscores?

2019-08-07 Thread Antoine Pitrou


Some good points were made on the Arrow sync call.
To sum them up:

- looking through /usr/include on a Linux distribution, the vast majority
of filenames seem to use underscores, including in the C++ stdlib
- most of our public C++ header files use underscores, so there's less
disruption to existing user code if we switch to underscores rather than
hyphens

So my current stance would be:
- switch to underscores for C++ source file names (*.h, *.cc files)
- ensure hyphens are used in executable file names (unit tests,
benchmarks, utilities) - perhaps using some CMake logic

Regards

Antoine.



On 07/08/2019 at 11:19, hans-joachim.bo...@web.de wrote:
> Hi,
> 
> I struggled a long time over whether underscore or hyphen is better.
> The result was a mess, using both - never sure which one I actually used in a 
> certain case.
> 
> Today I am consistently using underscores in variable names and hyphens in 
> file names.
> - No more trial and error.
> - Compliant with Linux naming, as Kou rightly mentioned.
> - No chance to use hyphens in variable names - so I never forget which way it 
> is.
> 
> Hans
> 
> 
> 
>> Sent: Wednesday, 07 August 2019 at 07:44
>> From: "Philipp Moritz" 
>> To: dev@arrow.apache.org, emkornfi...@gmail.com
>> Subject: Re: [Discuss] C++ filenames: hyphens or underscores?
>>
>> I also have a small preference for underscores but would also be fine with
>> dashes.
>>
>> It seems to be more common (therefore blends better with vendored code) and
>> agrees with the style guide and is closest to the existing code. Also as an
>> aside, having file_names like variable_names is nice. Compare the
>> Lispy way of using dashes for both.
>>
>> Thanks for getting this discussion started, the mixture of dashes and
>> underscores has been bothering me too :)
>>
>> On Tue, Aug 6, 2019 at 8:41 PM Micah Kornfield 
>> wrote:
>>
>>> I also have a preference for underscore but can get used to anything.
>>>
>>> I agree with the points François made above about the recommendation of the
>>> style guide and the smaller change to the existing code base.
>>>
>>> On Tue, Aug 6, 2019 at 6:52 PM Francois Saint-Jacques <
>>> fsaintjacq...@gmail.com> wrote:
>>>
 My vote would go with underscore to minimize changes and minimize
 exceptions to the google style guide reference. I also suggests that
 we add this to the linters somehow, if it's not too much trouble.

 François

 On Tue, Aug 6, 2019 at 9:35 PM Sutou Kouhei  wrote:
>
> Hi,
>
> I like hyphens.
>
> Because many Linux commands use hyphens rather than
> underscores. Here are counts on my Debian GNU/Linux machine:
>
> % ls /usr/bin/ | grep -- - | wc -l
> 956
> % ls /usr/bin/ | grep _ | wc -l
> 343
>
>
> Thanks,
> --
> kou
>
> In <20190806140340.2a7ffab2@fsol>
>   "[Discuss] C++ filenames: hyphens or underscores?" on Tue, 6 Aug 2019
 14:03:40 +0200,
>   Antoine Pitrou  wrote:
>
>>
>> Hello,
>>
>> The filenames in the C++ source tree are a bit ad hoc and
>>> inconsistent.
>> Sometimes they use hyphens for word separation, sometimes
>>> underscores.
>> In ARROW-4648 it was proposed that we unify C++ file naming,
>>> therefore
>> there are two possible options: only hyphens, or only underscores.
>>
>> What are your preferences?  Personally, I have a slight preference
>>> for
>> hyphens, especially as they are already used in binary names.
>>
>> Regards
>>
>> Antoine.
>>
>>

>>>
>>


[RESULT] [VOTE] Adopt FORMAT and LIBRARY SemVer-based version schemes for Arrow 1.0.0 and beyond

2019-08-07 Thread Wes McKinney
The motion carries with 5 binding +1 and 2 non-binding +1 and no other votes

We need to make sure this is properly documented in time for the 1.0.0 release

https://issues.apache.org/jira/browse/ARROW-6164

On Sun, Aug 4, 2019 at 5:04 PM Jacques Nadeau  wrote:
>
> Looks good. +1 from me. Thanks for driving this to conclusion.
>
> On Wed, Jul 31, 2019, 12:04 PM Bryan Cutler  wrote:
>
> > +1 (non-binding)
> >
> > On Wed, Jul 31, 2019 at 8:59 AM Uwe L. Korn  wrote:
> >
> > > +1 from me.
> > >
> > > I really like the separate versions
> > >
> > > Uwe
> > >
> > > On Tue, Jul 30, 2019, at 2:21 PM, Antoine Pitrou wrote:
> > > >
> > > > +1 from me.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > >
> > > > On Fri, 26 Jul 2019 14:33:30 -0500
> > > > Wes McKinney  wrote:
> > > > > hello,
> > > > >
> > > > > As discussed on the mailing list thread [1], Micah Kornfield has
> > > > > proposed a version scheme for the project to take effect starting
> > with
> > > > > the 1.0.0 release. See document [2] containing a discussion of the
> > > > > issues involved.
> > > > >
> > > > > To summarize my understanding of the plan:
> > > > >
> > > > > 1. TWO VERSIONS: As of 1.0.0, we establish separate FORMAT and
> > LIBRARY
> > > > > versions. Currently there is only a single version number.
> > > > >
> > > > > 2. SEMANTIC VERSIONING: We follow https://semver.org/ with regards
> > to
> > > > > communicating library API changes. Given the project's pace of
> > > > > evolution, most releases are likely to be MAJOR releases according to
> > > > > SemVer principles.
> > > > >
> > > > > 3. RELEASES: Releases of the project will be named according to the
> > > > > LIBRARY version. A major release may or may not change the FORMAT
> > > > > version. When a LIBRARY version has been released for a new FORMAT
> > > > > version, the latter is considered to be released and official.
> > > > >
> > > > > 4. Each LIBRARY version will have a corresponding FORMAT version. For
> > > > > example, LIBRARY versions 2.0.0 and 3.0.0 may track FORMAT version
> > > > > 1.0.0. The idea is that FORMAT version will change less often than
> > > > > LIBRARY version.
> > > > >
> > > > > 5. BACKWARD COMPATIBILITY GUARANTEE: A newer versioned client library
> > > > > will be able to read any data and metadata produced by an older
> > client
> > > > > library.
> > > > >
> > > > > 6. FORWARD COMPATIBILITY GUARANTEE: An older client library must be
> > > > > able to either read data generated from a new client library or
> > detect
> > > > > that it cannot properly read the data.
> > > > >
> > > > > 7. FORMAT MINOR VERSIONS: An increase in the minor version of the
> > > > > FORMAT version, such as 1.0.0 to 1.1.0, indicates that 1.1.0 contains
> > > > > new features not available in 1.0.0. So long as these features are
> > not
> > > > > used (such as a new logical data type), forward compatibility is
> > > > > preserved.
> > > > >
> > > > > 8. FORMAT MAJOR VERSIONS: A change in the FORMAT major version
> > > > > indicates a disruption to these compatibility guarantees in some way.
> > > > > Hopefully we don't have to do this many times in our respective
> > > > > lifetimes
> > > > >
> > > > > If I've misrepresented some aspect of the proposal it's fine to
> > > > > discuss more and we can start a new vote.
> > > > >
> > > > > Please vote to approve this proposal. I'd like to keep this vote open
> > > > > for 7 days (until Friday August 2) to allow for ample opportunities
> > > > > for the community to have a look.
> > > > >
> > > > > [ ] +1 Adopt these version conventions and compatibility guarantees
> > as
> > > > > of Apache Arrow 1.0.0
> > > > > [ ] +0
> > > > > [ ] -1 I disagree because...
> > > > >
> > > > > Here is my vote: +1
> > > > >
> > > > > Thanks
> > > > > Wes
> > > > >
> > > > > [1]:
> > >
> > https://lists.apache.org/thread.html/5715a4d402c835d22d929a8069c5c0cf232077a660ee98639d544af8@%3Cdev.arrow.apache.org%3E
> > > > > [2]:
> > >
> > https://docs.google.com/document/d/1uBitWu57rDu85tNHn0NwstAbrlYqor9dPFg_7QaE-nc/edit#
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >


[jira] [Created] (ARROW-6164) [Docs][Format] Document project versioning schema and forward/backward compatibility policies

2019-08-07 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6164:
---

 Summary: [Docs][Format] Document project versioning schema and 
forward/backward compatibility policies
 Key: ARROW-6164
 URL: https://issues.apache.org/jira/browse/ARROW-6164
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format
Reporter: Wes McKinney
 Fix For: 1.0.0


Based on policy adopted via vote on mailing list



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6163) [C++] Misnamed test

2019-08-07 Thread Dmitry Kalinkin (JIRA)
Dmitry Kalinkin created ARROW-6163:
--

 Summary: [C++] Misnamed test
 Key: ARROW-6163
 URL: https://issues.apache.org/jira/browse/ARROW-6163
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.14.1
Reporter: Dmitry Kalinkin


"arrow-dataset-file_test" defined in
https://github.com/apache/arrow/blob/49badd25804af85dfe9019ab1390c649a02c89fa/cpp/src/arrow/dataset/CMakeLists.txt#L49
the existing naming convention seems to be "foo-bar-test" and not 
"foo-bar_test". Test need to be renamed.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements

2019-08-07 Thread Wes McKinney
I'm canceling the vote until we can refine the proposal a bit more.

I think that having an 8-byte EOS as 0xFFFFFFFF 0x00000000 is OK -- I
think my concern about backwards compatibility is unwarranted.

We also previously added the EOS to the File Format

https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#file-format

I think we should formalize this decision with this vote. Note that
making the EOS 8 bytes has the effect of aligning the file footer,
which otherwise would cause possible undefined behavior in C++.

I'll wait a day or two for more opinions to percolate and then call a new vote

On Wed, Aug 7, 2019 at 9:35 AM Antoine Pitrou  wrote:
>
>
> Hmm... not sure about that.  IMHO, if the old format is detected, then a
> 4-byte EOS marker should be used.  If the new format is detected, then an
> 8-byte EOS marker should be used.
>
> Regards
>
> Antoine.
>
>
> On 07/08/2019 at 16:16, Wes McKinney wrote:
> > You make a good point. For backward compatibility reasons, bytes 5
> > through 8 would need to be unspecified padding bytes, I think. Does
> > that sound right?
> >
> > On Wed, Aug 7, 2019 at 9:02 AM Antoine Pitrou  wrote:
> >>
> >>
> >> This may be coming a bit late, but I realize we could take the
> >> opportunity to *also* make the end-of-stream marker an 8-byte marker
> >> (rather than 4 bytes).  What do you think?
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> On 06/08/2019 at 22:15, Wes McKinney wrote:
> >>> hi all,
> >>>
> >>> As we've been discussing for the last 5 weeks or so [1], there is a
> >>> need to introduce 4 bytes of padding into the preamble of the
> >>> "encapsulated IPC message" format to ensure that the Flatbuffers
> >>> metadata payload begins on an 8-byte aligned memory offset. The
> >>> alternative to this would be for Arrow implementations where alignment
> >>> is important (e.g. C or C++) to copy the metadata (which is not always
> >>> small) into memory when it is unaligned.
> >>>
> >>> Micah has proposed to address this by adding a 4-byte "continuation"
> >>> value at the beginning of the payload having the value 0xFFFFFFFF. The
> >>> reason to do it this way is that old clients will see an invalid
> >>> length (what is currently the first 4 bytes of the message -- a 32-bit
> >>> little endian signed integer indicating the metadata length) rather
> >>> than potentially crashing on a valid length.
> >>>
> >>> This would be a backwards incompatible protocol change, so older Arrow
> >>> libraries would not be able to read these new messages. Maintaining
> >>> forward compatibility (reading data produced by older libraries) would
> >>> be possible as we can reason that a value other than the continuation
> >>> value was produced by an older library (and then validate the
> >>> Flatbuffer message of course). Arrow implementations could offer a
> >>> backward compatibility mode for the sake of old readers if they desire
> >>> (this may also assist with testing).
> >>>
> >>> The PR making these changes to the IPC documentation is here
> >>>
> >>> https://github.com/apache/arrow/pull/4951
> >>>
> >>> Please vote to accept this change. This vote will be open for at least 72 
> >>> hours
> >>>
> >>> [ ] +1 Adopt the Arrow protocol change
> >>> [ ] +0
> >>> [ ] -1 I disagree because...
> >>>
> >>> Here is my vote: +1
> >>>
> >>> Thanks,
> >>> Wes
> >>>
> >>> [1]: 
> >>> https://lists.apache.org/thread.html/8440be572c49b7b2ffb76b63e6d935ada9efd9c1c2021369b6d27786@%3Cdev.arrow.apache.org%3E
> >>>
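
To make the framing discussed above concrete, here is a rough sketch of how a
reader could tell the old and new message preambles apart under the proposal.
This is an illustration only, not the final specification or any released Arrow
reader; the class and method names are hypothetical and little-endian handling
is omitted.

{code}
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of the proposed preamble handling; not a released Arrow implementation.
public final class MessagePreambleSketch {
  private static final int CONTINUATION_MARKER = 0xFFFFFFFF;

  /** Reads the Flatbuffer metadata length, consuming 4 or 8 preamble bytes. */
  public static int readMetadataLength(InputStream in) throws IOException {
    DataInputStream data = new DataInputStream(in);
    int first = data.readInt();   // note: real streams are little-endian; omitted here
    if (first == CONTINUATION_MARKER) {
      return data.readInt();      // new format: marker, then length (keeps metadata 8-byte aligned)
    }
    return first;                 // old format: the first 4 bytes are the length itself
  }
}
{code}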


[jira] [Created] (ARROW-6162) [C++][Gandiva] Do not truncate string in castVARCHAR_varchar when out_len parameter is zero

2019-08-07 Thread Prudhvi Porandla (JIRA)
Prudhvi Porandla created ARROW-6162:
---

 Summary: [C++][Gandiva] Do not truncate string in 
castVARCHAR_varchar when out_len parameter is zero
 Key: ARROW-6162
 URL: https://issues.apache.org/jira/browse/ARROW-6162
 Project: Apache Arrow
  Issue Type: Task
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: Arrow sync call tomorrow (August 7) at 12:00 US/Eastern, 16:00 UTC

2019-08-07 Thread Neal Richardson
Attendees:

Hatem Helal
Micah Kornfield
Wes McKinney
Rok Mihevc
Ravindra Pindikura
Antoine Pitrou
Neal Richardson
François Saint-Jacques


Discussion:

* Alignment vote: Antoine raised proposal to pad end of stream too.
Related need to make official an EOS marker. Decision to suspend vote
for now.
* 64-bit addressing in Java
(https://lists.apache.org/thread.html/dcd1ae4d943b40568e6b178b91cb7ed012168711f9eb2bbf3c3cbd2d@%3Cdev.arrow.apache.org%3E,
https://github.com/apache/arrow/pull/5020). Dremio to study the PR.
* Sparse proposal (https://github.com/apache/arrow/pull/4815). Discuss
need to get more feedback from PMC members, ways to get their
attention
  * Aside: discussion of organization of format documents. Wes said
he'd spend some time on that this month
* Plans for sparse tensors?
* Move static site source to arrow-site
(https://lists.apache.org/thread.html/73e8a7d2ee667e83371e8bb99c13338144c8af5444f55918994713be@%3Cdev.arrow.apache.org%3E)
Lazy consensus reached?
* Underscores vs. hyphens in C++ file names?
(https://lists.apache.org/thread.html/420cdf013b67c6ae02100d3e008a4acc984ede9316ab2745a452471e@%3Cdev.arrow.apache.org%3E)
* Informal discussion of 0.15.0 release timing, nightly builds

On Tue, Aug 6, 2019 at 9:29 AM Neal Richardson
 wrote:
>
> Hi all,
> Reminder that the biweekly Arrow call is tomorrow at
> https://meet.google.com/vtm-teks-phx. All are welcome to join. Notes
> will be sent out to the mailing list afterwards.
>
> Neal


[jira] [Created] (ARROW-6161) [C++] Implements dataset::ParquetFile and associated Scan structures

2019-08-07 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-6161:
-

 Summary: [C++] Implements dataset::ParquetFile and associated Scan 
structures
 Key: ARROW-6161
 URL: https://issues.apache.org/jira/browse/ARROW-6161
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Francois Saint-Jacques






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6160) [Java] AbstractStructVector#getPrimitiveVectors fails to work with complex child vectors

2019-08-07 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6160:
-

 Summary: [Java] AbstractStructVector#getPrimitiveVectors fails to 
work with complex child vectors
 Key: ARROW-6160
 URL: https://issues.apache.org/jira/browse/ARROW-6160
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently in {{AbstractStructVector#getPrimitiveVectors}}, only struct-type 
child vectors are recursively flattened into primitive vectors; other complex 
types like {{ListVector}} and {{UnionVector}} are treated as primitive types and 
returned directly.

For example, for Struct(List(Int), Struct(Int, Varchar)), {{getPrimitiveVectors}} 
should return {{[IntVector, IntVector, VarCharVector]}} instead of {{[ListVector, 
IntVector, VarCharVector]}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
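
A rough sketch of the recursive flattening the ARROW-6160 issue describes
(illustrative only; the actual fix lives in
AbstractStructVector#getPrimitiveVectors and its signature may differ): recurse
into any vector that has children, regardless of its type, and only collect the
leaves.

{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.arrow.vector.FieldVector;

// Illustration of the desired behavior, not the actual AbstractStructVector code.
final class PrimitiveVectorCollector {
  static List<FieldVector> collectPrimitives(FieldVector vector) {
    List<FieldVector> result = new ArrayList<>();
    List<FieldVector> children = vector.getChildrenFromFields();
    if (children == null || children.isEmpty()) {
      result.add(vector);                          // leaf: a primitive vector
    } else {
      for (FieldVector child : children) {
        result.addAll(collectPrimitives(child));   // recurse into struct/list/union children
      }
    }
    return result;
  }
}
{code}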


Re: [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements

2019-08-07 Thread Antoine Pitrou


Hmm... not sure about that.  IMHO, if the old format is detected, then a
4-byte EOS marker should be used.  If the new format is detected, then an
8-byte EOS marker should be used.

Regards

Antoine.


On 07/08/2019 at 16:16, Wes McKinney wrote:
> You make a good point. For backward compatibility reasons, bytes 5
> through 8 would need to be unspecified padding bytes, I think. Does
> that sound right?
> 
> On Wed, Aug 7, 2019 at 9:02 AM Antoine Pitrou  wrote:
>>
>>
>> This may be coming a bit late, but I realize we could take the
>> opportunity to *also* make the end-of-stream marker an 8-byte marker
>> (rather than 4 bytes).  What do you think?
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 06/08/2019 at 22:15, Wes McKinney wrote:
>>> hi all,
>>>
>>> As we've been discussing for the last 5 weeks or so [1], there is a
>>> need to introduce 4 bytes of padding into the preamble of the
>>> "encapsulated IPC message" format to ensure that the Flatbuffers
>>> metadata payload begins on an 8-byte aligned memory offset. The
>>> alternative to this would be for Arrow implementations where alignment
>>> is important (e.g. C or C++) to copy the metadata (which is not always
>>> small) into memory when it is unaligned.
>>>
>>> Micah has proposed to address this by adding a 4-byte "continuation"
>>> value at the beginning of the payload having the value 0xFFFFFFFF. The
>>> reason to do it this way is that old clients will see an invalid
>>> length (what is currently the first 4 bytes of the message -- a 32-bit
>>> little endian signed integer indicating the metadata length) rather
>>> than potentially crashing on a valid length.
>>>
>>> This would be a backwards incompatible protocol change, so older Arrow
>>> libraries would not be able to read these new messages. Maintaining
>>> forward compatibility (reading data produced by older libraries) would
>>> be possible as we can reason that a value other than the continuation
>>> value was produced by an older library (and then validate the
>>> Flatbuffer message of course). Arrow implementations could offer a
>>> backward compatibility mode for the sake of old readers if they desire
>>> (this may also assist with testing).
>>>
>>> The PR making these changes to the IPC documentation is here
>>>
>>> https://github.com/apache/arrow/pull/4951
>>>
>>> Please vote to accept this change. This vote will be open for at least 72 
>>> hours
>>>
>>> [ ] +1 Adopt the Arrow protocol change
>>> [ ] +0
>>> [ ] -1 I disagree because...
>>>
>>> Here is my vote: +1
>>>
>>> Thanks,
>>> Wes
>>>
>>> [1]: 
>>> https://lists.apache.org/thread.html/8440be572c49b7b2ffb76b63e6d935ada9efd9c1c2021369b6d27786@%3Cdev.arrow.apache.org%3E
>>>


[Discuss][Java] 64-bit lengths for ValueVectors

2019-08-07 Thread Micah Kornfield
Hi Gonzalo,
Thank you for the feedback.  I wasn't aware of the JIT implications.   At
least on the benchmark run they don't seem to have an impact.

If there are other benchmarks that people have that can validate if this
change will be problematic I would appreciate trying to run them with the
PR.  I will try to run the ones for zeroing/popcnt tonight to see if there
is a change in those.

-Micah



On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
golthir...@gmail.com> wrote:

> I would recommend taking care with this kind of change.
>
> I didn't try Arrow in more than one year, but by then the performance was
> quite bad in comparison with plain byte buffer access
> (see http://git.net/apache-arrow-development/msg02353.html *) and
> there are several optimizations that the JVM (specifically, C2) does not
> apply when dealing with longs instead of ints. Among the most common are
> loop unrolling and vectorization.
>
> * It doesn't seem the best way to reference an old email on the list, but
> it is the only result shown by Google
>
> On Wed, Aug 7, 2019 at 11:42, Fan Liya () wrote:
>
>> Hi Micah,
>>
>> Thanks for your effort. The performance result looks good.
>>
>> As you indicated, ArrowBuf will take additional 12 bytes (4 bytes for each
>> of length, write index, and read index).
>> Similar overheads also exist for vectors like BaseFixedWidthVector,
>> BaseVariableWidthVector, etc.
>>
>> IMO, such overheads are small enough to justify the change.
>> Let's check if there are other overheads.
>>
>> Best,
>> Liya Fan
>>
>> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield 
>> wrote:
>>
>> > Hi Liya Fan,
>> > Based on the Float8Benchmark there does not seem to be any meaningful
>> > performance difference on my machine.  At least for me, the benchmarks
>> are
>> > not stable enough to say one is faster than the other (I've pasted
>> results
>> > below).  That being said my machine isn't necessarily the most reliable
>> for
>> > benchmarking.
>> >
>> > On an intuitive level, this makes sense to me; for the most part it
>> > seems like the change just moves casting from "int" to "long" further
>> > up the stack for "PlatformDependent" operations.  If there are other
>> > benchmarks that you think are worth running, let me know.
>> >
>> > One performance downside I think for this change is that it increases
>> > the size of ArrowBuf objects, which I suppose could influence cache
>> > misses at some level or increase the size of call stacks, but this
>> > doesn't seem to show up in the benchmark.
>> >
>> > Thanks,
>> > Micah
>> >
>> > Sample benchmark numbers:
>> >
>> > [New Code]
>> > Benchmark                            Mode  Cnt   Score   Error  Units
>> > Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ± 0.469  us/op
>> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ± 0.115  us/op
>> >
>> >
>> > [Old code]
>> > Benchmark                            Mode  Cnt   Score   Error  Units
>> > Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ± 1.409  us/op
>> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ± 0.084  us/op
>> >
>> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya  wrote:
>> >
>> >> Hi Micah,
>> >>
>> >> Thanks a lot for doing this.
>> >>
>> >> I am a little concerned about whether there is any negative performance
>> >> impact on the current 32-bit-length-based applications.
>> >> Can we do some performance comparison on our existing benchmarks?
>> >>
>> >> Best,
>> >> Liya Fan
>> >>
>> >>
>> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield 
>> >> wrote:
>> >>
>> >>> There have been some previous discussions on the mailing list about
>> >>> supporting 64-bit lengths for Java ValueVectors (this is what the IPC
>> >>> specification and C++ support).  I created a PR [1] that changes all
>> >>> APIs that I could find that take an index to take a "long" instead of
>> >>> an "int" (and similarly change "size/rowcount" APIs).
>> >>>
>> >>> It is a big change, so I think it is worth discussing if it is
>> something
>> >>> we
>> >>> still want to move forward with.  It would be nice to come to a
>> >>> conclusion
>> >>> quickly, ideally in the next few days, to avoid a lot of merge
>> conflicts.
>> >>>
>> >>> The reason I did this work now is the C++ implementation has added
>> >>> support
>> >>> for LargeList, LargeBinary and LargeString arrays and based on prior
>> >>> discussions we need to have similar support in Java before our next
>> >>> release. Supporting 64-bit indexes means we can have full compatibility
>> >>> and make the most use of the types in Java.
>> >>>
>> >>> Look forward to hearing feedback.
>> >>>
>> >>> Thanks,
>> >>> Micah
>> >>>
>> >>> [1] https://github.com/apache/arrow/pull/5020
>> >>>
>> >>
>>
>
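
As a concrete illustration of the kind of signature change being discussed in
this thread (a simplified sketch, not the actual diff in the PR; the helper
below is hypothetical), an accessor that used to take an int index would take a
long, with the offset arithmetic done in long all the way down:

{code}
// Simplified sketch of the int -> long widening; not the actual PR.
public class Float8AccessorSketch {
  private static final int TYPE_WIDTH = 8;

  // Before: public double get(int index) { ... }

  // After: the index and offset are longs; narrowing (if any) happens only at
  // the lowest-level memory call.
  public double get(long index) {
    long offset = index * TYPE_WIDTH;
    return getDoubleAt(offset);
  }

  // Hypothetical stand-in for the underlying ArrowBuf/PlatformDependent read.
  private double getDoubleAt(long offset) {
    return 0.0;
  }
}
{code}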


Re: [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements

2019-08-07 Thread Wes McKinney
You make a good point. For backward compatibility reasons, bytes 5
through 8 would need to be unspecified padding bytes, I think. Does
that sound right?

On Wed, Aug 7, 2019 at 9:02 AM Antoine Pitrou  wrote:
>
>
> This may be coming a bit late, but I realize we could take the
> opportunity to *also* make the end-of-stream marker an 8-byte marker
> (rather than 4 bytes).  What do you think?
>
> Regards
>
> Antoine.
>
>
> On 06/08/2019 at 22:15, Wes McKinney wrote:
> > hi all,
> >
> > As we've been discussing for the last 5 weeks or so [1], there is a
> > need to introduce 4 bytes of padding into the preamble of the
> > "encapsulated IPC message" format to ensure that the Flatbuffers
> > metadata payload begins on an 8-byte aligned memory offset. The
> > alternative to this would be for Arrow implementations where alignment
> > is important (e.g. C or C++) to copy the metadata (which is not always
> > small) into memory when it is unaligned.
> >
> > Micah has proposed to address this by adding a 4-byte "continuation"
> > value at the beginning of the payload having the value 0xFFFFFFFF. The
> > reason to do it this way is that old clients will see an invalid
> > length (what is currently the first 4 bytes of the message -- a 32-bit
> > little endian signed integer indicating the metadata length) rather
> > than potentially crashing on a valid length.
> >
> > This would be a backwards incompatible protocol change, so older Arrow
> > libraries would not be able to read these new messages. Maintaining
> > forward compatibility (reading data produced by older libraries) would
> > be possible as we can reason that a value other than the continuation
> > value was produced by an older library (and then validate the
> > Flatbuffer message of course). Arrow implementations could offer a
> > backward compatibility mode for the sake of old readers if they desire
> > (this may also assist with testing).
> >
> > The PR making these changes to the IPC documentation is here
> >
> > https://github.com/apache/arrow/pull/4951
> >
> > Please vote to accept this change. This vote will be open for at least 72 
> > hours
> >
> > [ ] +1 Adopt the Arrow protocol change
> > [ ] +0
> > [ ] -1 I disagree because...
> >
> > Here is my vote: +1
> >
> > Thanks,
> > Wes
> >
> > [1]: 
> > https://lists.apache.org/thread.html/8440be572c49b7b2ffb76b63e6d935ada9efd9c1c2021369b6d27786@%3Cdev.arrow.apache.org%3E
> >


Re: [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements

2019-08-07 Thread Antoine Pitrou


This may be coming a bit late, but I realize we could take the
opportunity to *also* make the end-of-stream marker an 8-byte marker
(rather than 4 bytes).  What do you think?

Regards

Antoine.


On 06/08/2019 at 22:15, Wes McKinney wrote:
> hi all,
> 
> As we've been discussing for the last 5 weeks or so [1], there is a
> need to introduce 4 bytes of padding into the preamble of the
> "encapsulated IPC message" format to ensure that the Flatbuffers
> metadata payload begins on an 8-byte aligned memory offset. The
> alternative to this would be for Arrow implementations where alignment
> is important (e.g. C or C++) to copy the metadata (which is not always
> small) into memory when it is unaligned.
> 
> Micah has proposed to address this by adding a 4-byte "continuation"
> value at the beginning of the payload having the value 0xFFFFFFFF. The
> reason to do it this way is that old clients will see an invalid
> length (what is currently the first 4 bytes of the message -- a 32-bit
> little endian signed integer indicating the metadata length) rather
> than potentially crashing on a valid length.
> 
> This would be a backwards incompatible protocol change, so older Arrow
> libraries would not be able to read these new messages. Maintaining
> forward compatibility (reading data produced by older libraries) would
> be possible as we can reason that a value other than the continuation
> value was produced by an older library (and then validate the
> Flatbuffer message of course). Arrow implementations could offer a
> backward compatibility mode for the sake of old readers if they desire
> (this may also assist with testing).
> 
> The PR making these changes to the IPC documentation is here
> 
> https://github.com/apache/arrow/pull/4951
> 
> Please vote to accept this change. This vote will be open for at least 72 
> hours
> 
> [ ] +1 Adopt the Arrow protocol change
> [ ] +0
> [ ] -1 I disagree because...
> 
> Here is my vote: +1
> 
> Thanks,
> Wes
> 
> [1]: 
> https://lists.apache.org/thread.html/8440be572c49b7b2ffb76b63e6d935ada9efd9c1c2021369b6d27786@%3Cdev.arrow.apache.org%3E
> 


[jira] [Created] (ARROW-6159) [C++] PrettyPrint of arrow::Schema missing indentation for first line

2019-08-07 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6159:


 Summary: [C++] PrettyPrint of arrow::Schema missing indentation for 
first line
 Key: ARROW-6159
 URL: https://issues.apache.org/jira/browse/ARROW-6159
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.14.1
Reporter: Joris Van den Bossche


Minor issue, but I noticed when printing a Schema with indentation, like:

{code}
  std::shared_ptr<arrow::Field> field1 = arrow::field("column1", arrow::int32());
  std::shared_ptr<arrow::Field> field2 = arrow::field("column2", arrow::utf8());

  std::shared_ptr<arrow::Schema> schema = arrow::schema({field1, field2});

  arrow::PrettyPrintOptions options{4};
  arrow::PrettyPrint(*schema, options, &std::cout);
{code}

you get 

{code}
column1: int32
column2: string
{code}

so not applying the indent for the first line.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-07 Thread Gonzalo Ortiz Jaureguizar
I would recommend taking care with this kind of change.

I didn't try Arrow in more than one year, but by then the performance was
quite bad in comparison with plain byte buffer access
(see http://git.net/apache-arrow-development/msg02353.html *) and
there are several optimizations that the JVM (specifically, C2) does not
apply when dealing with longs instead of ints. Among the most common are
loop unrolling and vectorization.

* It doesn't seem the best way to reference an old email on the list, but
it is the only result shown by Google

On Wed, Aug 7, 2019 at 11:42, Fan Liya () wrote:

> Hi Micah,
>
> Thanks for your effort. The performance result looks good.
>
> As you indicated, ArrowBuf will take additional 12 bytes (4 bytes for each
> of length, write index, and read index).
> Similar overheads also exist for vectors like BaseFixedWidthVector,
> BaseVariableWidthVector, etc.
>
> IMO, such overheads are small enough to justify the change.
> Let's check if there are other overheads.
>
> Best,
> Liya Fan
>
> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield 
> wrote:
>
> > Hi Liya Fan,
> > Based on the Float8Benchmark there does not seem to be any meaningful
> > performance difference on my machine.  At least for me, the benchmarks
> are
> > not stable enough to say one is faster than the other (I've pasted
> results
> > below).  That being said my machine isn't necessarily the most reliable
> for
> > benchmarking.
> >
> > On an intuitive level, this makes sense to me; for the most part it
> > seems like the change just moves casting from "int" to "long" further
> > up the stack for "PlatformDependent" operations.  If there are other
> > benchmarks that you think are worth running, let me know.
> >
> > One performance downside I think for this change is that it increases
> > the size of ArrowBuf objects, which I suppose could influence cache
> > misses at some level or increase the size of call stacks, but this
> > doesn't seem to show up in the benchmark.
> >
> > Thanks,
> > Micah
> >
> > Sample benchmark numbers:
> >
> > [New Code]
> > Benchmark                            Mode  Cnt   Score   Error  Units
> > Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ± 0.469  us/op
> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ± 0.115  us/op
> >
> >
> > [Old code]
> > Benchmark                            Mode  Cnt   Score   Error  Units
> > Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ± 1.409  us/op
> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ± 0.084  us/op
> >
> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya  wrote:
> >
> >> Hi Micah,
> >>
> >> Thanks a lot for doing this.
> >>
> >> I am a little concerned about whether there is any negative performance
> >> impact on the current 32-bit-length-based applications.
> >> Can we do some performance comparison on our existing benchmarks?
> >>
> >> Best,
> >> Liya Fan
> >>
> >>
> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield 
> >> wrote:
> >>
> >>> There have been some previous discussions on the mailing list about
> >>> supporting 64-bit lengths for Java ValueVectors (this is what the IPC
> >>> specification and C++ support).  I created a PR [1] that changes all
> >>> APIs that I could find that take an index to take a "long" instead of
> >>> an "int" (and similarly change "size/rowcount" APIs).
> >>>
> >>> It is a big change, so I think it is worth discussing if it is
> something
> >>> we
> >>> still want to move forward with.  It would be nice to come to a
> >>> conclusion
> >>> quickly, ideally in the next few days, to avoid a lot of merge
> conflicts.
> >>>
> >>> The reason I did this work now is the C++ implementation has added
> >>> support
> >>> for LargeList, LargeBinary and LargeString arrays and based on prior
> >>> discussions we need to have similar support in Java before our next
> >>> release. Supporting 64-bit indexes means we can have full compatibility
> >>> and make the most use of the types in Java.
> >>>
> >>> Look forward to hearing feedback.
> >>>
> >>> Thanks,
> >>> Micah
> >>>
> >>> [1] https://github.com/apache/arrow/pull/5020
> >>>
> >>
>


[jira] [Created] (ARROW-6158) [Python] possible to create StructArray with type that conflicts with child array's types

2019-08-07 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6158:


 Summary: [Python] possible to create StructArray with type that 
conflicts with child array's types
 Key: ARROW-6158
 URL: https://issues.apache.org/jira/browse/ARROW-6158
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Using the Python interface as example. This creates a {{StructArray}} where the 
field types don't match the child array types:

{code}
a = pa.array([1, 2, 3], type=pa.int64())
b = pa.array(['a', 'b', 'c'], type=pa.string())
inconsistent_fields = [pa.field('a', pa.int32()), pa.field('b', pa.float64())]

a = pa.StructArray.from_arrays([a, b], fields=inconsistent_fields) 
{code}

The above works fine. I didn't find anything that errors (eg conversion to 
pandas, slicing), also validation passes, but the type actually has the 
inconsistent child types:

{code}
In [2]: a
Out[2]: 

-- is_valid: all not null
-- child 0 type: int64
  [
1,
2,
3
  ]
-- child 1 type: string
  [
"a",
"b",
"c"
  ]

In [3]: a.type
Out[3]: StructType(struct<a: int32, b: double>)

In [4]: a.to_pandas()
Out[4]: 
array([{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3, 'b': 'c'}],
  dtype=object)

In [5]: a.validate() 
{code}

Shouldn't this be disallowed somehow? (it could be checked in the Python 
{{from_arrays}} method, but maybe also in {{StructArray::Make}} which already 
checks for the number of fields vs arrays and a consistent array length). 

Similar to the discussion in ARROW-6132, I would also expect that 
{{ValidateArray}} catches this.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults

2019-08-07 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6157:


 Summary: [Python][C++] UnionArray with invalid data passes 
validation / leads to segfaults
 Key: ARROW-6157
 URL: https://issues.apache.org/jira/browse/ARROW-6157
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Joris Van den Bossche


From the Python side, you can create an "invalid" UnionArray:

{code}
binary = pa.array([b'a', b'b', b'c', b'd'], type='binary') 
int64 = pa.array([1, 2, 3], type='int64') 
types = pa.array([0, 1, 0, 0, 2, 1, 0], type='int8')   # <- value of 2 is out 
of bounds for the number of children
value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')

a = pa.UnionArray.from_dense(types, value_offsets, [binary, int64])
{code}

Eg on conversion to python this leads to a segfault:

{code}
In [7]: a.to_pylist()
Segmentation fault (core dumped)
{code}

On the other hand, doing an explicit validation does not give an error:

{code}
In [8]: a.validate()
{code}

Should the validation raise errors for this case? (the C++ {{ValidateVisitor}} 
for UnionArray does nothing)




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6156) [Java] Support compare semantics for ArrowBufPointer

2019-08-07 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6156:
---

 Summary: [Java] Support compare semantics for ArrowBufPointer
 Key: ARROW-6156
 URL: https://issues.apache.org/jira/browse/ARROW-6156
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Compare two Arrow buffer pointers by their content in lexicographic order.

null is smaller, and a shorter buffer is smaller.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
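
A sketch of the comparison semantics ARROW-6156 describes, written over plain
byte arrays for illustration (the real change targets ArrowBufPointer and
compares ArrowBuf regions; unsigned byte comparison is assumed here):

{code}
// Illustration of the semantics only: null sorts first, then lexicographic
// unsigned byte order, and a shorter buffer that is a prefix sorts first.
final class LexicographicCompareSketch {
  static int compare(byte[] left, byte[] right) {
    if (left == null || right == null) {
      return (left == null ? 0 : 1) - (right == null ? 0 : 1);
    }
    int minLength = Math.min(left.length, right.length);
    for (int i = 0; i < minLength; i++) {
      int cmp = Integer.compare(Byte.toUnsignedInt(left[i]), Byte.toUnsignedInt(right[i]));
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(left.length, right.length);   // shorter buffer is smaller
  }
}
{code}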


Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-07 Thread Fan Liya
Hi Micah,

Thanks for your effort. The performance result looks good.

As you indicated, ArrowBuf will take additional 12 bytes (4 bytes for each
of length, write index, and read index).
Similar overheads also exist for vectors like BaseFixedWidthVector,
BaseVariableWidthVector, etc.

IMO, such overheads are small enough to justify the change.
Let's check if there are other overheads.

Best,
Liya Fan

On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield 
wrote:

> Hi Liya Fan,
> Based on the Float8Benchmark there does not seem to be any meaningful
> performance difference on my machine.  At least for me, the benchmarks are
> not stable enough to say one is faster than the other (I've pasted results
> below).  That being said my machine isn't necessarily the most reliable for
> benchmarking.
>
> On an intuitive level, this makes sense to me; for the most part it seems
> like the change just moves casting from "int" to "long" further up the
> stack for "PlatformDependent" operations.  If there are other benchmarks
> that you think are worth running, let me know.
>
> One performance downside I think for this change is that it increases the
> size of ArrowBuf objects, which I suppose could influence cache misses at
> some level or increase the size of call stacks, but this doesn't seem to
> show up in the benchmark.
>
> Thanks,
> Micah
>
> Sample benchmark numbers:
>
> [New Code]
> Benchmark                            Mode  Cnt   Score   Error  Units
> Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ± 0.469  us/op
> Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ± 0.115  us/op
>
>
> [Old code]
> Benchmark                            Mode  Cnt   Score   Error  Units
> Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ± 1.409  us/op
> Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ± 0.084  us/op
>
> On Tue, Aug 6, 2019 at 1:18 AM Fan Liya  wrote:
>
>> Hi Micah,
>>
>> Thanks a lot for doing this.
>>
>> I am a little concerned about whether there is any negative performance
>> impact on the current 32-bit-length-based applications.
>> Can we do some performance comparison on our existing benchmarks?
>>
>> Best,
>> Liya Fan
>>
>>
>> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield 
>> wrote:
>>
>>> There have been some previous discussions on the mailing list about
>>> supporting 64-bit lengths for Java ValueVectors (this is what the IPC
>>> specification and C++ support).  I created a PR [1] that changes all APIs
>>> that I could find that take an index to take a "long" instead of an "int"
>>> (and similarly change "size/rowcount" APIs).
>>>
>>> It is a big change, so I think it is worth discussing if it is something
>>> we
>>> still want to move forward with.  It would be nice to come to a
>>> conclusion
>>> quickly, ideally in the next few days, to avoid a lot of merge conflicts.
>>>
>>> The reason I did this work now is the C++ implementation has added
>>> support
>>> for LargeList, LargeBinary and LargeString arrays and based on prior
>>> discussions we need to have similar support in Java before our next
>>> release. Supporting 64-bit indexes means we can have full compatibility and
>>> make the most use of the types in Java.
>>>
>>> Look forward to hearing feedback.
>>>
>>> Thanks,
>>> Micah
>>>
>>> [1] https://github.com/apache/arrow/pull/5020
>>>
>>


Aw: Re: [Discuss] C++ filenames: hyphens or underscores?

2019-08-07 Thread hans-joachim . bothe
Hi,

I struggled a long time over whether underscore or hyphen is better.
The result was a mess, using both - never sure which one I actually used in a 
certain case.

Today I am consistently using underscores in variable names and hyphens in file 
names.
- No more trial and error.
- Compliant with Linux naming, as Kou rightly mentioned.
- No chance to use hyphens in variable names - so I never forget which way it 
is.

Hans



> Sent: Wednesday, 07 August 2019 at 07:44
> From: "Philipp Moritz" 
> To: dev@arrow.apache.org, emkornfi...@gmail.com
> Subject: Re: [Discuss] C++ filenames: hyphens or underscores?
>
> I also have a small preference for underscores but would also be fine with
> dashes.
> 
> It seems to be more common (and therefore blends better with vendored
> code), agrees with the style guide, and is closest to the existing code.
> Also, as an aside, having file_names like variable_names is nice. Compare
> the Lispy way of using dashes for both.
> 
> Thanks for getting this discussion started; the mixture of dashes and
> underscores has been bothering me too :)
> 
> On Tue, Aug 6, 2019 at 8:41 PM Micah Kornfield 
> wrote:
> 
> > I also have a preference for underscore but can get used to anything.
> >
> > I agree with the points François made above about the recommendation of the
> > style guide and the smaller change to the existing code base.
> >
> > On Tue, Aug 6, 2019 at 6:52 PM Francois Saint-Jacques <
> > fsaintjacq...@gmail.com> wrote:
> >
> > > My vote would go with underscores, to minimize changes and minimize
> > > exceptions to the Google style guide reference. I also suggest that
> > > we add this to the linters somehow, if it's not too much trouble.
> > >
> > > François
> > >
> > > On Tue, Aug 6, 2019 at 9:35 PM Sutou Kouhei  wrote:
> > > >
> > > > Hi,
> > > >
> > > > I like hyphens.
> > > >
> > > > Because more Linux commands use hyphens than
> > > > underscores. Here are counts on my Debian GNU/Linux machine:
> > > >
> > > > % ls /usr/bin/ | grep -- - | wc -l
> > > > 956
> > > > % ls /usr/bin/ | grep _ | wc -l
> > > > 343
> > > >
> > > >
> > > > Thanks,
> > > > --
> > > > kou
> > > >
> > > > In <20190806140340.2a7ffab2@fsol>
> > > >   "[Discuss] C++ filenames: hyphens or underscores?" on Tue, 6 Aug 2019
> > > 14:03:40 +0200,
> > > >   Antoine Pitrou  wrote:
> > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > The filenames in the C++ source tree are a bit ad hoc and
> > > > > inconsistent.  Sometimes they use hyphens for word separation,
> > > > > sometimes underscores.  In ARROW-4648 it was proposed that we unify
> > > > > C++ file naming, so there are two possible options: only hyphens or
> > > > > only underscores.
> > > > >
> > > > > What are your preferences?  Personally, I have a slight preference
> > > > > for hyphens, especially as they are already used in binary names.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > >
> >
>


Re: [DISCUSS] Add GetFlightSchema to Flight RPC

2019-08-07 Thread Ryan Murray
As per everyone's feedback, I have renamed GetFlightSchema -> GetSchema and
have removed the descriptor from the RPC result message. The doc has been
updated, as has the draft PR.
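
For concreteness, here is a rough sketch of what calling the proposed RPC
might look like from a Java Flight client. This is only a sketch: it assumes
the client-side method lands as getSchema() returning a SchemaResult, as in
the draft PR, and the server location and descriptor path below are made up.

import org.apache.arrow.flight.FlightClient;
import org.apache.arrow.flight.FlightDescriptor;
import org.apache.arrow.flight.Location;
import org.apache.arrow.flight.SchemaResult;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.types.pojo.Schema;

public class GetSchemaExample {
  public static void main(String[] args) throws Exception {
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         // Hypothetical endpoint; replace with a real Flight server location.
         FlightClient client = FlightClient.builder(allocator,
             Location.forGrpcInsecure("localhost", 47470)).build()) {
      // Hypothetical descriptor path, for illustration only.
      FlightDescriptor descriptor = FlightDescriptor.path("sales", "2019");
      // Proposed RPC: fetch only the schema, without asking for endpoints.
      SchemaResult result = client.getSchema(descriptor);
      Schema schema = result.getSchema();
      System.out.println(schema);
      // Note: dictionary batches are not part of this message; they arrive
      // as data in each endpoint's DoGet stream.
    }
  }
}
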

On Thu, Aug 1, 2019 at 6:32 PM Bryan Cutler  wrote:

> Sounds good to me, I would just echo what others have said.
>
> On Thu, Aug 1, 2019 at 8:17 AM Ryan Murray  wrote:
>
> > Thanks Wes,
> >
> > The descriptor is only there to maintain a bit of symmetry with
> > GetFlightInfo. Happy to remove it; I don't think it's necessary, and a
> > few people already agree. Similarly, for the method name, I am neutral
> > on the naming and can call it whatever the community is happy with.
> >
> > Best,
> > Ryan
> >
> > On Thu, Aug 1, 2019 at 3:56 PM Wes McKinney  wrote:
> >
> > > I'm generally supportive of adding the new RPC endpoint.
> > >
> > > To make a couple points from the document
> > >
> > > * I'm not sure what the purpose of returning the FlightDescriptor is,
> > > but I haven't thought too hard about it
> > > * The Schema consists of a single IPC message -- dictionaries will
> > > appear in the actual DoGet stream. To motivate why this is --
> > > different endpoints might have different dictionaries corresponding to
> > > fields in the schema, and having static/constant dictionaries in a
> > > distributed Flight setting is likely to be impractical. I summarize
> > > the issue as "dictionaries are data, not metadata".
> > > * I would be OK calling this GetSchema instead of GetFlightSchema but
> > > either is okay
> > >
> > > - Wes
> > >
> > > On Thu, Aug 1, 2019 at 8:08 AM David Li  wrote:
> > > >
> > > > Hi Ryan,
> > > >
> > > > Thanks for writing this up! I made a couple of minor comments in the
> > > > doc/implementation, but overall I'm in favor of having this RPC
> > > > method.
> > > >
> > > > Best,
> > > > David
> > > >
> > > > On 8/1/19, Ryan Murray  wrote:
> > > > > Hi All,
> > > > >
> > > > > Please see the attached document for a proposed addition to the
> > > > > Flight RPC[1]. This is the result of a previous mailing list
> > > > > discussion[2].
> > > > >
> > > > > I have created the Pull Request[3] to make the proposal a little
> > > > > more concrete.
> > > > > 
> > > > > Please let me know if you have any questions or concerns.
> > > > >
> > > > > Best,
> > > > > Ryan
> > > > >
> > > > > [1]: https://docs.google.com/document/d/1zLdFYikk3owbKpHvJrARLMlmYpi-Ef6OJy7H90MqViA/edit?usp=sharing
> > > > > [2]: https://lists.apache.org/thread.html/3539984493cf3d4d439bef25c150fa9e09e0b43ce0afb6be378d41df@%3Cdev.arrow.apache.org%3E
> > > > > [3]: https://github.com/apache/arrow/pull/4980
> > > > >
> > >
> >
> >
> > --
> >
> > Ryan Murray  | Principal Consulting Engineer
> >
> > +447540852009 | rym...@dremio.com
> >
> >
>


-- 

Ryan Murray  | Principal Consulting Engineer

+447540852009 | rym...@dremio.com




Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-07 Thread Micah Kornfield
Hi Liya Fan,
Based on the Float8Benchmark there does not seem to be any meaningful
performance difference on my machine.  At least for me, the benchmarks are
not stable enough to say one is faster than the other (I've pasted results
below).  That being said my machine isn't necessarily the most reliable for
benchmarking.

On an intuitive level, this makes sense to me: for the most part it seems
like the change just moves casting from "int" to "long" further up the
stack for "PlatformDependent" operations.  If there are other benchmarks
that you think are worth running let me know.
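
To make the "casting moves up the stack" point concrete, below is a minimal
sketch -- with made-up class and field names, not the actual ArrowBuf or
vector code -- of the shape a long-indexed accessor takes. The index and the
address arithmetic stay in long all the way down to the PlatformDependent
call, which already accepts a long address, so no narrowing cast is needed in
the vector layer.

import io.netty.util.internal.PlatformDependent;

// Hypothetical reader over a raw buffer of 8-byte floating point values.
final class LongIndexedDoubleReader {
  private static final long TYPE_WIDTH = 8;   // bytes per double
  private final long dataAddress;             // start address of the value buffer
  private final long valueCount;

  LongIndexedDoubleReader(long dataAddress, long valueCount) {
    this.dataAddress = dataAddress;
    this.valueCount = valueCount;
  }

  double get(long index) {                    // index is long end-to-end
    if (index < 0 || index >= valueCount) {
      throw new IndexOutOfBoundsException("index: " + index);
    }
    // All arithmetic stays in long; PlatformDependent.getLong takes a long
    // address, so the old int-based narrowing simply disappears here.
    long address = dataAddress + index * TYPE_WIDTH;
    return Double.longBitsToDouble(PlatformDependent.getLong(address));
  }
}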

One performance downside I think for this change is that it increases the
size of ArrowBuf objects, which I suppose could influence cache misses at
some level or increase the size of call stacks, but this doesn't seem to
show up in the benchmarks.

Thanks,
Micah

Sample benchmark numbers:

[New Code]
Benchmark                            Mode  Cnt   Score   Error  Units
Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ± 0.469  us/op
Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ± 0.115  us/op


[Old code]
Benchmark                            Mode  Cnt   Score   Error  Units
Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ± 1.409  us/op
Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ± 0.084  us/op
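
For context, this is roughly the kind of JMH read/write loop such a benchmark
measures. It is a sketch only, not the actual Float8Benchmarks source; the
vector size, setup, and class name are assumptions.

import java.util.concurrent.TimeUnit;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.Float8Vector;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class Float8ReadWriteSketch {
  private static final int VALUE_COUNT = 1024;  // assumed vector size

  private BufferAllocator allocator;
  private Float8Vector vector;

  @Setup
  public void setUp() {
    allocator = new RootAllocator(Long.MAX_VALUE);
    vector = new Float8Vector("vector", allocator);
    vector.allocateNew(VALUE_COUNT);
    vector.setValueCount(VALUE_COUNT);
  }

  @TearDown
  public void tearDown() {
    vector.close();
    allocator.close();
  }

  @Benchmark
  public double readWrite() {
    double sum = 0;
    for (int i = 0; i < VALUE_COUNT; i++) {
      vector.set(i, i + 0.5);  // write path
      sum += vector.get(i);    // read path
    }
    return sum;  // returning the value keeps JMH from eliminating the loop
  }
}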

On Tue, Aug 6, 2019 at 1:18 AM Fan Liya  wrote:

> Hi Micah,
>
> Thanks a lot for doing this.
>
> I am a little concerned about whether there is any negative performance
> impact on the current 32-bit-length based applications.
> Can we do some performance comparisons on our existing benchmarks?
>
> Best,
> Liya Fan
>
>
> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield 
> wrote:
>
>> There have been some previous discussions on the mailing list about
>> supporting 64-bit lengths for Java ValueVectors (this is what the IPC
>> specification and C++ support).  I created a PR [1] that changes all APIs
>> that I could find that take an index to take a "long" instead of an "int"
>> (and similarly changes "size/rowcount" APIs).
>>
>> It is a big change, so I think it is worth discussing if it is something
>> we still want to move forward with.  It would be nice to come to a
>> conclusion quickly, ideally in the next few days, to avoid a lot of merge
>> conflicts.
>>
>> The reason I did this work now is that the C++ implementation has added
>> support for LargeList, LargeBinary and LargeString arrays, and based on
>> prior discussions we need to have similar support in Java before our next
>> release. Supporting 64-bit indexes means we can have full compatibility
>> and make the most use of those types in Java.
>>
>> Look forward to hearing feedback.
>>
>> Thanks,
>> Micah
>>
>> [1] https://github.com/apache/arrow/pull/5020
>>
>