RecordBatch.length vs. Buffer.length?

2019-05-03 Thread Jeffrey Green
Hello!

I'm using the Java API for Arrow and am finding some ambiguity between the
length field in a RecordBatch and the "byte-width-adjusted" length field in
a Buffer.

As per https://arrow.apache.org/docs/format/Metadata.html under the "Record
data headers" section:
"A record batch is a collection of top-level named, equal length Arrow
arrays (or vectors)."

This seems to correspond to org.apache.arrow.flatbuf.RecordBatch.length()
when reading and VectorSchemaRoot.setRowCount() when writing.  In addition
to this field, each array buffer has its own specific length in bytes.

As a library developer (particularly on the consumer side), what is the
proper behavior when these two numbers don't match or when array lengths
don't match each other?  For example, I can use the ArrowFileWriter to
create a two-column file where I setRowCount to 8, add 100 ints to the
first column and 300 ints to the second column and everything seems to
"work" fine even though this doesn't seem to be internally consistent.

If these various length fields are supposed to correspond to each other /
represent the same thing, then having two different accounts of the same
value seems error-prone and ambiguous.  Why does the format not exclusively
use RecordBatch.length combined with each array's bitWidth?  The product of
the two seems like it should be equivalent to Buffer.length.

As such, I think I must be missing something and am looking for more
clarity on how to think about and process RecordBatch.length and
Buffer.length (once I divide by bytesPerElement).
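For concreteness, here is the kind of consumer-side sanity check I have in mind, sketched in plain Python (the helper names are mine, not Arrow APIs). As I understand it, Buffer.length may legitimately exceed rowCount * bitWidth / 8 because writers pad buffers for alignment, so the sketch treats the product as a lower bound rather than demanding strict equality:

```python
import math

def min_buffer_bytes(row_count: int, bit_width: int) -> int:
    """Smallest data-buffer size (in bytes) that can hold row_count values
    of the given bit width, ignoring any alignment padding a writer adds."""
    return math.ceil(row_count * bit_width / 8)

def check_batch(row_count: int, columns: dict) -> list:
    """columns maps name -> (bit_width, buffer_length_bytes).
    Returns (name, reason) for every buffer too small for row_count."""
    problems = []
    for name, (bit_width, buf_len) in columns.items():
        needed = min_buffer_bytes(row_count, bit_width)
        if buf_len < needed:
            problems.append((name, f"needs >= {needed} bytes, has {buf_len}"))
    return problems

# The example from above: rowCount = 8, but 100 and 300 int32s were written.
# Both buffers are *larger* than required, so the check passes, and a consumer
# trusting RecordBatch.length would silently see only the first 8 values.
assert check_batch(8, {"a": (32, 400), "b": (32, 1200)}) == []
# A buffer genuinely too small for the declared row count is flagged:
assert check_batch(200, {"a": (32, 400)}) != []
```

With a lower-bound check like this, the "8 rows declared, 100/300 written" file is self-consistent as far as a reader can tell, which is why nothing complains.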

Thanks.


Re: [VOTE] Add new DurationInterval Type to Arrow Format

2019-05-03 Thread Wes McKinney
I've just reviewed the format and C++ changes in
https://github.com/apache/arrow/pull/3644 which look good to me modulo
minor comments.

Can someone take a look at the Java changes soon so we move this
toward completion?

One question that came up is whether "DurationInterval" is the name we
want. It might be clearer to call it simply "Duration".
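(As an aside, the naive semantics in the quoted proposal below amount to plain integer arithmetic in the chosen TimeUnit; a quick illustrative Python sketch, not Arrow code:)

```python
# Naive Duration semantics from the proposal: a Duration is an 8-byte integer
# count of TimeUnit ticks, combined with Timestamps by plain integer math,
# with no leap-second or calendar adjustment.
def add_duration(timestamp: int, duration: int) -> int:
    """t2 = t1 + d; may differ from true elapsed wall time by leap seconds."""
    return timestamp + duration

def duration_between(t1: int, t2: int) -> int:
    """A Duration as the (equally naive) difference of two timestamps."""
    return t2 - t1

ninety_days_ms = 90 * 24 * 60 * 60 * 1000
t1 = 1546300800000                      # 2019-01-01T00:00:00Z in epoch millis
t2 = add_duration(t1, ninety_days_ms)
assert duration_between(t1, t2) == ninety_days_ms
```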

On Tue, Apr 30, 2019 at 6:57 PM Micah Kornfield  wrote:
>
> Sorry for the typo. OK, I think https://github.com/apache/arrow/pull/3644 is
> now ready to review.
>
> On Tue, Apr 30, 2019 at 4:56 PM Micah Kornfield 
> wrote:
>
> > OK, I think https://github.com/apache/arrow/pull/3644 is now ready to
> > review.
> >
> > It includes a Java implementation of DurationInterval and C++
> > implementations of DurationInterval and the original interval types.  I
> > added documentation to Schema.fbs regarding the original interval types
> > (TL;DR: YEAR_MONTH is expected to be supported by all implementations,
> > DAY_TIME is not, which I believe is based on previous ML conversations).
> > Please let me know if there are issues with this language and I can remove
> > it.
> >
> >
> > On Monday, April 8, 2019, Krisztián Szűcs 
> > wrote:
> >
> >> The vote carries with 4 binding +1 votes.
> >>
> >> Micah, what are the next steps?
> >> Are you going to finalize the PR?
> >>
> >> On Sun, Apr 7, 2019 at 11:13 AM Uwe L. Korn  wrote:
> >>
> >> > +1 (binding)
> >> >
> >> > On Sat, Apr 6, 2019, at 2:44 AM, Kouhei Sutou wrote:
> >> > > +1 (binding)
> >> > >
> >> > > In  >> p...@mail.gmail.com>
> >> > >   "[VOTE] Add new DurationInterval Type to Arrow Format" on Wed, 3 Apr
> >> > > 2019 07:59:56 -0700,
> >> > >   Jacques Nadeau  wrote:
> >> > >
> >> > > > I'd like to propose a change to the Arrow format to support a new
> >> > > > duration type. Details below. Threads on mailing list around discussion.
> >> > > >
> >> > > >
> >> > > > /// An absolute length of time unrelated to any calendar artifacts. For
> >> > > > /// the purposes of Arrow implementations, adding this value to a
> >> > > > /// Timestamp ("t1") naively (i.e. simply summing the two numbers) is
> >> > > > /// acceptable even though in some cases the resulting Timestamp (t2)
> >> > > > /// would not account for leap-seconds during the elapsed time between
> >> > > > /// "t1" and "t2".  Similarly, representing the difference between two
> >> > > > /// Unix timestamps is acceptable, but would yield a value that is
> >> > > > /// possibly a few seconds off from the true elapsed time.
> >> > > > ///
> >> > > > /// The resolution defaults to millisecond, but can be any of the other
> >> > > > /// supported TimeUnit values as with Timestamp and Time types.  This
> >> > > > /// type is always represented as an 8-byte integer.
> >> > > > table DurationInterval {
> >> > > >   unit: TimeUnit = MILLISECOND;
> >> > > > }
> >> > > >
> >> > > >
> >> > > > Please vote whether to accept the changes. The vote will be open
> >> > > > for at least 72 hours.
> >> > > >
> >> > > > [ ] +1 Accept these changes to the Flight protocol
> >> > > > [ ] +0
> >> > > > [ ] -1 Do not accept the changes because...
> >> > >
> >> >
> >>
> >


[jira] [Created] (ARROW-5257) [Website] Update site to use "official" Apache Arrow logo, add clearly marked links to logo

2019-05-03 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5257:
---

 Summary: [Website] Update site to use "official" Apache Arrow 
logo, add clearly marked links to logo
 Key: ARROW-5257
 URL: https://issues.apache.org/jira/browse/ARROW-5257
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Wes McKinney
 Fix For: 0.14.0


See logo at 
https://docs.google.com/presentation/d/1qmvPpFU7sdm9l6A6LEyI0zIzswGtJW0Sbd_lfHLaXQs/edit#slide=id.g4258234456_0_1

An unofficial logo lacking the "Apache" name has been making the rounds on the 
internet, so I think it would be a good idea to update our web properties with 
the approved logo as discussed on the mailing list.

Whoever does this task -- please make sure to compress the PNG asset of the 
logo prior to checking it in to source control.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5256) [Packaging][deb] Failed to build with LLVM 7.1.0

2019-05-03 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-5256:
---

 Summary: [Packaging][deb] Failed to build with LLVM 7.1.0
 Key: ARROW-5256
 URL: https://issues.apache.org/jira/browse/ARROW-5256
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva, Packaging
Reporter: Kouhei Sutou


https://travis-ci.org/ursa-labs/crossbow/builds/527710714#L6144-L6157

{noformat}
CMake Error at cmake_modules/FindLLVM.cmake:33 (find_package):
  Could not find a configuration file for package "LLVM" that is compatible
  with requested version "7.0".
  The following configuration files were considered but not accepted:
/usr/lib/llvm-7/cmake/LLVMConfig.cmake, version: 7.1.0
/usr/lib/llvm-7/lib/cmake/llvm/LLVMConfig.cmake, version: 7.1.0
/usr/lib/llvm-7/share/llvm/cmake/LLVMConfig.cmake, version: 7.1.0
/usr/lib/llvm-3.8/share/llvm/cmake/LLVMConfig.cmake, version: 3.8.1
/usr/share/llvm-3.8/cmake/LLVMConfig.cmake, version: 3.8.1
Call Stack (most recent call first):
  src/gandiva/CMakeLists.txt:31 (find_package)
{noformat}

Can we use "7" instead of "7.0" for {{ARROW_LLVM_VERSION}}?





Re: How about inet4/inet6/macaddr data types?

2019-05-03 Thread David Li
Sure, I've created https://issues.apache.org/jira/browse/ARROW-5255.

PR: https://github.com/apache/arrow/pull/4251

I'm not sure that what I'm doing with my Vector subclass is quite right,
but we'd especially like this in Java, so I'm happy to work through any
feedback.

Also, as part of this discussion, I think the original C++
implementation noted that this metadata would not round-trip through
Pandas. We would definitely like that feature if possible - maybe
column-level metadata could be saved under a special field in the
dataframe-level Pandas metadata?
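To make the idea concrete, something along these lines is what I'm imagining -- a purely hypothetical layout, not an existing Arrow or pandas API, where per-field custom metadata is tucked under a reserved key in the single schema-level pandas metadata blob so it can survive the round trip:

```python
import json

# Hypothetical layout: stash each field's custom metadata under a reserved
# "field_metadata" key inside the schema-level "pandas" metadata JSON blob.
def pack(pandas_meta: dict, field_meta: dict) -> str:
    blob = dict(pandas_meta)
    blob["field_metadata"] = field_meta   # e.g. per-column extension info
    return json.dumps(blob)

def unpack(blob: str) -> dict:
    return json.loads(blob).get("field_metadata", {})

blob = pack({"columns": ["addr"]},
            {"addr": {"ARROW:extension:name": "inet4"}})
assert unpack(blob)["addr"]["ARROW:extension:name"] == "inet4"
```

The key names here ("field_metadata", "ARROW:extension:name") are illustrations of the namespacing idea discussed below, not settled conventions.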

Best,
David

On 5/3/19, Wes McKinney  wrote:
> hi David -- would you like to open a PR and corresponding JIRA issue
> for discussion? We might want to hold a vote to formalize the
> extension type mechanism (and to fix the metadata names -- I agree
> that having an ARROW namespace would be better than what we are doing
> now)
>
> On Thu, May 2, 2019 at 7:02 AM David Li  wrote:
>>
>> Re: Java support, I've sketched out an implementation:
>> https://github.com/lihalite/arrow/pull/2
>>
>> On 5/1/19, Micah Kornfield  wrote:
>> >>
>> >> I'm awaiting community feedback about the approach to implementing
>> >> extension types, whether the approach that I've used (using the
>> >> following keys in custom_metadata [1]) is the one that we want to use
>> >> longer-term. This certainly seems like a good time to have that
>> >> discussion. If there is consensus then we can document it formally in
>> >> the specification documents, and we probably will want to hold a vote
>> >> to ensure that we are in agreement.
>> >>
>> >
> >> > Please let me know if this is best on a separate thread. I think I would
> >> > feel more comfortable finalizing this if we had a few more examples
> >> > exercising the functionality.  Inet seems like a complicated enough
> >> > use-case for modeling, which would make it a good test case (it seems
> >> > like it might involve a struct/union?).  I also presume we will need a
> >> > Java implementation before we finalize anything?
>> >
> >> > A small amount of bikeshedding on key names: We should probably take a
> >> > namespace reservation approach for custom metadata in Schema.fbs [1].
> >> > In this regard I have a small preference for something reserving all
> >> > metadata with something like "ARROW:" or "ARROW." (not an
> >> > underscore, and I'm open to different capitalization.)  This seems to be
> >> > a similar approach to how Avro reserves metadata keys [2].
>> >
>> > [1]
>> > https://github.com/apache/arrow/blob/b8aeb79e94a5a507aeec55d0b6c6bf5d7f0100b2/format/Schema.fbs#L264
>> > [2] https://avro.apache.org/docs/1.8.1/spec.html
>> >
>


[jira] [Created] (ARROW-5255) [Java] Implement user-defined data types API

2019-05-03 Thread David Li (JIRA)
David Li created ARROW-5255:
---

 Summary: [Java] Implement user-defined data types API
 Key: ARROW-5255
 URL: https://issues.apache.org/jira/browse/ARROW-5255
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: David Li








[jira] [Created] (ARROW-5254) [Flight][Java] DoAction does not support result streams

2019-05-03 Thread David Li (JIRA)
David Li created ARROW-5254:
---

 Summary: [Flight][Java] DoAction does not support result streams
 Key: ARROW-5254
 URL: https://issues.apache.org/jira/browse/ARROW-5254
 Project: Apache Arrow
  Issue Type: Bug
  Components: FlightRPC, Java
Reporter: David Li
Assignee: David Li
 Fix For: 0.14.0


While Flight defines DoAction as returning a stream of results, the Java APIs 
only allow returning a single result.





Re: How about inet4/inet6/macaddr data types?

2019-05-03 Thread Wes McKinney
hi David -- would you like to open a PR and corresponding JIRA issue
for discussion? We might want to hold a vote to formalize the
extension type mechanism (and to fix the metadata names -- I agree
that having an ARROW namespace would be better than what we are doing
now)

On Thu, May 2, 2019 at 7:02 AM David Li  wrote:
>
> Re: Java support, I've sketched out an implementation:
> https://github.com/lihalite/arrow/pull/2
>
> On 5/1/19, Micah Kornfield  wrote:
> >>
> >> I'm awaiting community feedback about the approach to implementing
> >> extension types, whether the approach that I've used (using the
> >> following keys in custom_metadata [1]) is the one that we want to use
> >> longer-term. This certainly seems like a good time to have that
> >> discussion. If there is consensus then we can document it formally in
> >> the specification documents, and we probably will want to hold a vote
> >> to ensure that we are in agreement.
> >>
> >
> > Please let me know if this is best on a separate thread. I think I would
> > feel more comfortable finalizing this if we had a few more examples
> > exercising the functionality.  Inet seems like a complicated enough
> > use-case for modeling, which would make it a good test case (it seems like it
> > might involve a struct/union?).  I also presume we will need a Java
> > implementation before we finalize anything?
> >
> > A small amount of bikeshedding on key names: We should probably take a
> > namespace reservation approach for custom metadata in Schema.fbs [1].  In
> > this regard I have a small preference for something reserving all metadata
> > with something like "ARROW:" or "ARROW." (not an
> > underscore, and I'm open to different capitalization.)  This seems to be a
> > similar approach to how avro reserves metadata keys [2].
> >
> > [1]
> > https://github.com/apache/arrow/blob/b8aeb79e94a5a507aeec55d0b6c6bf5d7f0100b2/format/Schema.fbs#L264
> > [2] https://avro.apache.org/docs/1.8.1/spec.html
> >


Re: PARQUET-1411 / PR 4185

2019-05-03 Thread TP Boudreau
No need for apologies, Wes, I appreciate you keeping this on your radar.

I've made the changes and have pushed them to the PR branch.  You can begin
your review when you get the chance.

--TPB

On Thu, May 2, 2019 at 3:32 PM Wes McKinney  wrote:

> + Parquet dev list
>
> Thanks Tim for working on this issue, I'm sorry I haven't been able to
> leave code review yet -- I've been busy with a bunch of other things
> and, since it's a large patch, I wanted to give thoughtful feedback.
>
> Feel free to push some more commits to that PR. I can prioritize
> getting you some feedback in the next couple of working days, just let
> me know when you're ready for me to review.
>
> On Thu, May 2, 2019 at 5:23 PM TP Boudreau  wrote:
> >
> > Hello Parquet-Arrow Team, Wes,
> >
> > A short while ago, I submitted PR 4185 (
> > https://github.com/apache/arrow/pull/4185) to implement in the C++
> library
> > the new logical annotations metadata available in the latest
> parquet.thrift
> > spec (https://issues.apache.org/jira/browse/PARQUET-1411).  I stopped
> > committing to that PR's branch about a week ago to allow the code to be
> > reviewed without it being a moving target.
> >
> > I've since (optimistically) started blocking out new code for ARROW-3729
> > based on my open PR (switching Arrow to read/write the new Parquet
> > annotations, https://issues.apache.org/jira/browse/ARROW-3729) and while
> > doing that realized that usage of the annotations classes I created in
> the
> > open PR might be smoother with the introduction of a few convenience
> > methods.  However, the most suitable names for these methods (IMO) were
> > introduced for another purpose in the open PR and would need to be
> > reclaimed -- overall a fairly minor, non-structural change to the PR
> code.
> >
> > I can either add another commit to the open PR to add these convenience
> > methods and rename some things (which would be my preference, provided no
> > one has invested too much time yet reviewing it -- maybe you have Wes?),
> or
> > I can wait for the first round of reviews on that PR to see where things
> > stand.
> >
> > How should I proceed?
> >
> > Thanks in advance,
> > --Tim
>


RE: [DISCUSS][C++][Proposal] Threading engine for Arrow

2019-05-03 Thread Jed Brown
"Malakhov, Anton"  writes:

>> > the library creates threads internally.  It's a disaster for managing
>> > oversubscription and affinity issues among groups of threads and/or
>> > multiple processes (e.g., MPI).
>
> This is exactly what I'm talking about when I refer to threading composability 
> issues! OpenMP is not easy to use inside a library. I described it in 
> this document: 
> https://cwiki.apache.org/confluence/display/ARROW/Parallel+Execution+Engine

Thanks for this document.  I'm no great fan of OpenMP, but it's being
billed by most vendors (especially Intel) as the way to go in the
scientific computing space and has become relatively popular (much more
so than TBB).

You linked to a NumPy discussion
(https://github.com/numpy/numpy/issues/11826) that is encountering the
same issues, but proposing solutions based on the global environment.
That is perhaps acceptable for typical Python callers due to the GIL,
but C++ callers may be using threads themselves.  A typical example:

App:
  calls libB sequentially:
calls Arrow sequentially (wants to use threads)
  calls libC sequentially:
omp parallel (creates threads somehow):
  calls Arrow from threads (Arrow should not create more)
  omp parallel:
calls libD from threads:
  calls Arrow (Arrow should not create more)

Arrow doesn't need to know the difference between the libC and libD
cases, but it may make a difference to the implementation of those
libraries.  In both of these cases, the user may desire that Arrow
create tasks for load balancing reasons (but no new threads) so long as
they can run on the specified thread team.

I have yet to see a complete solution to this problem, but we should
work out which modes are worth supporting and how that interface would
look.
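One mode that seems worth supporting, sketched in Python for brevity (illustrative names, not a real Arrow API): the library expresses its work as tasks submitted to an executor the application supplies, so it can load-balance without ever creating threads of its own.

```python
from concurrent.futures import Executor, ThreadPoolExecutor

def column_minmax(columns, executor: Executor):
    """Compute (min, max) per column as independent tasks that run on
    whatever thread team the *caller* provides; the library itself never
    spawns threads, it only creates tasks for load balancing."""
    futures = [executor.submit(lambda c=c: (min(c), max(c))) for c in columns]
    return [f.result() for f in futures]

# The application owns the pool and hands it to the library:
with ThreadPoolExecutor(max_workers=2) as pool:
    print(column_minmax([[3, 1, 2], [9, 7]], pool))  # [(1, 3), (7, 9)]
```

In the libC/libD cases above, the application (or the intermediate library) would pass in an executor bound to its existing thread team instead of a fresh pool.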


Global solutions like this one (linked by Antoine)

  
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/thread-pool.cc#L268

imply that threading mode is global and set via an environment variable,
neither of which are true in cases such as the above (and many simpler
cases).


Re: ARROW-3191: Making ArrowBuf work with arbitrary memory and setting io.netty.tryReflectionSetAccessible to true for java builds

2019-05-03 Thread Bryan Cutler
Hi Sidd,

Does setting the system property io.netty.tryReflectionSetAccessible to
true have any other adverse effect other than those warnings during build?

Bryan

On Thu, May 2, 2019 at 8:43 PM Jacques Nadeau  wrote:

> I'm onboard with this change.
>
> On Fri, Apr 26, 2019 at 2:14 AM Siddharth Teotia 
> wrote:
>
> > As part of working on this patch <
> > https://github.com/apache/arrow/pull/4151>,
> > I ran into a problem with jdk 9 and 11 builds.  Since memory underlying
> > ArrowBuf may not necessarily be a ByteBuf (or any of its extensions),
> > methods like nioBuffer() can no longer be delegated as
> > UnsafeDirectLittleEndian.nioBuffer() to Netty implementation.
> >
> > So I used PlatformDependent.directBuffer(memory address, size) to create a
> > direct ByteBuffer to closely mimic what Netty was originally doing
> > underneath for nioBuffer(). It turns out that PlatformDependent code in
> > netty first checks for the existence of constructor DirectByteBuffer(long
> > address, int size) as seen here
> > <
> >
> https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent0.java#L223
> > >.
> > The constructor (long address, int size) is very well available in jdk
> 8, 9
> > and 11 but on the next line it tries to set it accessible. The reflection
> > based access is disabled by default in netty code for jdk >= 9 as seen
> here
> > <
> >
> https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent0.java#L829
> > >.
> > Thus calls to PlatformDependent.directBuffer(address, size) were failing
> in
> > travis CI builds for JDK 9 and 11 with UnsupportedOperationException as
> > seen here
> > <
> >
> https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent.java#L415
> > >
> > and
> > this was because of the decision that was taken by netty at startup w.r.t
> > whether to provide access to constructor or not.
> >
> > We should set the io.netty.tryReflectionSetAccessible system property to true
> > in the Java root pom.
> >
> > I want to make sure people are aware and agree/disagree with this change.
> >
> > The tests now give the following warning:
> >
> > WARNING: An illegal reflective access operation has occurred
> > WARNING: Illegal reflective access by
> io.netty.util.internal.ReflectionUtil
> >
> >
> (file:/Users/siddharthteotia/.m2/repository/io/netty/netty-common/4.1.22.Final/netty-common-4.1.22.Final.jar)
> > to constructor java.nio.DirectByteBuffer(long,int)
> > WARNING: Please consider reporting this to the maintainers of
> > io.netty.util.internal.ReflectionUtil
> > WARNING: Use --illegal-access=warn to enable warnings of further illegal
> > reflective access operations
> > WARNING: All illegal access operations will be denied in a future release
> >
> > Thanks.
> > On Thu, Apr 18, 2019 at 3:39 PM Siddharth Teotia 
> > wrote:
> >
> > > I  have made all the necessary changes in java code to work with new
> > > ArrowBuf, ReferenceManager interfaces. More importantly, there is a
> > wrapper
> > > buffer NettyArrowBuf interface to comply with usage in RPC and Netty
> > > related code. It will be good to get feedback on this one (and of
> course
> > > all other changes).  As of now, the java modules build fine but I have
> to
> > > fix test failures. That is in progress.
> > >
> > > On Wed, Apr 17, 2019 at 6:41 AM Jacques Nadeau 
> > wrote:
> > >
> > >> Are there any other general comments here? If not, let's get this done
> > and
> > >> merged.
> > >>
> > >> On Mon, Apr 15, 2019, 4:19 PM Siddharth Teotia 
> > >> wrote:
> > >>
> > >> > I believe reader/writer indexes are typically used when we send
> > buffers
> > >> > over the wire -- so may not be necessary for all users of
> ArrowBuf.  I
> > >> am
> > >> > okay with the idea of providing a simple wrapper to ArrowBuf to
> manage
> > >> the
> > >> > reader/writer indexes with a couple of APIs. Note that some APIs
> like
> > >> > writeInt, writeLong() bump the writer index unlike setInt/setLong
> > >> > counterparts. JsonFileReader uses some of these APIs.
> > >> >
> > >> >
> > >> >
> > >> > On Sat, Apr 13, 2019 at 2:42 PM Jacques Nadeau 
> > >> wrote:
> > >> >
> > >> > > Hey Sidd,
> > >> > >
> > >> > > Thanks for pulling this together. This looks very promising. One
> > quick
> > >> > > thought: do we think the concept of the reader and writer index
> need
> > >> to
> > >> > be
> > >> > > on ArrowBuf? It seems like something that could be added as an
> > >> additional
> > >> > > decoration/wrapper when needed instead of being part of the core
> > >> > structure.
> > >> > >
> > >> > > On Sat, Apr 13, 2019 at 11:26 AM Siddharth Teotia <
> > >> siddha...@dremio.com>
> > >> > > wrote:
> > >> > >
> > >> > > > Hi All,
> > >> > > >
> > >> > > > I have put a PR with WIP changes. All the major set of changes
> > have
> > >> > been
> > >> > > > done to decouple the usage of ArrowBuf and reference management.
> > The
> > >> > > > ArrowBu

Re: [DISCUSS][C++][Proposal] Threading engine for Arrow

2019-05-03 Thread Antoine Pitrou


Le 03/05/2019 à 17:57, Jed Brown a écrit :
> 
>>> The library is then free to use constructs like omp taskgroup/taskloop
>>> as granularity warrants; it will never utilize threads that the
>>> application didn't explicitly give it.
>>
>> I don't think we're planning to use OpenMP in Arrow, though Wes probably
>> has a better answer.
> 
> I was just using it to demonstrate the semantic.  Regardless of what
> Arrow uses internally, there will be a cohort of users who are
> interested in using Arrow with OpenMP.

I know next to nothing about OpenMP, but we have some code that's
supposed to enable cooperation with OpenMP here:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/thread-pool.cc#L268

If that doesn't work as intended, feel free to open an issue and
describe the problem.

Regards

Antoine.


RE: [DISCUSS][C++][Proposal] Threading engine for Arrow

2019-05-03 Thread Malakhov, Anton
Thanks for your answers,

> -Original Message-
> From: Antoine Pitrou [mailto:anto...@python.org]
> Sent: Friday, May 3, 2019 03:54

> Le 03/05/2019 à 05:47, Jed Brown a écrit :
> > I would caution to please not commit to the MKL/BLAS model in which
I'm actually talking about a threading-layers model, where MKL supports several 
OpenMP runtimes (Intel, GNU, PGI) and TBB, as well as a non-threaded version. It 
even supports dynamic selection; please refer to: 
https://software.intel.com/en-us/mkl-macos-developer-guide-dynamically-selecting-the-interface-and-threading-layer
We implemented the same approach in Numba (#2245):  
https://numba.pydata.org/numba-doc/dev/user/threading-layer.html

> > the library creates threads internally.  It's a disaster for managing
> > oversubscription and affinity issues among groups of threads and/or
> > multiple processes (e.g., MPI).
This is exactly what I'm talking about when I refer to threading composability 
issues! OpenMP is not easy to use inside a library. I described it in 
this document: 
https://cwiki.apache.org/confluence/display/ARROW/Parallel+Execution+Engine

> Implicit multi-threading is important for user-friendliness reasons 
> (especially in
> higher-level bindings such as the Python-bindings).
Cannot agree more! There might not be enough parallelism at the application 
level; adding parallelism from DSLs is important for better CPU utilization, but 
it is also tricky because of these incompatibility issues.

> > The library is then free to use constructs like omp taskgroup/taskloop
> > as granularity warrants; it will never utilize threads that the
> > application didn't explicitly give it.
> 
> I don't think we're planning to use OpenMP in Arrow, though Wes probably has a
> better answer.
I'd not exclude OpenMP from consideration completely. I want to start with TBB, 
but nothing composes better with OpenMP than OpenMP itself. The same MKL 
(i.e. NumPy) defaults to OpenMP threading. BTW, there is no longer a 
compatibility layer between TBB and OpenMP; it was removed from the latter.


> -Original Message-
> From: Antoine Pitrou [mailto:anto...@python.org]
> Sent: Friday, May 3, 2019 03:49
> 
> Another possibility is to look at our C++ CSV reader and parser (in
> src/arrow/csv).  It's the only piece of Arrow that uses non-trivial 
> multi-threading
> right now (with tasks spawning new tasks dynamically, see
> InferringColumnBuilder).  It's based on the ThreadPool and TaskGroup APIs (in
> src/arrow/util/).  These APIs are not set in stone, so you're free to propose
> changes to make them fit better with a TBB-based implementation.
Great! This is what I was looking for!


// Anton



Re: [DISCUSS][C++][Proposal] Threading engine for Arrow

2019-05-03 Thread Jed Brown
Antoine Pitrou  writes:

> Hi Jed,
>
> Le 03/05/2019 à 05:47, Jed Brown a écrit :
>> I would caution to please not commit to the MKL/BLAS model in which the
>> library creates threads internally.  It's a disaster for managing
>> oversubscription and affinity issues among groups of threads and/or
>> multiple processes (e.g., MPI).
>
> Implicit multi-threading is important for user-friendliness reasons
> (especially in higher-level bindings such as the Python-bindings).

I would argue that can be tucked into bindings versus making it
all-or-nothing in the C++ interface.  It's at least worthy of
discussion.

>> The library is then free to use constructs like omp taskgroup/taskloop
>> as granularity warrants; it will never utilize threads that the
>> application didn't explicitly give it.
>
> I don't think we're planning to use OpenMP in Arrow, though Wes probably
> has a better answer.

I was just using it to demonstrate the semantic.  Regardless of what
Arrow uses internally, there will be a cohort of users who are
interested in using Arrow with OpenMP.


[jira] [Created] (ARROW-5253) [C++] external Snappy fails on Alpine

2019-05-03 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-5253:
-

 Summary: [C++] external Snappy fails on Alpine
 Key: ARROW-5253
 URL: https://issues.apache.org/jira/browse/ARROW-5253
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.13.0
Reporter: Francois Saint-Jacques
 Fix For: 0.14.0



{code:bash}
FAILED: debug/libarrow.so.14.0.0 
: && /usr/bin/c++ -fPIC -Wno-noexcept-type  -fdiagnostics-color=always -ggdb 
-O0  -Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror 
-msse4.2  -g  
-Wl,--version-script=/buildbot/amd64-alpine-3_9-cpp/cpp/src/arrow/symbols.map 
-shared -Wl,-soname,libarrow.so.14 -o debug/libarrow.so.14.0.0 
...
c++: error: snappy_ep/src/snappy_ep-install/lib/libsnappy.a: No such file or 
directory
{code}






[jira] [Created] (ARROW-5252) [C++] Change variant implementation

2019-05-03 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5252:
-

 Summary: [C++] Change variant implementation
 Key: ARROW-5252
 URL: https://issues.apache.org/jira/browse/ARROW-5252
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.13.0
Reporter: Antoine Pitrou


Our vendored variant implementation, [Mapbox 
variant|https://github.com/mapbox/variant], does not provide the same API as 
the official [C++17 variant 
class|https://en.cppreference.com/w/cpp/utility/variant].

We could / should switch to an implementation that follows the C++17 API, such 
as https://github.com/mpark/variant or 
https://github.com/martinmoene/variant-lite .





Re: [DISCUSS][C++][Proposal] Threading engine for Arrow

2019-05-03 Thread Antoine Pitrou


Hi Jed,

Le 03/05/2019 à 05:47, Jed Brown a écrit :
> I would caution to please not commit to the MKL/BLAS model in which the
> library creates threads internally.  It's a disaster for managing
> oversubscription and affinity issues among groups of threads and/or
> multiple processes (e.g., MPI).

Implicit multi-threading is important for user-friendliness reasons
(especially in higher-level bindings such as the Python-bindings).

> The library is then free to use constructs like omp taskgroup/taskloop
> as granularity warrants; it will never utilize threads that the
> application didn't explicitly give it.

I don't think we're planning to use OpenMP in Arrow, though Wes probably
has a better answer.

Regards

Antoine.


Re: [DISCUSS][C++][Proposal] Threading engine for Arrow

2019-05-03 Thread Antoine Pitrou


Hi Anton,

Another possibility is to look at our C++ CSV reader and parser (in
src/arrow/csv).  It's the only piece of Arrow that uses non-trivial
multi-threading right now (with tasks spawning new tasks dynamically,
see InferringColumnBuilder).  It's based on the ThreadPool and TaskGroup
APIs (in src/arrow/util/).  These APIs are not set in stone, so you're
free to propose changes to make them fit better with a TBB-based
implementation.
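For reference, the shape of that API, sketched minimally in Python (illustrative only, not the actual C++ signatures): a group whose tasks run on a shared pool and may append further tasks to the same group, with a finish call that blocks until the dynamic task graph has drained.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class TaskGroup:
    """Minimal sketch in the spirit of the arrow/util TaskGroup: tasks run on
    a shared pool and may append new tasks to the same group; finish() waits
    until the whole dynamically-grown set of tasks has completed."""
    def __init__(self, pool):
        self._pool = pool
        self._pending = 0
        self._cv = threading.Condition()

    def append(self, fn, *args):
        with self._cv:
            self._pending += 1          # count the task before it can finish
        def run():
            try:
                fn(*args)               # fn may call group.append() itself
            finally:
                with self._cv:
                    self._pending -= 1
                    if self._pending == 0:
                        self._cv.notify_all()
        self._pool.submit(run)

    def finish(self):
        with self._cv:
            self._cv.wait_for(lambda: self._pending == 0)

results = []
with ThreadPoolExecutor(max_workers=4) as pool:
    group = TaskGroup(pool)
    def parse(chunk):                   # a task that spawns follow-up tasks
        results.append(chunk * 2)
        if chunk < 3:
            group.append(parse, chunk + 1)
    group.append(parse, 1)
    group.finish()
assert sorted(results) == [2, 4, 6]
```

The pending count is incremented before submission and decremented in a finally block, so the group can never appear drained while a running task is still about to spawn children.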

Regards

Antoine.


Le 03/05/2019 à 01:42, Malakhov, Anton a écrit :
> Thanks Wes!
> 
> Sounds like a good way to go! We'll create a demo, as you suggested, 
> implementing a parallel execution model for a simple analytics pipeline that 
> reads and processes the files. My only concern is about adding more pipeline 
> breaker nodes and compute intensive operations into this demo because min/max 
> are effectively no-ops fused into the I/O scan node. What do you think about 
> adding group-by into this picture, effectively implementing the NY taxi and/or 
> mortgage benchmarks? Ideally, I'd like to go even further and add 
> scikit-learn-like stuff for processing that data in order to demonstrate the 
> co-existence side of the story. What do you think?
> So, the idea of the prototype will be to validate the parallel execution 
> model as the first step. After that, it'll help to shape the API for both the 
> execution nodes and the threading backend. Does that sound right to you?
> 
> P.S. I can well understand your hesitation about using TBB directly and as 
> non-optional dependency, thus I'm suggesting threading layers approach here. 
> Please let me clarify myself, using TBB and nested parallelism is non-goal by 
> itself. The goal is to build components of efficient execution model, which 
> coexist well with each other and with all the other, external to Arrow, 
> components of an applications. However, without a rich, composable, and 
> mature parallel toolkit, it is hard to achieve and to focus on this goal. 
> Thus, I wanted to check with the community if it is an acceptable way at all 
> and what's the roadmap.
> 
> Thanks,
> // Anton
> 
> 
> -Original Message-
> From: Wes McKinney [mailto:wesmck...@gmail.com] 
> Sent: Thursday, May 2, 2019 13:52
> To: dev@arrow.apache.org
> Subject: Re: [DISCUSS][C++][Proposal] Threading engine for Arrow
> 
> hi Anton,
> 
> Thank you for bringing your expertise to the project -- this is a very useful 
> discussion to have.
> 
> Partly why our threading capabilities in the project are not further 
> developed is that there is not much that needs to be parallelized. It would 
> be like designing a supercharger when you don't have a car yet.
> That being said, it is worthwhile to plan ahead so we aren't trying to 
> retrofit significant pieces of software to be able to take advantage of a 
> more advanced task scheduler.
> 
> From my perspective, we have a few key practical areas of consideration:
> 
> * Computational tasks that may offer nested parallelism (e.g. an Aggregation 
> or Projection task may be able to execute in multiple
> threads)
> * IO operations performed from within tasks that appear to be computational 
> in nature (example: in the course of reading a Parquet file, both computation 
> -- decoding, decompression -- and IO -- local or remote filesystem operations 
> -- must be performed). The status quo is that IO performed inside a 
> task in the thread pool does not release any resources to other tasks.
> 
> I believe that we should design and develop a sane programming model / API 
> for implementing our software in the presence of these challenges.
> If the backend / implementation of this API uses TBB and that makes things 
> more efficient than other approaches, then that sounds great to me. I would 
> be hesitant to use TBB APIs directly in Arrow application code unless it can 
> be clearly demonstrated by that is a superior option to alternatives.
> 
> It seems useful to validate the implementation approach by starting with some 
> practical problems. Suppose, for the sake of argument, you want to read 10 
> Parquet files (constituting a single logical dataset) as fast as possible and 
> perform some simple analytics on them -- let's take something very simple 
> like computing the maximum and minimum values of each column in the dataset. 
> This problem features both problems listed above:
> 
> * Reading a single Parquet file can be parallelized (by columns -- since 
> columns can be decoded in parallel) on the global thread pool, so reading 
> multiple files in parallel would cause nested parallelism
> * Within the context of reading a single Parquet file column, IO calls are 
> performed. CPU threads sit idle while this IO is taking place, particularly 
> if the file system is high latency (e.g. HDFS)
> 
> What do you think about -- as a way of moving this project forward -- 
> developing a prototype threading backend and developer API (for people like 
> me to use to develop libraries like the P