RecordBatch.length vs. Buffer.length?
Hello! I'm using the Java API for Arrow and am finding some ambiguity between the length field in a RecordBatch and the "byte-width-adjusted" length field in a Buffer. As per https://arrow.apache.org/docs/format/Metadata.html under the "Record data headers" section: "A record batch is a collection of top-level named, equal length Arrow arrays (or vectors)." This seems to correspond to org.apache.arrow.flatbuf.RecordBatch.length() when reading and VectorSchemaRoot.setRowCount() when writing. In addition to this field, each array buffer has its own specific length in bytes. As a library developer (particularly on the consumer side), what is the proper behavior when these two numbers don't match or when array lengths don't match each other? For example, I can use the ArrowFileWriter to create a two-column file where I setRowCount to 8, add 100 ints to the first column and 300 ints to the second column and everything seems to "work" fine even though this doesn't seem to be internally consistent. If these various length fields are supposed to correspond to each other / represent the same thing, then having two different accounts of the same value seems error-prone and ambiguous. Why does the format not exclusively use RecordBatch.length combined with each array's bitWidth? The product of the two seems like it should be equivalent to Buffer.length. As such, I think I must be missing something and am looking for more clarity on how to think about and process RecordBatch.length and Buffer.length (once I divide by bytesPerElement). Thanks.
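For what it's worth, the invariant a consumer can actually check here is arithmetic: for a fixed-width array, the data buffer must hold at least RecordBatch.length (the row count) times the type's byte width — but only a lower bound can be asserted, because the format allows buffers to carry trailing padding (to 8- or 64-byte boundaries). That also suggests why the over-filled writer example above appears to "work": to a consumer that trusts RecordBatch.length, the extra values are indistinguishable from padding. A stdlib Python sketch of the check (helper names are hypothetical, not Arrow API):

```python
def min_buffer_bytes(row_count: int, bit_width: int) -> int:
    """Ceiling of row_count * bit_width / 8: the fewest bytes a
    fixed-width data buffer needs to back row_count values."""
    return -(-(row_count * bit_width) // 8)

def buffer_consistent(row_count: int, bit_width: int, buffer_len: int) -> bool:
    """A consumer can only assert a lower bound: buffers may carry
    trailing padding (8/64-byte alignment), so extra bytes are legal."""
    return buffer_len >= min_buffer_bytes(row_count, bit_width)

# 8 rows of int32 need at least 32 bytes; a 64-byte (padded) buffer is fine:
print(buffer_consistent(8, 32, 64))    # → True
# ...but a buffer shorter than row_count * byte_width is corrupt:
print(buffer_consistent(100, 32, 32))  # → False
```

In other words, Buffer.length is not redundant with RecordBatch.length × bitWidth: equality cannot be assumed because of padding, which is presumably why both fields exist.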
Re: [VOTE] Add new DurationInterval Type to Arrow Format
I've just reviewed the format and C++ changes in https://github.com/apache/arrow/pull/3644 which look good to me modulo minor comments. Can someone take a look at the Java changes soon so we move this toward completion? One question came up of whether "DurationInterval" is the name we want. It might be more clear to call it simply "Duration" On Tue, Apr 30, 2019 at 6:57 PM Micah Kornfield wrote: > > Sorry for the type OK, I think https://github.com/apache/arrow/pull/3644 is > now ready to review. > > On Tue, Apr 30, 2019 at 4:56 PM Micah Kornfield > wrote: > > > OK, I think https://github.com/apache/arrow/pull/3644 is no ready to > > review. > > > > It includes Java implementation of DurationInterval and C++ > > implementations of DurationInterval and the original interval types. I > > added documentation to Schema.fbs regarding the original interval types > > (TL;DR; YEAR_MONTH is expected to be supported by all implementations > > DAY_TIME is not, which I believe as based on previous ML conversations). > > Please let me know if there are issues with this language and I can remove > > it. > > > > > > On Monday, April 8, 2019, Krisztián Szűcs > > wrote: > > > >> The vote carries with 4 binding +1 votes. > >> > >> Micah, what are the next steps? > >> Are You going to finalize the PR? > >> > >> On Sun, Apr 7, 2019 at 11:13 AM Uwe L. Korn wrote: > >> > >> > +1 (binding) > >> > > >> > On Sat, Apr 6, 2019, at 2:44 AM, Kouhei Sutou wrote: > >> > > +1 (binding) > >> > > > >> > > In >> p...@mail.gmail.com> > >> > > "[VOTE] Add new DurationInterval Type to Arrow Format" on Wed, 3 Apr > >> > > 2019 07:59:56 -0700, > >> > > Jacques Nadeau wrote: > >> > > > >> > > > I'd like to propose a change to the Arrow format to support a new > >> > duration > >> > > > type. Details below. Threads on mailing list around discussion. > >> > > > > >> > > > > >> > > > // An absolute length of time unrelated to any calendar artifacts. 
> >> > For the > >> > > > purposes > >> > > > /// of Arrow Implementations, adding this value to a Timestamp > >> ("t1") > >> > > > naively (i.e. simply summing > >> > > > /// the two number) is acceptable even though in some cases the > >> > resulting > >> > > > Timestamp (t2) would > >> > > > /// not account for leap-seconds during the elapsed time between > >> "t1" > >> > and > >> > > > "t2". Similarly, representing > >> > > > /// the difference between two Unix timestamp is acceptable, but > >> would > >> > > > yield a value that is possibly a few seconds > >> > > > /// off from the true elapsed time. > >> > > > /// > >> > > > /// The resolution defaults to > >> > > > /// millisecond, but can be any of the other supported TimeUnit > >> values > >> > as > >> > > > /// with Timestamp and Time types. This type is always represented > >> as > >> > > > /// an 8-byte integer. > >> > > > table DurationInterval { > >> > > >unit: TimeUnit = MILLISECOND; > >> > > > } > >> > > > > >> > > > > >> > > > Please vote whether to accept the changes. The vote will be open > >> > > > for at least 72 hours. > >> > > > > >> > > > [ ] +1 Accept these changes to the Flight protocol > >> > > > [ ] +0 > >> > > > [ ] -1 Do not accept the changes because... > >> > > > >> > > >> > >
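For readability, here is the proposed Schema.fbs addition reassembled from the quoted vote text above, with quote markers stripped and two small typos corrected ("summing the two numbers", "two Unix timestamps"):

```
/// An absolute length of time unrelated to any calendar artifacts. For the
/// purposes of Arrow implementations, adding this value to a Timestamp ("t1")
/// naively (i.e. simply summing the two numbers) is acceptable even though in
/// some cases the resulting Timestamp (t2) would not account for leap-seconds
/// during the elapsed time between "t1" and "t2". Similarly, representing the
/// difference between two Unix timestamps is acceptable, but would yield a
/// value that is possibly a few seconds off from the true elapsed time.
///
/// The resolution defaults to millisecond, but can be any of the other
/// supported TimeUnit values as with Timestamp and Time types. This type is
/// always represented as an 8-byte integer.
table DurationInterval {
  unit: TimeUnit = MILLISECOND;
}
```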
[jira] [Created] (ARROW-5257) [Website] Update site to use "official" Apache Arrow logo, add clearly marked links to logo
Wes McKinney created ARROW-5257: --- Summary: [Website] Update site to use "official" Apache Arrow logo, add clearly marked links to logo Key: ARROW-5257 URL: https://issues.apache.org/jira/browse/ARROW-5257 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Wes McKinney Fix For: 0.14.0 See logo at https://docs.google.com/presentation/d/1qmvPpFU7sdm9l6A6LEyI0zIzswGtJW0Sbd_lfHLaXQs/edit#slide=id.g4258234456_0_1 An unofficial logo lacking the "Apache" has been making the rounds on the internet, so I think it would be a good idea to update our web properties with the approved logo as discussed on the mailing list Whoever does this task -- please make sure to compress the PNG asset of the logo prior to checking in to source control -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5256) [Packaging][deb] Failed to build with LLVM 7.1.0
Kouhei Sutou created ARROW-5256: --- Summary: [Packaging][deb] Failed to build with LLVM 7.1.0 Key: ARROW-5256 URL: https://issues.apache.org/jira/browse/ARROW-5256 Project: Apache Arrow Issue Type: Improvement Components: C++ - Gandiva, Packaging Reporter: Kouhei Sutou

https://travis-ci.org/ursa-labs/crossbow/builds/527710714#L6144-L6157

{noformat}
CMake Error at cmake_modules/FindLLVM.cmake:33 (find_package):
  Could not find a configuration file for package "LLVM" that is compatible
  with requested version "7.0".

  The following configuration files were considered but not accepted:

    /usr/lib/llvm-7/cmake/LLVMConfig.cmake, version: 7.1.0
    /usr/lib/llvm-7/lib/cmake/llvm/LLVMConfig.cmake, version: 7.1.0
    /usr/lib/llvm-7/share/llvm/cmake/LLVMConfig.cmake, version: 7.1.0
    /usr/lib/llvm-3.8/share/llvm/cmake/LLVMConfig.cmake, version: 3.8.1
    /usr/share/llvm-3.8/cmake/LLVMConfig.cmake, version: 3.8.1

Call Stack (most recent call first):
  src/gandiva/CMakeLists.txt:31 (find_package)
{noformat}

Can we use "7" instead of "7.0" for {{ARROW_LLVM_VERSION}}?
Re: How about inet4/inet6/macaddr data types?
Sure, I've created https://issues.apache.org/jira/browse/ARROW-5255. PR: https://github.com/apache/arrow/pull/4251 I'm not sure if what I'm doing with my Vector subclass is quite right, but we'd especially like this in Java, so happy to work through any feedback. Also, as part of this discussion, I think the original C++ implementation noted that this metadata would not round-trip through Pandas. We would definitely like that feature if possible - maybe column-level metadata could be saved under a special field in the dataframe-level Pandas metadata? Best, David On 5/3/19, Wes McKinney wrote: > hi David -- would you like to open a PR and corresponding JIRA issue > for discussion? We might want to hold a vote to formalize the > extension type mechanism (and to fix the metadata names -- I agree > that having an ARROW namespace would be better than what we are doing > now) > > On Thu, May 2, 2019 at 7:02 AM David Li wrote: >> >> Re: Java support, I've sketched out an implementation: >> https://github.com/lihalite/arrow/pull/2 >> >> On 5/1/19, Micah Kornfield wrote: >> >> >> >> I'm awaiting community feedback about the approach to implementing >> >> extension types, whether the approach that I've used (using the >> >> following keys in custom_metadata [1]) is the one that we want to use >> >> longer-term. This certainly seems like a good time to have that >> >> discussion. If there is consensus then we can document it formally in >> >> the specification documents, and we probably will want to hold a vote >> >> to ensure that we are in agreement. >> >> >> > >> > Please let me know if this is best on a separate thread. I think I >> > would >> > feel more comfortable finalizing this if we had a few more examples >> > exercising the functionality. Inet, seems like a complicated enough >> > use-case for modeling which would make it a good use-case (It seems like >> > it >> > might involve a struct/union?). 
I also presume we will need a Java >> > implementation, before we finalize anything? >> > >> > A small amount of bikeshedding on key names: We should probably take a >> > namespace reservation approach for custom metadata in Schema.fbs [1]. >> > In >> > this regard I have a small preference for something reserving all >> > metadata >> > with something like "ARROW:" or "ARROW." (not an >> > underscore, and I'm open to different capitalization.) This seems to be >> > a >> > similar approach to how avro reserves metadata keys [2]. >> > >> > [1] >> > https://github.com/apache/arrow/blob/b8aeb79e94a5a507aeec55d0b6c6bf5d7f0100b2/format/Schema.fbs#L264 >> > [2] https://avro.apache.org/docs/1.8.1/spec.html >> > >
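The namespace-reservation idea being bikeshedded above can be sketched in a few lines (Python for brevity; the "ARROW:" prefix and the example key below are illustrative only — the exact convention was still under discussion):

```python
RESERVED_PREFIX = "ARROW:"   # illustrative; exact prefix/capitalization TBD

def is_reserved(key: str) -> bool:
    """True if a custom_metadata key falls in the reserved namespace."""
    return key.startswith(RESERVED_PREFIX)

def check_user_metadata(metadata: dict) -> None:
    """Reject application keys that collide with the reserved namespace."""
    for key in metadata:
        if is_reserved(key):
            raise ValueError(f"{key!r} is in the reserved {RESERVED_PREFIX!r} namespace")

# Extension-type information would then live under reserved keys, e.g. a
# hypothetical {"ARROW:extension:name": "inet4"}, while user metadata such
# as {"source": "pcap"} passes the check unchanged:
check_user_metadata({"source": "pcap"})
print(is_reserved("ARROW:extension:name"))  # → True
```

This mirrors the Avro approach cited in [2]: a single reserved prefix lets implementations validate and strip system metadata without enumerating every key.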
[jira] [Created] (ARROW-5255) [Java] Implement user-defined data types API
David Li created ARROW-5255: --- Summary: [Java] Implement user-defined data types API Key: ARROW-5255 URL: https://issues.apache.org/jira/browse/ARROW-5255 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: David Li
[jira] [Created] (ARROW-5254) [Flight][Java] DoAction does not support result streams
David Li created ARROW-5254: --- Summary: [Flight][Java] DoAction does not support result streams Key: ARROW-5254 URL: https://issues.apache.org/jira/browse/ARROW-5254 Project: Apache Arrow Issue Type: Bug Components: FlightRPC, Java Reporter: David Li Assignee: David Li Fix For: 0.14.0 While Flight defines DoAction as returning a stream of results, the Java APIs only allow returning a single result.
Re: How about inet4/inet6/macaddr data types?
hi David -- would you like to open a PR and corresponding JIRA issue for discussion? We might want to hold a vote to formalize the extension type mechanism (and to fix the metadata names -- I agree that having an ARROW namespace would be better than what we are doing now) On Thu, May 2, 2019 at 7:02 AM David Li wrote: > > Re: Java support, I've sketched out an implementation: > https://github.com/lihalite/arrow/pull/2 > > On 5/1/19, Micah Kornfield wrote: > >> > >> I'm awaiting community feedback about the approach to implementing > >> extension types, whether the approach that I've used (using the > >> following keys in custom_metadata [1]) is the one that we want to use > >> longer-term. This certainly seems like a good time to have that > >> discussion. If there is consensus then we can document it formally in > >> the specification documents, and we probably will want to hold a vote > >> to ensure that we are in agreement. > >> > > > > Please let me know if this is best on a separate thread. I think I would > > feel more comfortable finalizing this if we had a few more examples > > exercising the functionality. Inet, seems like a complicated enough > > use-case for modeling which would make it a good use-case (It seems like it > > might involve a struct/union?). I also presume we will need a Java > > implementation, before we finalize anything? > > > > A small amount of bikeshedding on key names: We should probably take a > > namespace reservation approach for custom metadata in Schema.fbs [1]. In > > this regard I have a small preference for something reserving all metadata > > with something like "ARROW:" or "ARROW." (not an > > underscore, and I'm open to different capitalization.) This seems to be a > > similar approach to how avro reserves metadata keys [2]. > > > > [1] > > https://github.com/apache/arrow/blob/b8aeb79e94a5a507aeec55d0b6c6bf5d7f0100b2/format/Schema.fbs#L264 > > [2] https://avro.apache.org/docs/1.8.1/spec.html > >
Re: PARQUET-1411 / PR 4185
No need for apologies, Wes, I appreciate you keeping this on your radar. I've made the changes and have pushed them to the PR branch. You can begin your review when you get the chance. --TPB On Thu, May 2, 2019 at 3:32 PM Wes McKinney wrote: > + Parquet dev list > > Thanks Tim for working on this issue, I'm sorry I haven't been able to > leave code review yet -- I've been busy with a bunch of other things > and, since it's a large patch, I wanted to give thoughtful feedback. > > Feel free to push some more commits to that PR. I can prioritize > getting you some feedback in the next couple of working days, just let > me know when you're ready for me to review. > > On Thu, May 2, 2019 at 5:23 PM TP Boudreau wrote: > > > > Hello Parquet-Arrow Team, Wes, > > > > A short while ago, I submitted PR 4185 ( > > https://github.com/apache/arrow/pull/4185) to implement in the C++ > library > > the new logical annotations metadata available in the latest > parquet.thrift > > spec (https://issues.apache.org/jira/browse/PARQUET-1411). I stopped > > committing to that PR's branch about a week ago to allow the code to be > > reviewed without it being a moving target. > > > > I've since (optimistically) starting blocking out new code for ARROW-3729 > > based on my open PR (switching Arrow to read/write the new Parquet > > annotations, https://issues.apache.org/jira/browse/ARROW-3729) and while > > doing that realized that usage of the annotations classes I created in > the > > open PR might be smoother with the introduction of a few convenience > > methods. However, the most suitable names for these methods (IMO) were > > introduced for another purpose in the open PR and would need to be > > reclaimed -- overall a fairly minor, non-structural change to the PR > code. 
> > > > I can either add another commit to the open PR to add these convenience > > methods and rename some things (which would be my preference, provided no > > one has invested too much time yet reviewing it -- maybe you have Wes?), > or > > I can wait for the first round of reviews on that PR to see where things > > stand. > > > > How should I proceed? > > > > Thanks in advance, > > --Tim >
RE: [DISCUSS][C++][Proposal] Threading engine for Arrow
"Malakhov, Anton" writes:

>> > the library creates threads internally. It's a disaster for managing
>> > oversubscription and affinity issues among groups of threads and/or
>> > multiple processes (e.g., MPI).
>
> This is exactly what I'm talking about referring as issues with threading
> composability! OpenMP is not easy to have inside a library. I described it in
> this document:
> https://cwiki.apache.org/confluence/display/ARROW/Parallel+Execution+Engine

Thanks for this document. I'm no great fan of OpenMP, but it's being billed by most vendors (especially Intel) as the way to go in the scientific computing space and has become relatively popular (much more so than TBB). You linked to a NumPy discussion (https://github.com/numpy/numpy/issues/11826) that is encountering the same issues, but proposing solutions based on the global environment. That is perhaps acceptable for typical Python callers due to the GIL, but C++ callers may be using threads themselves. A typical example:

App:
  calls libB sequentially:
    calls Arrow sequentially (wants to use threads)
  calls libC sequentially:
    omp parallel (creates threads somehow):
      calls Arrow from threads (Arrow should not create more)
  omp parallel:
    calls libD from threads:
      calls Arrow (Arrow should not create more)

Arrow doesn't need to know the difference between the libC and libD cases, but it may make a difference to the implementation of those libraries. In both of these cases, the user may desire that Arrow create tasks for load balancing reasons (but no new threads) so long as they can run on the specified thread team. I have yet to see a complete solution to this problem, but we should work out which modes are worth supporting and how that interface would look.
Global solutions like this one (linked by Antoine) https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/thread-pool.cc#L268 imply that threading mode is global and set via an environment variable, neither of which are true in cases such as the above (and many simpler cases).
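The composability problem described above can be demonstrated with a stdlib-only sketch (Python thread pools standing in for a C++ task scheduler; all names are illustrative). A "library" that spins up a private pool multiplies threads under an already-parallel application, while accepting a caller-provided executor caps the library at the application's chosen thread team:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def library_work(executor=None):
    """A 'library' entry point: runs 8 subtasks on the caller-provided
    executor if given, otherwise on a private pool (the MKL/BLAS-style
    pattern the thread above warns against)."""
    own = executor is None
    if own:
        executor = ThreadPoolExecutor(max_workers=4)
    try:
        futures = [executor.submit(threading.current_thread) for _ in range(8)]
        return {f.result().name for f in futures}  # names of worker threads used
    finally:
        if own:
            executor.shutdown()

app = ThreadPoolExecutor(max_workers=4)   # the application's own threads
team = ThreadPoolExecutor(max_workers=4)  # one explicitly shared thread team

# Private pools: every concurrent call spins up its own workers (oversubscription).
private = set().union(*app.map(lambda _: library_work(), range(4)))
# Shared team: the library creates tasks, never threads beyond the team.
shared = set().union(*app.map(lambda _: library_work(team), range(4)))
print(len(private) >= 4, len(shared) <= 4)  # → True True
```

The "shared team" variant is the semantic Jed is asking for: load-balancing tasks are still created, but thread creation stays under the application's control rather than being a global, environment-variable-driven decision.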
Re: ARROW-3191: Making ArrowBuf work with arbitrary memory and setting io.netty.tryReflectionSetAccessible to true for java builds
Hi Sidd, Does setting the system property io.netty.tryReflectionSetAccessible to true have any other adverse effect other than those warnings during build? Bryan On Thu, May 2, 2019 at 8:43 PM Jacques Nadeau wrote: > I'm onboard with this change. > > On Fri, Apr 26, 2019 at 2:14 AM Siddharth Teotia > wrote: > > > As part of working on this patch < > > https://github.com/apache/arrow/pull/4151>, > > I ran into a problem with jdk 9 and 11 builds. Since memory underlying > > ArrowBuf may not necessarily be a ByteBuf (or any of its extensions), > > methods like nioBuffer() can no longer be delegated as > > UnsafeDirectLittleEndian.nioBuffer() to Netty implementation. > > > > So I used PlatformDependent.directBuffer(memory address, size) to create > a > > direct Byte Buffer to closely mimic what Netty was originally doing > > underneath for nioBuffer(). It turns out that PlatformDependent code in > > netty first checks for the existence of constructor DirectByteBuffer(long > > address, int size) as seen here > > < > > > https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent0.java#L223 > > >. > > The constructor (long address, int size) is very well available in jdk > 8, 9 > > and 11 but on the next line it tries to set it accessible. The reflection > > based access is disabled by default in netty code for jdk >= 9 as seen > here > > < > > > https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent0.java#L829 > > >. > > Thus calls to PlatformDependent.directBuffer(address, size) were failing > in > > travis CI builds for JDK 9 and 11 with UnsupportedOperationException as > > seen here > > < > > > https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent.java#L415 > > > > > and > > this was because of the decision that was taken by netty at startup w.r.t > > whether to provide access to constructor or not. 
> > > > We should set io.netty.tryReflectionSetAccessible system property to true > > in java root pom > > > > I want to make sure people are aware and agree/disagree with this change. > > > > The tests now give the following warning: > > > > WARNING: An illegal reflective access operation has occurred > > WARNING: Illegal reflective access by > io.netty.util.internal.ReflectionUtil > > > > > (file:/Users/siddharthteotia/.m2/repository/io/netty/netty-common/4.1.22.Final/netty-common-4.1.22.Final.jar) > > to constructor java.nio.DirectByteBuffer(long,int) > > WARNING: Please consider reporting this to the maintainers of > > io.netty.util.internal.ReflectionUtil > > WARNING: Use --illegal-access=warn to enable warnings of further illegal > > reflective access operations > > WARNING: All illegal access operations will be denied in a future release > > > > Thanks. > > On Thu, Apr 18, 2019 at 3:39 PM Siddharth Teotia > > wrote: > > > > > I have made all the necessary changes in java code to work with new > > > ArrowBuf, ReferenceManager interfaces. More importantly, there is a > > wrapper > > > buffer NettyArrowBuf interface to comply with usage in RPC and Netty > > > related code. It will be good to get feedback on this one (and of > course > > > all other changes). As of now, the java modules build fine but I have > to > > > fix test failures. That is in progress. > > > > > > On Wed, Apr 17, 2019 at 6:41 AM Jacques Nadeau > > wrote: > > > > > >> Are there any other general comments here? If not, let's get this done > > and > > >> merged. > > >> > > >> On Mon, Apr 15, 2019, 4:19 PM Siddharth Teotia > > >> wrote: > > >> > > >> > I believe reader/writer indexes are typically used when we send > > buffers > > >> > over the wire -- so may not be necessary for all users of > ArrowBuf. I > > >> am > > >> > okay with the idea of providing a simple wrapper to ArrowBuf to > manage > > >> the > > >> > reader/writer indexes with a couple of APIs. 
Note that some APIs > like > > >> > writeInt, writeLong() bump the writer index unlike setInt/setLong > > >> > counterparts. JsonFileReader uses some of these APIs. > > >> > > > >> > > > >> > > > >> > On Sat, Apr 13, 2019 at 2:42 PM Jacques Nadeau > > >> wrote: > > >> > > > >> > > Hey Sidd, > > >> > > > > >> > > Thanks for pulling this together. This looks very promising. One > > quick > > >> > > thought: do we think the concept of the reader and writer index > need > > >> to > > >> > be > > >> > > on ArrowBuf? It seems like something that could be added as an > > >> additional > > >> > > decoration/wrapper when needed instead of being part of the core > > >> > structure. > > >> > > > > >> > > On Sat, Apr 13, 2019 at 11:26 AM Siddharth Teotia < > > >> siddha...@dremio.com> > > >> > > wrote: > > >> > > > > >> > > > Hi All, > > >> > > > > > >> > > > I have put a PR with WIP changes. All the major set of changes > > have > > >> > been > > >> > > > done to decouple the usage of ArrowBuf and reference management. > > The > > >> > > > ArrowBu
Re: [DISCUSS][C++][Proposal] Threading engine for Arrow
Le 03/05/2019 à 17:57, Jed Brown a écrit : > >>> The library is then free to use constructs like omp taskgroup/taskloop >>> as granularity warrants; it will never utilize threads that the >>> application didn't explicitly give it. >> >> I don't think we're planning to use OpenMP in Arrow, though Wes probably >> has a better answer. > > I was just using it to demonstrate the semantic. Regardless of what > Arrow uses internally, there will be a cohort of users who are > interested in using Arrow with OpenMP. I know next to nothing about OpenMP, but we have some code that's supposed to enable cooperation with OpenMP here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/thread-pool.cc#L268 If that doesn't work as intended, feel free to open an issue and describe the problem. Regards Antoine.
RE: [DISCUSS][C++][Proposal] Threading engine for Arrow
Thanks for your answers, > -Original Message- > From: Antoine Pitrou [mailto:anto...@python.org] > Sent: Friday, May 3, 2019 03:54 > Le 03/05/2019 à 05:47, Jed Brown a écrit : > > I would caution to please not commit to the MKL/BLAS model in which I'm actually talking about threading layers model where MKL supports several OpenMP runtimes (Intel, GNU, PGI) and TBB, as well as non-threaded version. It even supports dynamic selection, please refer to: https://software.intel.com/en-us/mkl-macos-developer-guide-dynamically-selecting-the-interface-and-threading-layer The same approach we implemented in Numba (#2245): https://numba.pydata.org/numba-doc/dev/user/threading-layer.html > > the library creates threads internally. It's a disaster for managing > > oversubscription and affinity issues among groups of threads and/or > > multiple processes (e.g., MPI). This is exactly what I'm talking about referring as issues with threading composability! OpenMP is not easy to have inside a library. I described it in this document: https://cwiki.apache.org/confluence/display/ARROW/Parallel+Execution+Engine > Implicit multi-threading is important for user-friendliness reasons > (especially in > higher-level bindings such as the Python-bindings). Cannot agree more! There might be not enough parallelism on the application level, adding parallelism from DSLs is important for better CPU utilization but it is also tricky because of these incompatibility issues. > > The library is then free to use constructs like omp taskgroup/taskloop > > as granularity warrants; it will never utilize threads that the > > application didn't explicitly give it. > > I don't think we're planning to use OpenMP in Arrow, though Wes probably has a > better answer. I'd not exclude OpenMP from the consideration completely. I want to start with TBB but nothing composes better with OpenMP as OpenMP itself. The same MKL (i.e. Numpy) defaults to OpenMP threading. 
BTW, there is no more compatibility layer between TBB and OpenMP, it was removed from the latter. > -Original Message- > From: Antoine Pitrou [mailto:anto...@python.org] > Sent: Friday, May 3, 2019 03:49 > > Another possibility is to look at our C++ CSV reader and parser (in > src/arrow/csv). It's the only piece of Arrow that uses non-trivial > multi-threading > right now (with tasks spawning new tasks dynamically, see > InferringColumnBuilder). It's based on the ThreadPool and TaskGroup APIs (in > src/arrow/util/). These APIs are not set in stone, so you're free to propose > changes to make them fit better with a TBB-based implementation. Great! This is what I was looking for! // Anton
Re: [DISCUSS][C++][Proposal] Threading engine for Arrow
Antoine Pitrou writes: > Hi Jed, > > Le 03/05/2019 à 05:47, Jed Brown a écrit : >> I would caution to please not commit to the MKL/BLAS model in which the >> library creates threads internally. It's a disaster for managing >> oversubscription and affinity issues among groups of threads and/or >> multiple processes (e.g., MPI). > > Implicit multi-threading is important for user-friendliness reasons > (especially in higher-level bindings such as the Python-bindings). I would argue that can be tucked into bindings versus making it all-or-nothing in the C++ interface. It's at least worthy of discussion. >> The library is then free to use constructs like omp taskgroup/taskloop >> as granularity warrants; it will never utilize threads that the >> application didn't explicitly give it. > > I don't think we're planning to use OpenMP in Arrow, though Wes probably > has a better answer. I was just using it to demonstrate the semantic. Regardless of what Arrow uses internally, there will be a cohort of users who are interested in using Arrow with OpenMP.
[jira] [Created] (ARROW-5253) [C++] external Snappy fails on Alpine
Francois Saint-Jacques created ARROW-5253: - Summary: [C++] external Snappy fails on Alpine Key: ARROW-5253 URL: https://issues.apache.org/jira/browse/ARROW-5253 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.13.0 Reporter: Francois Saint-Jacques Fix For: 0.14.0

{code:bash}
FAILED: debug/libarrow.so.14.0.0
: && /usr/bin/c++ -fPIC -Wno-noexcept-type -fdiagnostics-color=always -ggdb -O0 -Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror -msse4.2 -g -Wl,--version-script=/buildbot/amd64-alpine-3_9-cpp/cpp/src/arrow/symbols.map -shared -Wl,-soname,libarrow.so.14 -o debug/libarrow.so.14.0.0 ...
c++: error: snappy_ep/src/snappy_ep-install/lib/libsnappy.a: No such file or directory
{code}
[jira] [Created] (ARROW-5252) [C++] Change variant implementation
Antoine Pitrou created ARROW-5252: - Summary: [C++] Change variant implementation Key: ARROW-5252 URL: https://issues.apache.org/jira/browse/ARROW-5252 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.13.0 Reporter: Antoine Pitrou Our vendored variant implementation, [Mapbox variant|https://github.com/mapbox/variant], does not provide the same API as the official [C++17 variant class|https://en.cppreference.com/w/cpp/utility/variant]. We could / should switch to an implementation that follows the C++17 API, such as https://github.com/mpark/variant or https://github.com/martinmoene/variant-lite .
Re: [DISCUSS][C++][Proposal] Threading engine for Arrow
Hi Jed, Le 03/05/2019 à 05:47, Jed Brown a écrit : > I would caution to please not commit to the MKL/BLAS model in which the > library creates threads internally. It's a disaster for managing > oversubscription and affinity issues among groups of threads and/or > multiple processes (e.g., MPI). Implicit multi-threading is important for user-friendliness reasons (especially in higher-level bindings such as the Python-bindings). > The library is then free to use constructs like omp taskgroup/taskloop > as granularity warrants; it will never utilize threads that the > application didn't explicitly give it. I don't think we're planning to use OpenMP in Arrow, though Wes probably has a better answer. Regards Antoine.
Re: [DISCUSS][C++][Proposal] Threading engine for Arrow
Hi Anton, Another possibility is to look at our C++ CSV reader and parser (in src/arrow/csv). It's the only piece of Arrow that uses non-trivial multi-threading right now (with tasks spawning new tasks dynamically, see InferringColumnBuilder). It's based on the ThreadPool and TaskGroup APIs (in src/arrow/util/). These APIs are not set in stone, so you're free to propose changes to make them fit better with a TBB-based implementation. Regards Antoine. Le 03/05/2019 à 01:42, Malakhov, Anton a écrit : > Thanks Wes! > > Sounds like a good way to go! We'll create a demo, as you suggested, > implementing a parallel execution model for a simple analytics pipeline that > reads and processes the files. My only concern is about adding more pipeline > breaker nodes and compute intensive operations into this demo because min/max > are effectively no-ops fused into I/O scan node. What do you think about > adding group-by into this picture, effectively implementing NY taxi and/or > mortgage benchmarks? Ideally, I'd like to go even further and add sci-kit > learn-like stuff for processing that data in order to demonstrate the > co-existence side of the story. What do you think? > So, the idea of the prototype will be to validate the parallel execution > model as the first step. After that, it'll help to shape API for both - > execution nodes and the threading backend. Does it sound right to you? > > P.S. I can well understand your hesitation about using TBB directly and as > non-optional dependency, thus I'm suggesting threading layers approach here. > Please let me clarify myself, using TBB and nested parallelism is non-goal by > itself. The goal is to build components of efficient execution model, which > coexist well with each other and with all the other, external to Arrow, > components of an applications. However, without a rich, composable, and > mature parallel toolkit, it is hard to achieve and to focus on this goal. 
> Thus, I wanted to check with the community if it is an acceptable way at all > and what's the roadmap. > > Thanks, > // Anton > > > -Original Message- > From: Wes McKinney [mailto:wesmck...@gmail.com] > Sent: Thursday, May 2, 2019 13:52 > To: dev@arrow.apache.org > Subject: Re: [DISCUSS][C++][Proposal] Threading engine for Arrow > > hi Anton, > > Thank you for bringing your expertise to the project -- this is a very useful > discussion to have. > > Partly why our threading capabilities in the project are not further > developed is that there is not much that needs to be parallelized. It would > be like designing a supercharger when you don't have a car yet. > That being said, it is worthwhile to plan ahead so we aren't trying to > retrofit significant pieces of software to be able to take advantage of a > more advanced task scheduler. > > From my perspective, we have a few key practical areas of consideration: > > * Computational tasks that may offer nested parallelism (e.g. an Aggregation > or Projection task may be able to execution in multiple > threads) > * IO operations performed from within tasks that appear to be computational > in nature (example: in the course of reading a Parquet file, both computation > -- decoding, decompression -- and IO -- local or remote filesystem operations > -- must be performed). The status quo right now is that IO performed inside a > task in the thread pool is not releasing any resources to other tasks. > > I believe that we should design and develop a sane programming model / API > for implementing our software in the presence of these challenges. > If the backend / implementation of this API uses TBB and that makes things > more efficient than other approaches, then that sounds great to me. I would > be hesitant to use TBB APIs directly in Arrow application code unless it can > be clearly demonstrated by that is a superior option to alternatives. 
> > It seems useful to validate the implementation approach by starting with some > practical problems. Suppose, for the sake of argument, you want to read 10 > Parquet files (constituting a single logical dataset) as fast as possible and > perform some simple analytics on them -- let's take something very simple > like computing the maximum and minimum values of each column in the dataset. > This problem features both problems listed above: > > * Reading a single Parquet file can be parallelized (by columns -- since > columns can be decoded in parallel) on the global thread pool, so reading > multiple files in parallel would cause nested parallelism > * Within the context of reading a single Parquet file column, IO calls are > performed. CPU threads sit idle while this IO is taking place, particularly > if the file system is high latency (e.g. HDFS) > > What do you think about -- as a way of moving this project forward -- > developing a prototype threading backend and developer API (for people like > me to use to develop libraries like the P
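Wes's ten-file min/max example can be mocked up with stdlib Python to show exactly where the nested parallelism appears (CSV stands in for Parquet, thread pools stand in for Arrow's scheduler; every name here is illustrative, not a proposed API):

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

def column_minmax(rows, col):
    """Inner task: 'decode' one column and reduce it to (min, max)."""
    vals = [float(r[col]) for r in rows]
    return min(vals), max(vals)

def file_minmax(source):
    """Outer task: read one file-like source, then fan out one task per
    column -- this inner pool is where nested parallelism arises."""
    rows = list(csv.DictReader(source))
    cols = list(rows[0])
    with ThreadPoolExecutor() as inner:
        return dict(zip(cols, inner.map(lambda c: column_minmax(rows, c), cols)))

def dataset_minmax(sources):
    """One outer task per file; merge per-file stats into dataset stats."""
    with ThreadPoolExecutor() as outer:
        per_file = list(outer.map(file_minmax, sources))
    merged = {}
    for stats in per_file:
        for c, (lo, hi) in stats.items():
            plo, phi = merged.get(c, (lo, hi))
            merged[c] = (min(plo, lo), max(phi, hi))
    return merged

files = [io.StringIO("a,b\n1,10\n3,2\n"), io.StringIO("a,b\n-5,7\n2,20\n")]
print(dataset_minmax(files))  # → {'a': (-5.0, 3.0), 'b': (2.0, 20.0)}
```

With naive pools the inner executors oversubscribe the machine (files × columns workers), and any worker blocked on I/O idles a CPU thread — precisely the two problems a shared task scheduler with I/O-aware resource release would address.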