Re: OversizedAllocationException for pandas_udf in pyspark
Hi,

From the error it looks like this might potentially be some sort of integer overflow, but it is hard to say. Could you try to get a minimal reproduction of the error [1], and open a JIRA issue [2] with it?

Thanks,
Micah

[1] https://stackoverflow.com/help/mcve
[2] https://issues.apache.org

On Sunday, March 10, 2019, Abdeali Kothari wrote:
> Hi, any help on this would be much appreciated.
> I've not been able to figure out any reason for this to happen yet.
>
> On Sat, Mar 2, 2019, 11:50 Abdeali Kothari wrote:
> >
> > Hi Li Jin, thanks for the note.
> >
> > I get this error only for larger data - when I reduce the number of
> > records or the number of columns in my data it all works fine - so if it
> > is a binary incompatibility it should be something related to large data.
> > I am using Spark 2.3.1 on Amazon EMR for this testing.
> > https://github.com/apache/spark/blob/v2.3.1/pom.xml#L192 seems to
> > indicate the Arrow version is 0.8 for this.
> >
> > I installed pyarrow-0.8.0 in the Python environment on my cluster with
> > pip and I am still getting this error.
> > The stacktrace is very similar, just some lines moved in the pxi files:
> >
> > Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
> >   File "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0018/container_1551469777576_0018_01_02/pyspark.zip/pyspark/worker.py", line 230, in main
> >     process()
> >   File "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0018/container_1551469777576_0018_01_02/pyspark.zip/pyspark/worker.py", line 225, in process
> >     serializer.dump_stream(func(split_index, iterator), outfile)
> >   File "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0018/container_1551469777576_0018_01_02/pyspark.zip/pyspark/serializers.py", line 260, in dump_stream
> >     for series in iterator:
> >   File "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0018/container_1551469777576_0018_01_02/pyspark.zip/pyspark/serializers.py", line 279, in load_stream
> >     for batch in reader:
> >   File "pyarrow/ipc.pxi", line 268, in __iter__ (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:70278)
> >   File "pyarrow/ipc.pxi", line 284, in pyarrow.lib._RecordBatchReader.read_next_batch (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:70534)
> >   File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8345)
> > pyarrow.lib.ArrowIOError: read length must be positive or -1
> >
> > Other notes:
> > - My data is just integers, strings, and doubles. No complex types like arrays/maps/etc.
> > - I don't have any NULL/None values in my data
> > - Increasing executor-memory for Spark does not seem to help here
> >
> > As always: any thoughts or notes would be great so I can get some pointers in which direction to debug.
> >
> > On Sat, Mar 2, 2019 at 2:24 AM Li Jin wrote:
> > >
> > > The 2G limit that Uwe mentioned definitely exists; Spark serializes each
> > > group as a single RecordBatch currently.
> > >
> > > The "pyarrow.lib.ArrowIOError: read length must be positive or -1" is
> > > strange. I think Spark is on an older version on the Java side (0.10 for
> > > Spark 2.4 and 0.8 for Spark 2.3). I forgot whether there is binary
> > > incompatibility between these versions and pyarrow 0.12.
> > >
> > > On Fri, Mar 1, 2019 at 3:32 PM Abdeali Kothari <abdealikoth...@gmail.com> wrote:
> > >
> > > > Forgot to mention: the above testing is with 0.11.1.
> > > > I tried 0.12.1 as you suggested - and am getting the
> > > > OversizedAllocationException with the 80-char column, and getting "read
> > > > length must be positive or -1" without that. So both issues are
> > > > reproducible with pyarrow 0.12.1.
> > > >
> > > > On Sat, Mar 2, 2019 at 1:57 AM Abdeali Kothari <abdealikoth...@gmail.com> wrote:
> > > >
> > > > > That was spot on!
> > > > > I had 3 columns with 80 characters => 80 * 21*10^6 rows = 1.68 GB per column.
> > > > > I removed these columns and replaced each with 10 doubleType columns
> > > > > (so it would still be 80 bytes of data) - and this error didn't come
> > > > > up anymore.
> > > > > I also removed all the other columns and just kept 1 column with
> > > > > 80 characters - I got the error again.
> > > > >
> > > > > I'll make a simpler example and report it to Spark - as I guess these
> > > > > columns would need some special handling.
> > > > >
> > > > > Now, when I run - I get a different error:
> > > > > 19/03/01 20:16:49 WARN TaskSetManager: Lost task 108.0 in stage 8.0 (TID
> > > > > 12, ip-172-31-10-249.us-west-2.compute.internal, executor 1):
> > > > > org.apache.spark.api.python.PythonException: Traceback (most recent call last):
> > > > >   File
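For reference, a minimal sketch of the kind of reproduction being asked for. This is an illustration only, not the reporter's actual job: it assumes Spark 2.3's grouped-map pandas_udf and mirrors the numbers from the thread (21 million rows, one 80-character string column), so that a single group serializes to one Arrow record batch of roughly 1.68 GB.

{code:python}
# Hypothetical reproduction sketch - table and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()

df = (spark.range(21 * 10**6)
      .withColumn("key", lit(1))           # a single key -> a single group/record batch
      .withColumn("text", lit("x" * 80)))  # 80 bytes per row, ~1.68 GB in total

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def identity(pdf):
    # No-op transform; the failure happens while (de)serializing the batch.
    return pdf

df.groupby("key").apply(identity).count()
{code}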
[jira] [Created] (ARROW-4887) [GLib] Add garrow_array_count()
Kouhei Sutou created ARROW-4887:
-----------------------------------

Summary: [GLib] Add garrow_array_count()
Key: ARROW-4887
URL: https://issues.apache.org/jira/browse/ARROW-4887
Project: Apache Arrow
Issue Type: New Feature
Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
Fix For: 0.13.0
[jira] [Created] (ARROW-4886) [Rust] Inconsistent behaviour with casting sliced primitive array to list array
Neville Dipale created ARROW-4886:
-----------------------------------

Summary: [Rust] Inconsistent behaviour with casting sliced primitive array to list array
Key: ARROW-4886
URL: https://issues.apache.org/jira/browse/ARROW-4886
Project: Apache Arrow
Issue Type: Bug
Components: Rust
Affects Versions: 0.12.0
Reporter: Neville Dipale

[~csun] I was going through the C++ cast implementation to see if I've missed anything, and I noticed that ListCastKernel ([https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.cc#L665]) doesn't support casting non-zero-offset arrays. So I investigated what happens in Rust (ARROW-4865). I found an inconsistency where inheriting the incoming array's offset could lead us to read invalid data. I tried fixing it, but found that a buffer that I expected to be invalid was being returned as valid, yet held invalid data.

I've currently disabled casting primitive arrays to list arrays where the offset is not zero, and I'd like to wait for ARROW-4853 so I can see how sliced lists behave, and then fix this inconsistency. That might only happen in 0.14, so I'm fine with that.
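The offset pitfall is easy to demonstrate in pyarrow (an illustration of the same concept, not the Rust code in question): a sliced array shares its parent's buffers, so a kernel that copies buffers while ignoring the offset reads the wrong values.

{code:python}
import pyarrow as pa

parent = pa.array([1, 2, 3, 4, 5])
sliced = parent.slice(2)      # logically [3, 4, 5], but backed by the same buffers
assert sliced.offset == 2
assert sliced.to_pylist() == [3, 4, 5]
# A cast kernel that reads the value buffer from position 0, ignoring the
# offset, would see [1, 2, 3] - exactly the invalid-data inconsistency above.
{code}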
[jira] [Created] (ARROW-4885) [Python] read_csv() can't handle decimal128() columns
Diego Argueta created ARROW-4885:
-----------------------------------

Summary: [Python] read_csv() can't handle decimal128() columns
Key: ARROW-4885
URL: https://issues.apache.org/jira/browse/ARROW-4885
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.12.1
Environment: Python: 3.7.2, 2.7.15; PyArrow: 0.12.1; OS: MacOS 10.13.4 (High Sierra)
Reporter: Diego Argueta

h1. Summary

The CSV reader cannot produce {{Decimal128Type}} columns. The cause is that there's no converter listed [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/converter.cc#L301-L315]. I haven't tested it yet but I suspect adding the following line _might_ fix it:

{code:c++}
CONVERTER_CASE(Type::DECIMAL, NumericConverter<Decimal128Type>)
{code}

The current behavior:

{code}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_csv.pyx", line 397, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: CSV conversion to decimal(11, 2) is not supported
'CSV conversion to decimal(11, 2) is not supported'
{code}
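Until a converter is wired up, one possible workaround (a sketch; it assumes a hypothetical column named "price") is to force the column to string in ConvertOptions and build the decimal array afterwards:

{code:python}
import decimal
import pyarrow as pa
from pyarrow import csv

# Read the affected column as string so read_csv does not attempt (and fail)
# to produce decimal128 directly.
opts = csv.ConvertOptions(column_types={"price": pa.string()})
table = csv.read_csv("prices.csv", convert_options=opts)

values = [decimal.Decimal(v) if v is not None else None
          for v in table.column("price").to_pylist()]
prices = pa.array(values, type=pa.decimal128(11, 2))
{code}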
[jira] [Created] (ARROW-4884) [C++] conda-forge thrift-cpp package not available via pkg-config or cmake
Wes McKinney created ARROW-4884:
-----------------------------------

Summary: [C++] conda-forge thrift-cpp package not available via pkg-config or cmake
Key: ARROW-4884
URL: https://issues.apache.org/jira/browse/ARROW-4884
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Wes McKinney
Fix For: 0.13.0

This is an artifact of the CMake refactor. I opened https://github.com/conda-forge/thrift-cpp-feedstock/issues/35 about investigating why Thrift does not export the correct files.
[jira] [Created] (ARROW-4883) [Python] read_csv() gives mojibake if given file object in text mode
Diego Argueta created ARROW-4883:
-----------------------------------

Summary: [Python] read_csv() gives mojibake if given file object in text mode
Key: ARROW-4883
URL: https://issues.apache.org/jira/browse/ARROW-4883
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.12.1
Environment: Python: 3.7.2, 2.7.15; PyArrow: 0.12.1; OS: MacOS 10.13.6 (High Sierra)
Reporter: Diego Argueta

h1. Summary

Python 3:
* {{read_csv}} returns mojibake if given file objects opened in text mode. It behaves as expected in binary mode.
* Files encoded in anything other than valid UTF-8 will cause a crash.

Python 2: {{read_csv}} only handles ASCII files. If given a file in UTF-8 with characters over U+007F, it crashes.

h1. To reproduce

1) Create a CSV like this:

{code}
Header
123.45
{code}

2) Then run this code on Python 3:

{code:python}
>>> import pyarrow.csv as pa_csv
>>> pa_csv.read_csv(open('test.csv', 'r'))
pyarrow.Table
䧢: string
{code}

Notice the file descriptor is open in text mode. Changing the encoding doesn't help:

{code:python}
>>> pa_csv.read_csv(open('test.csv', 'r', encoding='utf-8'))
pyarrow.Table
䧢: string
>>> pa_csv.read_csv(open('test.csv', 'r', encoding='ascii'))
pyarrow.Table
䧢: string
>>> pa_csv.read_csv(open('test.csv', 'r', encoding='iso-8859-1'))
pyarrow.Table
䧢: string
{code}

If I open the file in binary mode it works:

{code:python}
>>> pa_csv.read_csv(open('test.csv', 'rb'))
pyarrow.Table
Header: double
{code}

I tried this with a file encoded in UTF-16 and it freaked out:

{code}
Traceback (most recent call last):
  File "/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py", line 84, in _process_text
    self._execute(line)
  File "/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py", line 139, in _execute
    result_str = '%s\n' % repr(result).decode('utf-8')
  File "pyarrow/table.pxi", line 960, in pyarrow.lib.Table.__repr__
  File "pyarrow/types.pxi", line 903, in pyarrow.lib.Schema.__str__
  File "/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/compat.py", line 143, in frombytes
    return o.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
{code}

Presumably this is because the code always assumes the file is in UTF-8.

h2. Python 2 behavior

Python 2 behaves differently -- it uses the ASCII codec by default, so when handed a file encoded in UTF-8, read_csv will return without an error. Try to access the table, though...

{code}
>>> t = pa_csv.read_csv(open('/Users/diegoargueta/Desktop/test.csv', 'r'))
>>> list(t)
Traceback (most recent call last):
  File "/Users/diegoargueta/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py", line 84, in _process_text
    self._execute(line)
  File "/Users/diegoargueta/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py", line 139, in _execute
    result_str = '%s\n' % repr(result).decode('utf-8')
  File "pyarrow/table.pxi", line 387, in pyarrow.lib.Column.__repr__
    result.write('\n{}'.format(str(self.data)))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 11: ordinal not in range(128)
'ascii' codec can't decode byte 0xe4 in position 11: ordinal not in range(128)
{code}

h1. Expectation

We should be able to hand read_csv() a file in text mode so that the CSV file can be in any text encoding.
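In the meantime, the safe pattern is to always hand read_csv a binary stream, transcoding to UTF-8 bytes first if the file is in another encoding. A sketch:

{code:python}
import io
from pyarrow import csv

# UTF-8 (or ASCII) file: open in binary mode.
table = csv.read_csv(open("test.csv", "rb"))

# Any other encoding, e.g. UTF-16: decode in Python, re-encode as UTF-8 bytes.
with open("utf16.csv", "r", encoding="utf-16") as f:
    table = csv.read_csv(io.BytesIO(f.read().encode("utf-8")))
{code}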
[jira] [Created] (ARROW-4882) Add "Count" and "Sum" functions
Yosuke Shiro created ARROW-4882:
-----------------------------------

Summary: Add "Count" and "Sum" functions
Key: ARROW-4882
URL: https://issues.apache.org/jira/browse/ARROW-4882
Project: Apache Arrow
Issue Type: New Feature
Components: GLib
Reporter: Yosuke Shiro
Assignee: Yosuke Shiro
Fix For: 0.13.0
[jira] [Created] (ARROW-4881) [Python] bundle_zlib CMake function still uses ARROW_BUILD_TOOLCHAIN
Wes McKinney created ARROW-4881:
-----------------------------------

Summary: [Python] bundle_zlib CMake function still uses ARROW_BUILD_TOOLCHAIN
Key: ARROW-4881
URL: https://issues.apache.org/jira/browse/ARROW-4881
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Wes McKinney
Fix For: 0.13.0

Not sure if our wheels work, but: https://github.com/apache/arrow/blob/master/python/CMakeLists.txt#L278
[jira] [Created] (ARROW-4880) [Python] python/asv-build.sh is probably broken after CMake refactor
Wes McKinney created ARROW-4880:
-----------------------------------

Summary: [Python] python/asv-build.sh is probably broken after CMake refactor
Key: ARROW-4880
URL: https://issues.apache.org/jira/browse/ARROW-4880
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Wes McKinney
Fix For: 0.14.0

Uses {{$ARROW_BUILD_TOOLCHAIN}}.
[jira] [Created] (ARROW-4879) [C++] cmake can't use conda's flatbuffers
Benjamin Kietzman created ARROW-4879:
--------------------------------------

Summary: [C++] cmake can't use conda's flatbuffers
Key: ARROW-4879
URL: https://issues.apache.org/jira/browse/ARROW-4879
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman
Fix For: 0.13.0

I'm using conda's flatbuffers, but after the cmake refactor I get the following error:

{code}
CMake Error at cmake_modules/ThirdpartyToolchain.cmake:146 (find_package):
  By not providing "FindFlatbuffers.cmake" in CMAKE_MODULE_PATH this project
  has asked CMake to find a package configuration file provided by
  "Flatbuffers", but CMake did not find one.

  Could not find a package configuration file provided by "Flatbuffers" with
  any of the following names:

    FlatbuffersConfig.cmake
    flatbuffers-config.cmake

  Add the installation prefix of "Flatbuffers" to CMAKE_PREFIX_PATH or set
  "Flatbuffers_DIR" to a directory containing one of the above files. If
  "Flatbuffers" provides a separate development package or SDK, be sure it
  has been installed.
{code}
[jira] [Created] (ARROW-4878) [C++] ARROW_DEPENDENCY_SOURCE=CONDA does not work properly with MSVC
Wes McKinney created ARROW-4878:
-----------------------------------

Summary: [C++] ARROW_DEPENDENCY_SOURCE=CONDA does not work properly with MSVC
Key: ARROW-4878
URL: https://issues.apache.org/jira/browse/ARROW-4878
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
Fix For: 0.13.0

The prefix must have {{\Library}} added to it.
[Rust] Table/DataFrame style API
Hi,

I have a PR open [1] to add a DataFrame/Table style API for building a logical query plan. So far I only added a couple of methods to it, but here is a usage example:

    let t = ctx.table("aggregate_test_100")?;
    let t2 = t
        .select_columns(vec!["c1", "c2", "c11"])?
        .limit(10)?;

This builds the same logical plan as "SELECT c1, c2, c11 FROM aggregate_test_100 LIMIT 10".

Adding more methods is mostly trivial, but I wanted to get this initial PR merged first so that I can add new methods as small separate PRs. I'd appreciate some reviews if anyone has the bandwidth.

Thanks,

Andy.

[1] https://github.com/apache/arrow/pull/3671
Re: Timeline for 0.13 Arrow release
Out of the open / in-progress issues still in the backlog:

C++-related: 25
C#: 2
CI-related: 2
Dev tools: 2
Docs: 4
Flight: 3
Packaging: 4
Python: 23 (14 tagged as bugs)
Ruby: 1
Rust: 14

I'm going to try to grind out as many issues as I can in the next few days, and at least get a sense of "how bad" some of the Python bugs are. If we want to release _next week_, many things are not going to get done. On the bug fixing, I think we should prioritize fixing regressions.

On Thu, Mar 14, 2019 at 10:33 AM Krisztián Szűcs wrote:
>
> Submitted the packaging builds:
> https://github.com/kszucs/crossbow/branches/all?utf8=%E2%9C%93=build-452
>
> On Thu, Mar 14, 2019 at 4:19 PM Wes McKinney wrote:
> >
> > The CMake refactor is merged! Kudos to Uwe for 3+ weeks of hard labor on this.
> >
> > We should run all the packaging tasks and get a full accounting of
> > what is broken so we aren't surprised during the release process.
> >
> > On Wed, Mar 13, 2019 at 9:39 AM Krisztián Szűcs wrote:
> > >
> > > The proof of the pudding is in the eating. You convinced me.
> > >
> > > On Wed, Mar 13, 2019 at 3:31 PM Wes McKinney wrote:
> > > >
> > > > Krisztian -- are you all right with proceeding with merging the CMake
> > > > refactor? I'm pretty committed to helping fix the problems that come
> > > > up. Since most consumers of the project don't test until _after_ a
> > > > release, we won't find out about some problems until we merge it and
> > > > release it. Thus, IMHO it doesn't make sense to wait another 8-10
> > > > weeks since we'd be delaying feedback for that long. There are also a
> > > > number of follow-on issues blocking on the refactor.
> > > >
> > > > On Tue, Mar 12, 2019 at 11:39 AM Andy Grove wrote:
> > > > >
> > > > > I've cleaned up my issues for Rust, moving most of them to 0.14.0.
> > > > >
> > > > > I have two PRs in progress that I would appreciate reviews on:
> > > > >
> > > > > https://github.com/apache/arrow/pull/3671 - [Rust] Table API (a.k.a. DataFrame)
> > > > >
> > > > > https://github.com/apache/arrow/pull/3851 - [Rust] Parquet data source in DataFusion
> > > > >
> > > > > Once these are merged I have some small follow-up PRs for 0.13.0 that I can get done this week.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Andy.
> > > > >
> > > > > On Tue, Mar 12, 2019 at 8:21 AM Wes McKinney wrote:
> > > > > >
> > > > > > hi folks,
> > > > > >
> > > > > > I think we are on track to be able to release toward the end of this
> > > > > > month. My proposed timeline:
> > > > > >
> > > > > > * This week (March 11-15): feature/improvement push mostly
> > > > > > * Next week (March 18-22): shift to bug fixes, stabilization, empty
> > > > > >   backlog of feature/improvement JIRAs
> > > > > > * Week of March 25: propose release candidate
> > > > > >
> > > > > > Does this seem reasonable? This puts us at about 9-10 weeks from 0.12.
> > > > > >
> > > > > > We need an RM for 0.13, any PMCs want to volunteer?
> > > > > >
> > > > > > Take a look at our release page:
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103091219
> > > > > >
> > > > > > Out of the open or in-progress issues, we have:
> > > > > >
> > > > > > * C#: 3 issues
> > > > > > * C++ (all components): 51 issues
> > > > > > * Java: 3 issues
> > > > > > * Python: 38 issues
> > > > > > * Rust (all components): 33 issues
> > > > > >
> > > > > > Please help curating the backlogs for each component. There's a
> > > > > > smattering of issues in other categories. There are also 10 open
> > > > > > issues with No Component (and 20 resolved issues), those need their
> > > > > > metadata fixed.
> > > > > >
> > > > > > Thanks,
> > > > > > Wes
> > > > > >
> > > > > > On Wed, Feb 27, 2019 at 1:49 PM Wes McKinney wrote:
> > > > > > >
> > > > > > > The timeline for the 0.13 release is drawing closer. I would say we
> > > > > > > should consider a release candidate either the week of March 18 or
> > > > > > > March 25, which gives us ~3 weeks to close out backlog items.
> > > > > > >
> > > > > > > There are around 220 issues open or in-progress in
> > > > > > >
> > > > > > > https://cwiki.apache.org/confluence/display/ARROW/Arrow+0.13.0+Release
> > > > > > >
> > > > > > > Please have a look. If issues are not assigned to someone as the next
> > > > > > > couple of weeks pass by I'll begin moving at least C++ and Python
> > > > > > > issues to 0.14 that don't seem like they're going to get done for
> > > > > > > 0.13. If development stakeholders for C#, Java, Rust, Ruby, and other
> > > > > > > components can review and curate the issues that would be helpful.
> > > > > > >
> > > > > > > You can help keep the JIRA issues tidy by making sure to add Fix
> > > > > > > Version to issues and to make sure to add a Component so that issues
> > > > > > > are properly categorized in the release notes.
[jira] [Created] (ARROW-4877) [Plasma] CI failure in test_plasma_list
Kouhei Sutou created ARROW-4877:
-----------------------------------

Summary: [Plasma] CI failure in test_plasma_list
Key: ARROW-4877
URL: https://issues.apache.org/jira/browse/ARROW-4877
Project: Apache Arrow
Issue Type: Bug
Components: C++ - Plasma, Continuous Integration, Python
Reporter: Kouhei Sutou
Fix For: 0.14.0

https://api.travis-ci.org/v3/job/506259901/log.txt

{noformat}
=================================== FAILURES ===================================
_______________________________ test_plasma_list _______________________________

    @pytest.mark.plasma
    def test_plasma_list():
        import pyarrow.plasma as plasma

        with plasma.start_plasma_store(
                plasma_store_memory=DEFAULT_PLASMA_STORE_MEMORY) \
                as (plasma_store_name, p):
            plasma_client = plasma.connect(plasma_store_name)

            # Test sizes
            u, _, _ = create_object(plasma_client, 11, metadata_size=7, seal=False)
            l1 = plasma_client.list()
            assert l1[u]["data_size"] == 11
            assert l1[u]["metadata_size"] == 7

            # Test ref_count
            v = plasma_client.put(np.zeros(3))
            l2 = plasma_client.list()
            # Ref count has already been released
            assert l2[v]["ref_count"] == 0
            a = plasma_client.get(v)
            l3 = plasma_client.list()
            assert l3[v]["ref_count"] == 1
            del a

            # Test state
            w, _, _ = create_object(plasma_client, 3, metadata_size=0, seal=False)
            l4 = plasma_client.list()
            assert l4[w]["state"] == "created"
            plasma_client.seal(w)
            l5 = plasma_client.list()
            assert l5[w]["state"] == "sealed"

            # Test timestamps
            t1 = time.time()
            x, _, _ = create_object(plasma_client, 3, metadata_size=0, seal=False)
            t2 = time.time()
            l6 = plasma_client.list()
>           assert math.floor(t1) <= l6[x]["create_time"] <= math.ceil(t2)
E           assert 1552568478 <= 1552568477
E            +  where 1552568478 = <built-in function floor>(1552568478.0022461)
E            +    where <built-in function floor> = math.floor

../../pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/tests/test_plasma.py:1070: AssertionError
---------------------------- Captured stderr call -----------------------------
I0314 13:01:17.901209 19953 store.cc:1093] Allowing the Plasma store to use up to 0.1GB of memory.
I0314 13:01:17.901417 19953 store.cc:1120] Starting object store with directory /dev/shm and huge page support disabled
{noformat}
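One possible adjustment, assuming the cause is that the store records create_time in whole seconds from its own clock (a guess, not a confirmed diagnosis - the failure shows create_time one second before floor(t1)):

{code:python}
import math
import time

t1 = time.time()
# ... create_object(plasma_client, 3, metadata_size=0, seal=False) ...
t2 = time.time()
create_time = int(t1)  # stand-in for l6[x]["create_time"], whole seconds

# Allow one second of slack on the lower bound, since the store truncates
# to whole seconds and may sample a slightly different clock than the test.
assert math.floor(t1) - 1 <= create_time <= math.ceil(t2)
{code}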
[jira] [Created] (ARROW-4876) Port MutableBuffer to csharp
Prashanth Govindarajan created ARROW-4876:
-------------------------------------------

Summary: Port MutableBuffer to csharp
Key: ARROW-4876
URL: https://issues.apache.org/jira/browse/ARROW-4876
Project: Apache Arrow
Issue Type: Task
Components: C#
Reporter: Prashanth Govindarajan

C++ has a "MutableBuffer" that exposes the underlying T*. Port it to csharp. It's an easy port: ArrowBuffer at the moment is exposed as ReadOnlyMemory<byte>. The builder actually hands it a Memory<byte> object, so it ought to be a simple change.
Re: Passing File Descriptors in the Low-Level API
hi Brian,

This is mostly an Arrow platform question so I'm copying the Arrow mailing list.

You can open a file using an existing file descriptor using ReadableFile::Open:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.h#L145

The documentation for this function says: "The file descriptor becomes owned by the ReadableFile, and will be closed on Close() or destruction."

If you want to do the equivalent thing, but using memory mapping, I think you'll need to add a corresponding API to MemoryMappedFile. This is more perilous because of the API requirements of mmap -- you need to pass the right flags, and they may need to be the same flags that were passed when opening the file descriptor. See

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L378 and
https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L476

- Wes

On Thu, Mar 14, 2019 at 1:47 PM Brian Bowman wrote:
>
> The ReadableFile class (arrow/io/file.cc) has utility methods where a
> FileDescriptor is either passed in or returned, but I don't see how this
> surfaces through the API.
>
> Is there a way for application code to control the open lifetime of mmap()'d
> Parquet files by passing an already open FileDescriptor to Parquet low-level
> API open/close methods?
>
> Thanks,
>
> Brian
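For completeness, a Python-side analogue of the same idea (a workaround sketch, not the C++ ReadableFile API under discussion): wrap an already-open descriptor in a Python file object and hand that to the Parquet reader, keeping the open lifetime under application control.

{code:python}
import os
import pyarrow.parquet as pq

fd = os.open("data.parquet", os.O_RDONLY)  # descriptor owned by the application
f = os.fdopen(fd, "rb")                    # f now owns fd; closing f closes fd
table = pq.ParquetFile(f).read()
f.close()
{code}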
RE: Publishing C# NuGet package
Thanks Wes. I have a PR up for this: https://github.com/apache/arrow/pull/3891

How do I update the wiki page? Is this source controlled somewhere? I assume we want to add a new section after https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingRubypackages for "Updating C# NuGet package".

I put the instructions for building and uploading the package in the csharp/README.md file in my PR. It should be as simple as:

1. Install the latest `.NET Core SDK` from https://dotnet.microsoft.com/download.
2. ~/git/arrow/csharp$ dotnet pack -c Release -p:VersionSuffix=''
3. Upload the .nupkg and .snupkg files from ~/git/arrow/csharp/artifacts/ to https://www.nuget.org/packages/manage/upload

Eric

-----Original Message-----
From: Wes McKinney
Sent: Tuesday, March 12, 2019 9:36 AM
To: dev@arrow.apache.org
Subject: Re: Publishing C# NuGet package

thanks Eric -- that sounds great. I think we're going to want to cut the 0.13 release candidate around 2 weeks from now, so that gives some time to get the packaging things sorted out

- Wes

On Thu, Mar 7, 2019 at 4:46 PM Eric Erhardt wrote:
>
> > Some changes may need to be made to the release scripts to update C#
> > metadata files. The intent is to make it so that the code artifact can be
> > pushed to a package manager using the official ASF release artifact. If we
> > don't get it 100% right for 0.13 then at least we can get a preliminary
> > package up there and do things 100% by the books in 0.14.
>
> The way you build a NuGet package is you call `dotnet pack` on the `.csproj`
> file. That will build the .NET assembly (.dll) and package it into a NuGet
> package (.nupkg, which is a glorified .zip file). That `.nupkg` file is then
> published to the nuget.org website.
>
> In order to publish it to nuget.org, an account will need to be made to
> publish it under. Is that something a PMC member can/will do? The intention
> is for the published package to be the official "Apache Arrow" NuGet package.
>
> The .nupkg file can optionally be signed. See
> https://docs.microsoft.com/en-us/nuget/create-packages/sign-a-package
>
> I can create a JIRA to add all the appropriate NuGet metadata to the .csproj
> in the repo. That way no file committed into the repo will need to change in
> order to create the NuGet package. I can also add the instructions to create
> the NuGet into the csharp README file in that PR.
[jira] [Created] (ARROW-4875) [C++] MSVC Boost warnings after CMake refactor on cmake 3.12
Wes McKinney created ARROW-4875:
-----------------------------------

Summary: [C++] MSVC Boost warnings after CMake refactor on cmake 3.12
Key: ARROW-4875
URL: https://issues.apache.org/jira/browse/ARROW-4875
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Wes McKinney
Fix For: 0.13.0

I haven't investigated if this was present before the refactor, but since we set {{Boost_ADDITIONAL_VERSIONS}}, in theory this "scary" warning should not show up:

{code}
CMake Warning at C:/Program Files (x86)/Microsoft Visual Studio/2017/Community/Common7/IDE/CommonExtensions/Microsoft/CMake/CMake/share/cmake-3.12/Modules/FindBoost.cmake:847 (message):
  New Boost version may have incorrect or missing dependencies and imported
  targets
Call Stack (most recent call first):
  C:/Program Files (x86)/Microsoft Visual Studio/2017/Community/Common7/IDE/CommonExtensions/Microsoft/CMake/CMake/share/cmake-3.12/Modules/FindBoost.cmake:959 (_Boost_COMPONENT_DEPENDENCIES)
  C:/Program Files (x86)/Microsoft Visual Studio/2017/Community/Common7/IDE/CommonExtensions/Microsoft/CMake/CMake/share/cmake-3.12/Modules/FindBoost.cmake:1618 (_Boost_MISSING_DEPENDENCIES)
  cmake_modules/ThirdpartyToolchain.cmake:1893 (find_package)
  CMakeLists.txt:536 (include)

CMake Warning at C:/Program Files (x86)/Microsoft Visual Studio/2017/Community/Common7/IDE/CommonExtensions/Microsoft/CMake/CMake/share/cmake-3.12/Modules/FindBoost.cmake:847 (message):
  New Boost version may have incorrect or missing dependencies and imported
  targets
Call Stack (most recent call first):
  C:/Program Files (x86)/Microsoft Visual Studio/2017/Community/Common7/IDE/CommonExtensions/Microsoft/CMake/CMake/share/cmake-3.12/Modules/FindBoost.cmake:959 (_Boost_COMPONENT_DEPENDENCIES)
  C:/Program Files (x86)/Microsoft Visual Studio/2017/Community/Common7/IDE/CommonExtensions/Microsoft/CMake/CMake/share/cmake-3.12/Modules/FindBoost.cmake:1618 (_Boost_MISSING_DEPENDENCIES)
  cmake_modules/ThirdpartyToolchain.cmake:1893 (find_package)
  CMakeLists.txt:536 (include)

CMake Warning at C:/Program Files (x86)/Microsoft Visual Studio/2017/Community/Common7/IDE/CommonExtensions/Microsoft/CMake/CMake/share/cmake-3.12/Modules/FindBoost.cmake:847 (message):
  New Boost version may have incorrect or missing dependencies and imported
  targets
Call Stack (most recent call first):
  C:/Program Files (x86)/Microsoft Visual Studio/2017/Community/Common7/IDE/CommonExtensions/Microsoft/CMake/CMake/share/cmake-3.12/Modules/FindBoost.cmake:959 (_Boost_COMPONENT_DEPENDENCIES)
  C:/Program Files (x86)/Microsoft Visual Studio/2017/Community/Common7/IDE/CommonExtensions/Microsoft/CMake/CMake/share/cmake-3.12/Modules/FindBoost.cmake:1618 (_Boost_MISSING_DEPENDENCIES)
  cmake_modules/ThirdpartyToolchain.cmake:1893 (find_package)
  CMakeLists.txt:536 (include)

-- Boost version: 1.69.0
-- Found the following Boost libraries:
--   regex
--   system
--   filesystem
{code}
[jira] [Created] (ARROW-4874) Cannot read parquet from encrypted hdfs
Jesse Lord created ARROW-4874:
-------------------------------

Summary: Cannot read parquet from encrypted hdfs
Key: ARROW-4874
URL: https://issues.apache.org/jira/browse/ARROW-4874
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.12.0
Environment: cloudera yarn cluster, red hat enterprise 7
Reporter: Jesse Lord

Using pyarrow 0.12 I was able to read parquet at first, and then the admins added KMS servers and encrypted all of the files on the cluster. Now I get an error, and the file system object can only read objects from the local file system of the edge node.

Reproducible example:

{code:python}
import pyarrow as pa

fs = pa.hdfs.connect()
with fs.open('/user/jlord/test_lots_of_parquet/', 'rb') as fil:
    _ = fil.read()
{code}

Error:

{noformat}
19/03/14 10:29:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hdfsOpenFile(/user/jlord/test_lots_of_parquet/): FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream;) error:
FileNotFoundException: File /user/jlord/test_lots_of_parquet does not exist
java.io.FileNotFoundException: File /user/jlord/test_lots_of_parquet does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:598)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:432)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:344)
Traceback (most recent call last):
  File "local_hdfs.py", line 15, in <module>
    with fs.open(file, 'rb') as fil:
  File "pyarrow/io-hdfs.pxi", line 431, in pyarrow.lib.HadoopFileSystem.open
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS file does not exist: /user/jlord/test_lots_of_parquet/
{noformat}

If I specify a specific parquet file in that folder I get the following error:

{noformat}
19/03/14 10:07:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hdfsOpenFile(/user/jlord/test_lots_of_parquet/part-0-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet): FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream;) error:
FileNotFoundException: File /user/jlord/test_lots_of_parquet/part-0-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet does not exist
java.io.FileNotFoundException: File /user/jlord/test_lots_of_parquet/part-0-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:598)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:432)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:344)
Traceback (most recent call last):
  File "local_hdfs.py", line 15, in <module>
    with fs.open(file, 'rb') as fil:
  File "pyarrow/io-hdfs.pxi", line 431, in pyarrow.lib.HadoopFileSystem.open
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS file does not exist: /user/jlord/test_lots_of_parquet/part-0-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet
{noformat}

Not sure if this is relevant: Spark can continue to read the parquet files, but it takes a Cloudera-specific version that can read the following KMS keys from the core-site.xml and hdfs-site.xml:

{code:xml}
<property>
  <name>dfs.encryption.key.provider.uri</name>
  <value>kms://ht...@server1.com;server2.com:16000/kms</value>
</property>
{code}

Using the open source version of Spark requires changing these xml values to:

{code:xml}
<property>
  <name>dfs.encryption.key.provider.uri</name>
  <value>kms://ht...@server1.com:16000/kms kms://ht...@server2.com:16000/kms</value>
</property>
{code}

Might need to point arrow to separate configuration xmls.
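One thing worth trying (a sketch, not a verified fix): recent pyarrow releases accept an extra_conf dict in hdfs.connect(), so the KMS key-provider setting can be passed explicitly in case libhdfs is not picking it up from the cluster XML files. The host, port, KMS URI, and file name below are placeholders.

{code:python}
import pyarrow as pa

fs = pa.hdfs.connect(
    host="namenode.example.com",  # placeholder
    port=8020,
    extra_conf={
        # placeholder KMS endpoint mirroring the core-site.xml value
        "dfs.encryption.key.provider.uri": "kms://https@server1.example.com:16000/kms",
    },
)
with fs.open("/user/jlord/test_lots_of_parquet/part-0.parquet", "rb") as f:  # placeholder
    data = f.read()
{code}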
Re: Timeline for 0.13 Arrow release
Submitted the packaging builds:
https://github.com/kszucs/crossbow/branches/all?utf8=%E2%9C%93=build-452

On Thu, Mar 14, 2019 at 4:19 PM Wes McKinney wrote:
>
> The CMake refactor is merged! Kudos to Uwe for 3+ weeks of hard labor on this.
>
> We should run all the packaging tasks and get a full accounting of
> what is broken so we aren't surprised during the release process.
>
> On Wed, Mar 13, 2019 at 9:39 AM Krisztián Szűcs wrote:
> >
> > The proof of the pudding is in the eating. You convinced me.
> >
> > On Wed, Mar 13, 2019 at 3:31 PM Wes McKinney wrote:
> > >
> > > Krisztian -- are you all right with proceeding with merging the CMake
> > > refactor? I'm pretty committed to helping fix the problems that come
> > > up. Since most consumers of the project don't test until _after_ a
> > > release, we won't find out about some problems until we merge it and
> > > release it. Thus, IMHO it doesn't make sense to wait another 8-10
> > > weeks since we'd be delaying feedback for that long. There are also a
> > > number of follow-on issues blocking on the refactor.
> > >
> > > On Tue, Mar 12, 2019 at 11:39 AM Andy Grove wrote:
> > > >
> > > > I've cleaned up my issues for Rust, moving most of them to 0.14.0.
> > > >
> > > > I have two PRs in progress that I would appreciate reviews on:
> > > >
> > > > https://github.com/apache/arrow/pull/3671 - [Rust] Table API (a.k.a. DataFrame)
> > > >
> > > > https://github.com/apache/arrow/pull/3851 - [Rust] Parquet data source in DataFusion
> > > >
> > > > Once these are merged I have some small follow-up PRs for 0.13.0 that I can get done this week.
> > > >
> > > > Thanks,
> > > >
> > > > Andy.
> > > >
> > > > On Tue, Mar 12, 2019 at 8:21 AM Wes McKinney wrote:
> > > > >
> > > > > hi folks,
> > > > >
> > > > > I think we are on track to be able to release toward the end of this
> > > > > month. My proposed timeline:
> > > > >
> > > > > * This week (March 11-15): feature/improvement push mostly
> > > > > * Next week (March 18-22): shift to bug fixes, stabilization, empty
> > > > >   backlog of feature/improvement JIRAs
> > > > > * Week of March 25: propose release candidate
> > > > >
> > > > > Does this seem reasonable? This puts us at about 9-10 weeks from 0.12.
> > > > >
> > > > > We need an RM for 0.13, any PMCs want to volunteer?
> > > > >
> > > > > Take a look at our release page:
> > > > >
> > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103091219
> > > > >
> > > > > Out of the open or in-progress issues, we have:
> > > > >
> > > > > * C#: 3 issues
> > > > > * C++ (all components): 51 issues
> > > > > * Java: 3 issues
> > > > > * Python: 38 issues
> > > > > * Rust (all components): 33 issues
> > > > >
> > > > > Please help curating the backlogs for each component. There's a
> > > > > smattering of issues in other categories. There are also 10 open
> > > > > issues with No Component (and 20 resolved issues), those need their
> > > > > metadata fixed.
> > > > >
> > > > > Thanks,
> > > > > Wes
> > > > >
> > > > > On Wed, Feb 27, 2019 at 1:49 PM Wes McKinney wrote:
> > > > > >
> > > > > > The timeline for the 0.13 release is drawing closer. I would say we
> > > > > > should consider a release candidate either the week of March 18 or
> > > > > > March 25, which gives us ~3 weeks to close out backlog items.
> > > > > >
> > > > > > There are around 220 issues open or in-progress in
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/ARROW/Arrow+0.13.0+Release
> > > > > >
> > > > > > Please have a look. If issues are not assigned to someone as the next
> > > > > > couple of weeks pass by I'll begin moving at least C++ and Python
> > > > > > issues to 0.14 that don't seem like they're going to get done for
> > > > > > 0.13. If development stakeholders for C#, Java, Rust, Ruby, and other
> > > > > > components can review and curate the issues that would be helpful.
> > > > > >
> > > > > > You can help keep the JIRA issues tidy by making sure to add Fix
> > > > > > Version to issues and to make sure to add a Component so that issues
> > > > > > are properly categorized in the release notes.
> > > > > >
> > > > > > Thanks
> > > > > > Wes
> > > > > >
> > > > > > On Sat, Feb 9, 2019 at 10:39 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > > >
> > > > > > > See https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide
> > > > > > >
> > > > > > > The source release step is one of the places where problems occur.
> > > > > > >
> > > > > > > On Sat, Feb 9, 2019, 10:33 AM
> > > > > > > >
> > > > > > > > > On Feb 8, 2019, at 9:19 AM, Uwe L. Korn wrote:
> > > > > > > > >
> > > > > > > > > We could dockerize some of the release steps to ensure that they run in the same environment.
> > > > > > > >
> > > > > > > > I may be able to help with said Dockerization. If not
[jira] [Created] (ARROW-4873) [C++] ARROW_DEPENDENCY_SOURCE should not be overridden to CONDA if ARROW_PACKAGE_PREFIX is set by user
Wes McKinney created ARROW-4873:
-----------------------------------

Summary: [C++] ARROW_DEPENDENCY_SOURCE should not be overridden to CONDA if ARROW_PACKAGE_PREFIX is set by user
Key: ARROW-4873
URL: https://issues.apache.org/jira/browse/ARROW-4873
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
Fix For: 0.13.0

I use conda to manage Python dependencies but keep my C++ toolchain in a separate directory. This organizational scheme is incompatible with the new options after the CMake refactor.

I think if you pass {{-DARROW_PREFIX_PATH=$MY_CPP_TOOLCHAIN}} then this should not be overridden with {{$CONDA_PREFIX}}.
[jira] [Created] (ARROW-4872) [Python] Keep backward compatibility for ParquetDatasetPiece
Krisztian Szucs created ARROW-4872:
------------------------------------

Summary: [Python] Keep backward compatibility for ParquetDatasetPiece
Key: ARROW-4872
URL: https://issues.apache.org/jira/browse/ARROW-4872
Project: Apache Arrow
Issue Type: Improvement
Reporter: Krisztian Szucs
Fix For: 0.13.0

See https://github.com/apache/arrow/commit/f2fb02b82b60ba9a90c8bad6e5b11e37fc3ea9d3#r32722497 and https://github.com/dask/dask/pull/4587
[jira] [Created] (ARROW-4870) ruby gemspec has wrong msys2 dependency listed
Dominic Sisneros created ARROW-4870:
-------------------------------------

Summary: ruby gemspec has wrong msys2 dependency listed
Key: ARROW-4870
URL: https://issues.apache.org/jira/browse/ARROW-4870
Project: Apache Arrow
Issue Type: Bug
Components: Ruby
Affects Versions: 0.12.1
Reporter: Dominic Sisneros
Fix For: 0.13.0

The ruby gemspec has the wrong msys2 dependency listed; change msys2_mingw_dependencies to the correct package:

pacman -Ss arrow
mingw32/mingw-w64-i686-arrow 0.11.1-1
    Apache Arrow is a cross-language development platform for in-memory data (mingw-w64)
mingw64/mingw-w64-x86_64-arrow 0.11.1-1 [installed]
    Apache Arrow is a cross-language development platform for in-memory data (mingw-w64)
Re: [Discuss][Java, Non-C++ generally] Support for 64-bit int array lengths?
hi Micah,

Given the constraints from Netty in Java, I would say that it makes sense to raise an exception if encountering a Field length exceeding 2^31 - 1 (I think there are already some checks, but we can add more checks during the IPC metadata read pass). With shared memory / zero copy in Java happening _eventually_ (https://issues.apache.org/jira/browse/ARROW-3191) this is becoming more of a realistic issue, since someone may produce a massive dataset and then try to read it in Java.

64-bit variable-size offsets (i.e. LargeList, LargeBinary / LargeString) are a different matter. A list or varbinary vector could have 64-bit offsets; that should not cause any issues. We need these in C++ to unblock some real-world use cases that embed large objects in Arrow data structures and read them from shared memory with zero copy. If an implementation is unable to read such huge data structures due to structural limitations, we need only document this.

- Wes

On Thu, Mar 14, 2019 at 4:41 AM Ravindra Pindikura wrote:
>
> @Jacques Nadeau would have more background on this.
> Here's my understanding:
>
> On Thu, Mar 14, 2019 at 12:08 PM Micah Kornfield wrote:
>
> > I was working on a proof-of-concept Java implementation for LargeList [1]
> > (64-bit array offsets). Our Java implementation doesn't appear to support
> > Vectors/Arrays larger than Integer.MAX_VALUE addressable space.
> >
> > It looks like Message.fbs was updated quite a while ago to support 64-bit
> > lengths/offsets [2]. I had some questions:
> >
> > 1. For Java:
> > * Is my assessment accurate that it doesn't support 64-bit ranged sizes?
>
> yes.
>
> > * Is there a desire to support the 64-bit sizes? (I didn't come across
> >   any JIRAs when I did a search)
>
> no, afaik.
>
> > * Is there a technical blocker for doing so?
>
> - big change
> - arrow uses the netty allocator. that also uses int (32-bit) for capacity.
>   https://netty.io/4.0/xref/io/netty/buffer/ByteBufAllocator.html#84
>
> > * Any thoughts on approach for doing such a large change (I'm mostly
> >   concerned with breaking existing consumers/performance regressions)?
> >   - Given that the Java code base appears relatively stable, it might be
> >     that forking and creating a version "2.0" is the best viable option.
> >
> > 2. For other language implementations, is there support for 64-bit sizes
> >    or only 32-bit?
> >
> > Thanks,
> > Micah
> >
> > P.S. It looks like our spec docs are out of date in regard to this issue;
> > they still list Int::MAX_VALUE as the largest possible array. It is on my
> > plate to update and consolidate them.
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-4810
> > [2] https://github.com/apache/arrow/commit/ced9d766d70e84c4d0542c6f5d9bd57faf10781d
>
> --
> Thanks and regards,
> Ravindra.
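To make the distinction concrete, here is a sketch using the LargeBinary type as it later appeared in pyarrow (sizes are illustrative): int32 offsets cap a values buffer at 2^31 - 1 bytes, while int64 offsets do not.

{code:python}
import pyarrow as pa

# 3000 one-MiB values -> a values buffer of ~3 GiB, which int32 offsets
# cannot address but int64 ("large") offsets can.
arr = pa.array([b"x" * (1 << 20)] * 3000, type=pa.large_binary())
assert arr.type == pa.large_binary()
{code}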
[jira] [Created] (ARROW-4871) [Flight][Java] Handle large Flight messages
David Li created ARROW-4871:
-----------------------------

Summary: [Flight][Java] Handle large Flight messages
Key: ARROW-4871
URL: https://issues.apache.org/jira/browse/ARROW-4871
Project: Apache Arrow
Issue Type: Bug
Components: FlightRPC, Java
Reporter: David Li
Assignee: David Li
Fix For: 0.14.0

Similarly to ARROW-4421, Java/gRPC needs to be configured to allow large messages. The integration tests should also be updated to cover this.
[jira] [Created] (ARROW-4869) [C++] Use of gmock fails in compute/kernels/util-internal-test.cc
Wes McKinney created ARROW-4869:
-----------------------------------

Summary: [C++] Use of gmock fails in compute/kernels/util-internal-test.cc
Key: ARROW-4869
URL: https://issues.apache.org/jira/browse/ARROW-4869
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Wes McKinney
Fix For: 0.13.0

Out-of-the-box build with:

{code}
cmake .. -DARROW_BUILD_TESTS=ON -DARROW_BUILD_BENCHMARKS=ON -DARROW_PARQUET=ON -DARROW_GANDIVA=ON -DARROW_FLIGHT=ON -DARROW_BOOST_VENDORED=ON
{code}

{code}
[ 57%] Building CXX object src/arrow/ipc/CMakeFiles/arrow-ipc-json-simple-test.dir/json-simple-test.cc.o
/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/util-internal-test.cc:152:3: error: reference to non-static member function must be called; did you mean to call it with no arguments?
  ((mock).gmock_out_type).InternalExpectedAt("/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/util-internal-test.cc", 152, "mock", "out_type").WillRepeatedly(Return(boolean()));
  ^~~ ()
/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/util-internal-test.cc:173:3: error: reference to non-static member function must be called; did you mean to call it with no arguments?
  ((mock).gmock_out_type).InternalExpectedAt("/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/util-internal-test.cc", 173, "mock", "out_type").WillRepeatedly(Return(int32()));
  ^~~ ()
/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/util-internal-test.cc:192:3: error: reference to non-static member function must be called; did you mean to call it with no arguments?
  ((mock).gmock_out_type).InternalExpectedAt("/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/util-internal-test.cc", 192, "mock", "out_type").WillRepeatedly(Return(boolean()));
  ^~~ ()
/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/util-internal-test.cc:213:3: error: reference to non-static member function must be called; did you mean to call it with no arguments?
  ((mock).gmock_out_type).InternalExpectedAt("/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/util-internal-test.cc", 213, "mock", "out_type").WillRepeatedly(Return(int32()));
  ^~~ ()
4 errors generated.
make[2]: *** [src/arrow/compute/kernels/CMakeFiles/arrow-compute-util-internal-test.dir/util-internal-test.cc.o] Error 1
make[1]: *** [src/arrow/compute/kernels/CMakeFiles/arrow-compute-util-internal-test.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs
{code}
[jira] [Created] (ARROW-4868) [C++][Gandiva] Build fails with system Boost on Ubuntu Trusty 14.04
Wes McKinney created ARROW-4868:
-----------------------------------

Summary: [C++][Gandiva] Build fails with system Boost on Ubuntu Trusty 14.04
Key: ARROW-4868
URL: https://issues.apache.org/jira/browse/ARROW-4868
Project: Apache Arrow
Issue Type: Bug
Components: C++ - Gandiva
Reporter: Wes McKinney
Fix For: 0.14.0

It would be nice for things to work out of the box, but maybe it's not worth it. I can use vendored Boost for now.

{code}
/usr/include/boost/functional/hash/extensions.hpp:269:20: error: no matching function for call to 'hash_value'
        return hash_value(val);
               ^~
/usr/include/boost/functional/hash/hash.hpp:249:17: note: in instantiation of member function 'boost::hash >::operator()' requested here
            seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
                    ^
/home/wesm/code/arrow/cpp/src/gandiva/filter_cache_key.h:40:12: note: in instantiation of function template specialization 'boost::hash_combine >' requested here
    boost::hash_combine(result, configuration);
    ^
/usr/include/boost/functional/hash/extensions.hpp:70:17: note: candidate template ignored: could not match 'pair' against 'shared_ptr'
    std::size_t hash_value(std::pair const& v)
                ^
/usr/include/boost/functional/hash/extensions.hpp:79:17: note: candidate template ignored: could not match 'vector' against 'shared_ptr'
    std::size_t hash_value(std::vector const& v)
                ^
/usr/include/boost/functional/hash/extensions.hpp:85:17: note: candidate template ignored: could not match 'list' against 'shared_ptr'
    std::size_t hash_value(std::list const& v)
                ^
/usr/include/boost/functional/hash/extensions.hpp:91:17: note: candidate template ignored: could not match 'deque' against 'shared_ptr'
    std::size_t hash_value(std::deque const& v)
                ^
/usr/include/boost/functional/hash/extensions.hpp:97:17: note: candidate template ignored: could not match 'set' against 'shared_ptr'
    std::size_t hash_value(std::set const& v)
                ^
/usr/include/boost/functional/hash/extensions.hpp:103:17: note: candidate template ignored: could not match 'multiset' against 'shared_ptr'
    std::size_t hash_value(std::multiset const& v)
                ^
/usr/include/boost/functional/hash/extensions.hpp:109:17: note: candidate template ignored: could not match 'map' against 'shared_ptr'
    std::size_t hash_value(std::map const& v)
                ^
/usr/include/boost/functional/hash/extensions.hpp:115:17: note: candidate template ignored: could not match 'multimap' against 'shared_ptr'
    std::size_t hash_value(std::multimap const& v)
                ^
/usr/include/boost/functional/hash/extensions.hpp:121:17: note: candidate template ignored: could not match 'complex' against 'shared_ptr'
    std::size_t hash_value(std::complex const& v)
                ^
/usr/include/boost/functional/hash/hash.hpp:187:57: note: candidate template ignored: substitution failure [with T = std::shared_ptr]: no type named 'type' in 'boost::hash_detail::basic_numbers >'
    typename boost::hash_detail::basic_numbers::type hash_value(T v)
                                                     ^
/usr/include/boost/functional/hash/hash.hpp:193:56: note: candidate template ignored: substitution failure [with T = std::shared_ptr]: no type named 'type' in 'boost::hash_detail::long_numbers >'
    typename boost::hash_detail::long_numbers::type hash_value(T v)
                                                    ^
/usr/include/boost/functional/hash/hash.hpp:199:57: note: candidate template ignored: substitution failure [with T = std::shared_ptr]: no type named 'type' in 'boost::hash_detail::ulong_numbers >'
    typename boost::hash_detail::ulong_numbers::type hash_value(T v)
                                                     ^
/usr/include/boost/functional/hash/hash.hpp:205:31: note: candidate template ignored: disabled by 'enable_if' [with T = std::shared_ptr]
    typename boost::enable_if, std::size_t>::type
                              ^
/usr/include/boost/functional/hash/hash.hpp:213:36: note: candidate template ignored: could not match 'T *const' against 'const std::shared_ptr'
    template std::size_t hash_value(T* const& v)
                                    ^
/usr/include/boost/functional/hash/hash.hpp:306:24: note: candidate template ignored: could not match 'const T [N]' against 'const std::shared_ptr'
    inline std::size_t hash_value(const T ()[N])
                       ^
/usr/include/boost/functional/hash/hash.hpp:312:24: note: candidate template ignored: could not match 'T [N]' against 'const std::shared_ptr'
    inline std::size_t hash_value(T ()[N])
                       ^
/usr/include/boost/functional/hash/hash.hpp:319:24: note: candidate template ignored: could not match 'basic_string'
{code}
[jira] [Created] (ARROW-4866) [C++] zstd ExternalProject failing on Windows
Uwe L. Korn created ARROW-4866:
--------------------------------

Summary: [C++] zstd ExternalProject failing on Windows
Key: ARROW-4866
URL: https://issues.apache.org/jira/browse/ARROW-4866
Project: Apache Arrow
Issue Type: Bug
Components: C++, Packaging
Reporter: Uwe L. Korn
Fix For: 0.13.0

After https://github.com/apache/arrow/pull/3885, the zstd ExternalProject is failing in the Windows builds; see https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/23063072/job/bd0gom16atlkddtx
Call for presentations - ApacheCon North America
Hi Apache developers,

(apologies if you're receiving this e-mail multiple times on different dev@ lists)

I'd like to draw your attention to the call for presentations that is now open for ApacheCon North America 2019 -- marking the 20-year anniversary of the ASF -- which will be held in Las Vegas this September.

I'm sure that you're aware of the ever-increasing need for integrating different software systems, be it cloud/SaaS offerings or in-house custom applications and databases. I find the field of system integration, and the open source projects at the ASF that help with it, a great topic of interest. So I've selected, to the best of my ability, your project as one that is applicable in this area.

I'm chairing a whole-day Integration track focusing on system integration projects at the ASF, and I would like to invite you to talk about your project. I think this is a great venue to present and get in touch with the wider ASF community. I'm specifically looking forward to talks dealing with the state of the project and customer success stories.

So please submit your proposals at: https://apachecon.com/acna19/index.html (click on the "IS NOW OPEN!" link). The submission deadline is Monday, May 13.

zoran
--
Zoran Regvart
Re: [Discuss][Java, Non-C++ generally] Support for 64-bit int array lengths?
@Jacques Nadeau would have more background on this.
Here's my understanding:

On Thu, Mar 14, 2019 at 12:08 PM Micah Kornfield wrote:

> I was working on a proof-of-concept Java implementation for LargeList [1]
> (64-bit array offsets). Our Java implementation doesn't appear to support
> Vectors/Arrays larger than Integer.MAX_VALUE addressable space.
>
> It looks like Message.fbs was updated quite a while ago to support 64-bit
> lengths/offsets [2]. I had some questions:
>
> 1. For Java:
> * Is my assessment accurate that it doesn't support 64-bit ranged sizes?

yes.

> * Is there a desire to support the 64-bit sizes? (I didn't come across
>   any JIRAs when I did a search)

no, afaik.

> * Is there a technical blocker for doing so?

- big change
- arrow uses the netty allocator. that also uses int (32-bit) for capacity.
  https://netty.io/4.0/xref/io/netty/buffer/ByteBufAllocator.html#84

> * Any thoughts on approach for doing such a large change (I'm mostly
>   concerned with breaking existing consumers/performance regressions)?
>   - Given that the Java code base appears relatively stable, it might be
>     that forking and creating a version "2.0" is the best viable option.
>
> 2. For other language implementations, is there support for 64-bit sizes
>    or only 32-bit?
>
> Thanks,
> Micah
>
> P.S. It looks like our spec docs are out of date in regard to this issue;
> they still list Int::MAX_VALUE as the largest possible array. It is on my
> plate to update and consolidate them.
>
> [1] https://issues.apache.org/jira/browse/ARROW-4810
> [2] https://github.com/apache/arrow/commit/ced9d766d70e84c4d0542c6f5d9bd57faf10781d

--
Thanks and regards,
Ravindra.
[jira] [Created] (ARROW-4865) [Rust] Support casting lists and primitives to lists
Neville Dipale created ARROW-4865:
-----------------------------------

Summary: [Rust] Support casting lists and primitives to lists
Key: ARROW-4865
URL: https://issues.apache.org/jira/browse/ARROW-4865
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Affects Versions: 0.12.0
Reporter: Neville Dipale

This adds support for casting between list arrays and from primitive arrays to single-value list arrays.
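A pyarrow analogue of the single-value-list conversion (an illustration of the semantics, not the Rust implementation): each primitive value becomes a length-one list, so the offsets are simply 0..n.

{code:python}
import pyarrow as pa

values = pa.array([1, 2, 3], type=pa.int32())
offsets = pa.array(list(range(len(values) + 1)), type=pa.int32())  # [0, 1, 2, 3]
lists = pa.ListArray.from_arrays(offsets, values)
assert lists.to_pylist() == [[1], [2], [3]]
{code}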
[Discuss][Java, Non-C++ generally] Support for 64-bit int array lengths?
I was working on a proof-of-concept Java implementation for LargeList [1] (64-bit array offsets). Our Java implementation doesn't appear to support Vectors/Arrays larger than Integer.MAX_VALUE addressable space.

It looks like Message.fbs was updated quite a while ago to support 64-bit lengths/offsets [2]. I had some questions:

1. For Java:
* Is my assessment accurate that it doesn't support 64-bit ranged sizes?
* Is there a desire to support the 64-bit sizes? (I didn't come across any JIRAs when I did a search)
* Is there a technical blocker for doing so?
* Any thoughts on approach for doing such a large change? (I'm mostly concerned with breaking existing consumers/performance regressions.)
  - Given that the Java code base appears relatively stable, it might be that forking and creating a version "2.0" is the best viable option.

2. For other language implementations, is there support for 64-bit sizes or only 32-bit?

Thanks,
Micah

P.S. It looks like our spec docs are out of date in regard to this issue; they still list Int::MAX_VALUE as the largest possible array. It is on my plate to update and consolidate them.

[1] https://issues.apache.org/jira/browse/ARROW-4810
[2] https://github.com/apache/arrow/commit/ced9d766d70e84c4d0542c6f5d9bd57faf10781d