Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-14 Thread Micah Kornfield
Hi,
From the error it looks like this might potentially be some sort of
integer overflow, but it is hard to say. Could you try to get a minimal
reproduction of the error [1], and open a JIRA issue [2] with it?
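
For example, here is a minimal-reproduction sketch of the kind that would
help (this assumes a local Spark 2.3.x session with Arrow enabled; the
sizes and column names are only illustrative, chosen so that a single
group serializes to a RecordBatch approaching the 2GB limit):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = (SparkSession.builder
             .config("spark.sql.execution.arrow.enabled", "true")
             .getOrCreate())

    # ~21 million rows of an 80-character string, all in one group
    # (roughly 1.68e9 bytes of string data).
    df = spark.range(21 * 10**6).selectExpr(
        "id % 1 AS key", "repeat('x', 80) AS s")

    @pandas_udf("key long, n long", PandasUDFType.GROUPED_MAP)
    def count_rows(pdf):
        return pdf.groupby("key").size().reset_index(name="n")

    df.groupby("key").apply(count_rows).show()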

Thanks,
Micah

[1] https://stackoverflow.com/help/mcve
[2] https://issues.apache.org

On Sunday, March 10, 2019, Abdeali Kothari  wrote:

> Hi, any help on this would be much appreciated.
> I've not been able to figure out any reason for this to happen yet
>
> On Sat, Mar 2, 2019, 11:50 Abdeali Kothari 
> wrote:
>
> > Hi Li Jin, thanks for the note.
> >
> > I get this error only for larger data - when I reduce the number of
> > records or the number of columns in my data it all works fine - so if it
> is
> > binary incompatibility it should be something related to large data.
> > I am using Spark 2.3.1 on Amazon EMR for this testing.
> > https://github.com/apache/spark/blob/v2.3.1/pom.xml#L192 seems to
> > indicate arrow version is 0.8 for this.
> >
> > I installed pyarrow-0.8.0 in the python environment on my cluster with
> pip
> > and I am still getting this error.
> > The stacktrace is very similar, just some lines moved in the pxi files:
> >
> > Caused by: org.apache.spark.api.python.PythonException: Traceback (most
> > recent call last):
> >   File
> >
> "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0018/container_1551469777576_0018_01_02/pyspark.zip/pyspark/worker.py",
> > line 230, in main
> > process()
> >   File
> >
> "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0018/container_1551469777576_0018_01_02/pyspark.zip/pyspark/worker.py",
> > line 225, in process
> > serializer.dump_stream(func(split_index, iterator), outfile)
> >   File
> >
> "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0018/container_1551469777576_0018_01_02/pyspark.zip/pyspark/serializers.py",
> > line 260, in dump_stream
> > for series in iterator:
> >   File
> >
> "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0018/container_1551469777576_0018_01_02/pyspark.zip/pyspark/serializers.py",
> > line 279, in load_stream
> > for batch in reader:
> >   File "pyarrow/ipc.pxi", line 268, in __iter__
> > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:70278)
> >   File "pyarrow/ipc.pxi", line 284, in
> > pyarrow.lib._RecordBatchReader.read_next_batch
> > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:70534)
> >   File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status
> > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8345)
> > pyarrow.lib.ArrowIOError: read length must be positive or -1
> >
> > Other notes:
> >  - My data is just integers, strings, and doubles. No complex types like
> > arrays/maps/etc.
> >  - I don't have any NULL/None values in my data
> >  - Increasing executor-memory for spark does not seem to help here
> >
> > As always: Any thoughts or notes would be great so I can get some
> pointers
> > in which direction to debug
> >
> >
> >
> > On Sat, Mar 2, 2019 at 2:24 AM Li Jin  wrote:
> >
> >> The 2G limit that Uwe mentioned definitely exists; Spark currently serializes
> >> each group as a single RecordBatch.
> >>
> >> The "pyarrow.lib.ArrowIOError: read length must be positive or -1" is
> >> strange, I think Spark is on an older version of the Java side (0.10 for
> >> Spark 2.4 and 0.8 for Spark 2.3). I forgot whether there is binary
> >> incompatibility between these versions and pyarrow 0.12.
> >>
> >> On Fri, Mar 1, 2019 at 3:32 PM Abdeali Kothari <
> abdealikoth...@gmail.com>
> >> wrote:
> >>
> >> > Forgot to mention: The above testing is with 0.11.1
> >> > I tried 0.12.1 as you suggested - and am getting the
> >> > OversizedAllocationException with the 80char column. And getting read
> >> > length must be positive or -1 without that. So, both the issues are
> >> > reproducible with pyarrow 0.12.1
> >> >
> >> > On Sat, Mar 2, 2019 at 1:57 AM Abdeali Kothari <
> >> abdealikoth...@gmail.com>
> >> > wrote:
> >> >
> >> > > That was spot on!
> >> > > I had 3 columns with 80 characters => 80*21*10^6 bytes ≈ 1.56 GB
> >> > > I removed these columns and replaced each with 10 doubleType columns
> >> (so
> >> > > it would still be 80 bytes of data) - and this error didn't come up
> >> > anymore.
> >> > > I also removed all the other columns and just kept 1 column with
> >> > > 80 characters - I got the error again.
> >> > >
> >> > > I'll make a simpler example and report it to spark - as I guess
> these
> >> > > columns would need some special handling.
> >> > >
> >> > > Now, when I run - I get a different error:
> >> > > 19/03/01 20:16:49 WARN TaskSetManager: Lost task 108.0 in stage 8.0
> >> (TID
> >> > > 12, ip-172-31-10-249.us-west-2.compute.internal, executor 1):
> >> > > org.apache.spark.api.python.PythonException: Traceback (most recent
> >> call
> >> > > last):
> >> > >   File

[jira] [Created] (ARROW-4887) [GLib] Add garrow_array_count()

2019-03-14 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-4887:
---

 Summary: [GLib] Add garrow_array_count()
 Key: ARROW-4887
 URL: https://issues.apache.org/jira/browse/ARROW-4887
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
 Fix For: 0.13.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4886) [Rust] Inconsistent behaviour with casting sliced primitive array to list array

2019-03-14 Thread Neville Dipale (JIRA)
Neville Dipale created ARROW-4886:
-

 Summary: [Rust] Inconsistent behaviour with casting sliced 
primitive array to list array
 Key: ARROW-4886
 URL: https://issues.apache.org/jira/browse/ARROW-4886
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 0.12.0
Reporter: Neville Dipale


[~csun] I was going through the C++ cast implementation to see if I've missed 
anything, and I noticed that ListCastKernel 
([https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.cc#L665])
 doesn't support casting non-zero-offset arrays. So I investigated what happens 
in Rust ARROW-4865. I found an inconsistency where inheriting the incoming 
array's offset could lead us to read invalid data.

I tried fixing it, but found that a buffer that I expected to be invalid was 
being returned as valid, but returning invalid data.

I've currently disabled casting primitive arrays to list arrays where the offset is not 
zero, and I'd like to wait for ARROW-4853 so I can see how sliced lists behave, 
and fix this inconsistency. That might only happen in 0.14, so I'm fine with 
that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4885) [Python] read_csv() can't handle decimal128() columns

2019-03-14 Thread Diego Argueta (JIRA)
Diego Argueta created ARROW-4885:


 Summary: [Python] read_csv() can't handle decimal128() columns
 Key: ARROW-4885
 URL: https://issues.apache.org/jira/browse/ARROW-4885
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.12.1
 Environment: Python: 3.7.2, 2.7.15
PyArrow: 0.12.1
OS: MacOS 10.13.4 (High Sierra)
Reporter: Diego Argueta


h1. Summary

CSV cannot use {{Decimal128Type}}. The cause is that there's no converter 
listed 
[here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/converter.cc#L301-L315].
 I haven't tested it yet but I suspect adding the following line _might_ fix it:

{code:c++}
CONVERTER_CASE(Type::DECIMAL, NumericConverter<Decimal128Type>);
{code}

Attempting to read a CSV into a decimal column currently fails with:

{code}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_csv.pyx", line 397, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: CSV conversion to decimal(11, 2) is not supported
{code}
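
A minimal reproduction sketch (this assumes a pyarrow version where 
{{pyarrow.csv.ConvertOptions(column_types=...)}} is available; the file and 
column names are illustrative):

{code:python}
import pyarrow as pa
import pyarrow.csv as pa_csv

# Ask the CSV reader to produce a decimal128 column; with no decimal converter
# registered this raises ArrowNotImplementedError.
opts = pa_csv.ConvertOptions(column_types={'amount': pa.decimal128(11, 2)})
table = pa_csv.read_csv('test.csv', convert_options=opts)
{code}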




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4884) [C++] conda-forge thrift-cpp package not available via pkg-config or cmake

2019-03-14 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4884:
---

 Summary: [C++] conda-forge thrift-cpp package not available via 
pkg-config or cmake
 Key: ARROW-4884
 URL: https://issues.apache.org/jira/browse/ARROW-4884
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.13.0


Artifact of CMake refactor

I opened https://github.com/conda-forge/thrift-cpp-feedstock/issues/35 about 
investigating why Thrift does not export the correct files



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4883) [Python] read_csv() gives mojibake if given file object in text mode

2019-03-14 Thread Diego Argueta (JIRA)
Diego Argueta created ARROW-4883:


 Summary: [Python] read_csv() gives mojibake if given file object 
in text mode
 Key: ARROW-4883
 URL: https://issues.apache.org/jira/browse/ARROW-4883
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.12.1
 Environment: Python: 3.7.2, 2.7.15
PyArrow: 0.12.1
OS: MacOS 10.13.6 (High Sierra)
Reporter: Diego Argueta


h1. Summary:

Python 3:

* {{read_csv}} returns mojibake if given file objects opened in text mode. It 
behaves as expected in binary mode.
* Files encoded in anything other than valid UTF-8 will cause a crash.

Python 2:

{{read_csv}} only handles ASCII files. If given a file in UTF-8 with characters 
over U+007F, it crashes.

h1. To reproduce:

1) Create a CSV like this

{code}
Header
123.45
{code}

2) Then run this code on Python 3:

{code:python}
>>> import pyarrow.csv as pa_csv
>>> pa_csv.read_csv(open('test.csv', 'r'))
pyarrow.Table
䧢: string
{code}

Notice the file descriptor is open in text mode. Changing the encoding doesn't 
help:

{code:python}
>>> pa_csv.read_csv(open('test.csv', 'r', encoding='utf-8'))
pyarrow.Table
䧢: string

>>> pa_csv.read_csv(open('test.csv', 'r', encoding='ascii'))
pyarrow.Table
䧢: string

>>> pa_csv.read_csv(open('test.csv', 'r', encoding='iso-8859-1'))
pyarrow.Table
䧢: string
{code}

If I open the file in binary mode it works:

{code:python}
>>> pa_csv.read_csv(open('test.csv', 'rb'))
pyarrow.Table
Header: double
{code}

I tried this with a file encoded in UTF-16 and it freaked out:

{code}  

Traceback (most recent call last):
  File 
"/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py",
 line 84, in _process_text
self._execute(line)
  File 
"/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py",
 line 139, in _execute
result_str = '%s\n' % repr(result).decode('utf-8')
  File "pyarrow/table.pxi", line 960, in pyarrow.lib.Table.__repr__
  File "pyarrow/types.pxi", line 903, in pyarrow.lib.Schema.__str__
  File 
"/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/compat.py",
 line 143, in frombytes
return o.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid 
start byte

'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
{code}

Presumably this is because the code always assumes the file is in UTF-8.

h2. Python 2 behavior

Python 2 behaves differently -- it uses the ASCII codec by default, so when 
handed a file encoded in UTF-8, it will return without an error. Try to access 
the table...

{code}
>>> t = pa_csv.read_csv(open('/Users/diegoargueta/Desktop/test.csv', 'r'))

>>> list(t)
Traceback (most recent call last):
  File 
"/Users/diegoargueta/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py",
 line 84, in _process_text
self._execute(line)
  File 
"/Users/diegoargueta/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py",
 line 139, in _execute
result_str = '%s\n' % repr(result).decode('utf-8')
  File "pyarrow/table.pxi", line 387, in pyarrow.lib.Column.__repr__
result.write('\n{}'.format(str(self.data)))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 11: 
ordinal not in range(128)

'ascii' codec can't decode byte 0xe4 in position 11: ordinal not in range(128)
{code}


h1. Expectation

We should be able to hand read_csv() a file in text mode so that the CSV file 
can be in any text encoding. 
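
In the meantime, a workaround sketch (assuming the file's encoding is known; 
this just transcodes the data to UTF-8 bytes before handing it to 
{{read_csv}}):

{code:python}
import io
import pyarrow.csv as pa_csv

# Read the file with its real encoding, re-encode as UTF-8, and pass bytes.
with open('test.csv', 'r', encoding='iso-8859-1') as f:
    data = f.read().encode('utf-8')

table = pa_csv.read_csv(io.BytesIO(data))
{code}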



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4882) Add "Count" and "Sum" functions

2019-03-14 Thread Yosuke Shiro (JIRA)
Yosuke Shiro created ARROW-4882:
---

 Summary: Add "Count" and "Sum" functions
 Key: ARROW-4882
 URL: https://issues.apache.org/jira/browse/ARROW-4882
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Yosuke Shiro
Assignee: Yosuke Shiro
 Fix For: 0.13.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4881) [Python] bundle_zlib CMake function still uses ARROW_BUILD_TOOLCHAIN

2019-03-14 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4881:
---

 Summary: [Python] bundle_zlib CMake function still uses 
ARROW_BUILD_TOOLCHAIN
 Key: ARROW-4881
 URL: https://issues.apache.org/jira/browse/ARROW-4881
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.13.0


Not sure if our wheels work, but: 
https://github.com/apache/arrow/blob/master/python/CMakeLists.txt#L278



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4880) [Python] python/asv-build.sh is probably broken after CMake refactor

2019-03-14 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4880:
---

 Summary: [Python] python/asv-build.sh is probably broken after 
CMake refactor
 Key: ARROW-4880
 URL: https://issues.apache.org/jira/browse/ARROW-4880
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.14.0


uses {{$ARROW_BUILD_TOOLCHAIN}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4879) [C++] cmake can't use conda's flatbuffers

2019-03-14 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-4879:


 Summary: [C++] cmake can't use conda's flatbuffers
 Key: ARROW-4879
 URL: https://issues.apache.org/jira/browse/ARROW-4879
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman
 Fix For: 0.13.0


I'm using conda's flatbuffers, but after the cmake refactor I get the following 
error:
{code}
CMake Error at cmake_modules/ThirdpartyToolchain.cmake:146 (find_package):
  By not providing "FindFlatbuffers.cmake" in CMAKE_MODULE_PATH this project
  has asked CMake to find a package configuration file provided by
  "Flatbuffers", but CMake did not find one.

  Could not find a package configuration file provided by "Flatbuffers" with
  any of the following names:

FlatbuffersConfig.cmake
flatbuffers-config.cmake

  Add the installation prefix of "Flatbuffers" to CMAKE_PREFIX_PATH or set
  "Flatbuffers_DIR" to a directory containing one of the above files.  If
  "Flatbuffers" provides a separate development package or SDK, be sure it
  has been installed.
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4878) [C++] ARROW_DEPENDENCY_SOURCE=CONDA does not work properly with MSVC

2019-03-14 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4878:
---

 Summary: [C++] ARROW_DEPENDENCY_SOURCE=CONDA does not work 
properly with MSVC
 Key: ARROW-4878
 URL: https://issues.apache.org/jira/browse/ARROW-4878
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 0.13.0


The prefix must have {{\Library}} added to it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[Rust] Table/DataFrame style API

2019-03-14 Thread Andy Grove
Hi,

I have a PR open [1] to add a DataFrame/Table style API for building a
logical query plan. So far I only added a couple methods to it, but here is
a usage example:

let t = ctx.table("aggregate_test_100")?;

let t2 = t
  .select_columns(vec!["c1", "c2", "c11"])?
  .limit(10)?;

This builds the same logical plan as "SELECT c1, c2, c11 FROM
aggregate_test_100 LIMIT 10".

Adding more methods is mostly trivial but I wanted to get this initial PR
merged first so that I can add new methods as small separate PRs.

I'd appreciate some reviews if anyone has the bandwidth.

Thanks,

Andy.

[1] https://github.com/apache/arrow/pull/3671


Re: Timeline for 0.13 Arrow release

2019-03-14 Thread Wes McKinney
Out of the open / in-progress issues still in the backlog:

C++-related: 25
C#: 2
CI-related: 2
Dev tools: 2
Docs: 4
Flight: 3
Packaging: 4
Python: 23 (14 tagged as bugs)
Ruby: 1
Rust: 14

I'm going to try to grind out as many issues as I can in the next few
days, and at least get a sense of "how bad" some of the Python bugs
are. If we want to release _next week_, many things are not going to
get done. On the bug-fixing front, I think we should prioritize fixing
regressions.

On Thu, Mar 14, 2019 at 10:33 AM Krisztián Szűcs
 wrote:
>
> Submitted the packaging builds:
> https://github.com/kszucs/crossbow/branches/all?utf8=%E2%9C%93=build-452
>
> On Thu, Mar 14, 2019 at 4:19 PM Wes McKinney  wrote:
>
> > The CMake refactor is merged! Kudos to Uwe for 3+ weeks of hard labor on
> > this.
> >
> > We should run all the packaging tasks and get a full accounting of
> > what is broken so we aren't surprised during the release process
> >
> > On Wed, Mar 13, 2019 at 9:39 AM Krisztián Szűcs
> >  wrote:
> > >
> > > The proof of the pudding is in the eating. You convinced me.
> > >
> > > On Wed, Mar 13, 2019 at 3:31 PM Wes McKinney 
> > wrote:
> > >
> > > > Krisztian -- are you all right with proceeding with merging the CMake
> > > > refactor? I'm pretty committed to helping fix the problems that come
> > > > up. Since most consumers of the project don't test until _after_ a
> > > > release, we won't find out about some problems until we merge it and
> > > > release it. Thus, IMHO it doesn't make sense to wait another 8-10
> > > > weeks since we'd be delaying feedback for that long. There are also a
> > > > number of follow-on issues blocking on the refactor
> > > >
> > > > On Tue, Mar 12, 2019 at 11:39 AM Andy Grove 
> > wrote:
> > > > >
> > > > > I've cleaned up my issues for Rust, moving most of them to 0.14.0.
> > > > >
> > > > > I have two PRs in progress that I would appreciate reviews on:
> > > > >
> > > > > https://github.com/apache/arrow/pull/3671 - [Rust] Table API (a.k.a
> > > > > DataFrame)
> > > > >
> > > > > https://github.com/apache/arrow/pull/3851 - [Rust] Parquet data
> > source
> > > > in
> > > > > DataFusion
> > > > >
> > > > > Once these are merged I have some small follow up PRs for 0.13.0
> > that I
> > > > can
> > > > > get done this week.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Andy.
> > > > >
> > > > >
> > > > > On Tue, Mar 12, 2019 at 8:21 AM Wes McKinney 
> > > > wrote:
> > > > >
> > > > > > hi folks,
> > > > > >
> > > > > > I think we are on track to be able to release toward the end of
> > this
> > > > > > month. My proposed timeline:
> > > > > >
> > > > > > * This week (March 11-15): feature/improvement push mostly
> > > > > > * Next week (March 18-22): shift to bug fixes, stabilization, empty
> > > > > > backlog of feature/improvement JIRAs
> > > > > > * Week of March 25: propose release candidate
> > > > > >
> > > > > > Does this seem reasonable? This puts us at about 9-10 weeks from
> > 0.12.
> > > > > >
> > > > > > We need an RM for 0.13, any PMCs want to volunteer?
> > > > > >
> > > > > > Take a look at our release page:
> > > > > >
> > > > > >
> > > >
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103091219
> > > > > >
> > > > > > Out of the open or in-progress issues, we have:
> > > > > >
> > > > > > * C#: 3 issues
> > > > > > * C++ (all components): 51 issues
> > > > > > * Java: 3 issues
> > > > > > * Python: 38 issues
> > > > > > * Rust (all components): 33 issues
> > > > > >
> > > > > > Please help curating the backlogs for each component. There's a
> > > > > > smattering of issues in other categories. There are also 10 open
> > > > > > issues with No Component (and 20 resolved issues), those need their
> > > > > > metadata fixed.
> > > > > >
> > > > > > Thanks,
> > > > > > Wes
> > > > > >
> > > > > > On Wed, Feb 27, 2019 at 1:49 PM Wes McKinney 
> > > > wrote:
> > > > > > >
> > > > > > > The timeline for the 0.13 release is drawing closer. I would say
> > we
> > > > > > > should consider a release candidate either the week of March 18
> > or
> > > > > > > March 25, which gives us ~3 weeks to close out backlog items.
> > > > > > >
> > > > > > > There are around 220 issues open or in-progress in
> > > > > > >
> > > > > > >
> > > > https://cwiki.apache.org/confluence/display/ARROW/Arrow+0.13.0+Release
> > > > > > >
> > > > > > > Please have a look. If issues are not assigned to someone as the
> > next
> > > > > > > couple of weeks pass by I'll begin moving at least C++ and Python
> > > > > > > issues to 0.14 that don't seem like they're going to get done for
> > > > > > > 0.13. If development stakeholders for C#, Java, Rust, Ruby, and
> > other
> > > > > > > components can review and curate the issues that would be
> > helpful.
> > > > > > >
> > > > > > > You can help keep the JIRA issues tidy by making sure to add Fix
> > > > > > > Version to issues and to make sure to add a Component so that
> > issues
> > > > > > > are properly categorized in the release notes.

[jira] [Created] (ARROW-4877) [Plasma] CI failure in test_plasma_list

2019-03-14 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-4877:
---

 Summary: [Plasma] CI failure in test_plasma_list
 Key: ARROW-4877
 URL: https://issues.apache.org/jira/browse/ARROW-4877
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Plasma, Continuous Integration, Python
Reporter: Kouhei Sutou
 Fix For: 0.14.0


https://api.travis-ci.org/v3/job/506259901/log.txt

{noformat}
=== FAILURES ===
___ test_plasma_list ___

@pytest.mark.plasma
def test_plasma_list():
import pyarrow.plasma as plasma

with plasma.start_plasma_store(
plasma_store_memory=DEFAULT_PLASMA_STORE_MEMORY) \
as (plasma_store_name, p):
plasma_client = plasma.connect(plasma_store_name)

# Test sizes
u, _, _ = create_object(plasma_client, 11, metadata_size=7, 
seal=False)
l1 = plasma_client.list()
assert l1[u]["data_size"] == 11
assert l1[u]["metadata_size"] == 7

# Test ref_count
v = plasma_client.put(np.zeros(3))
l2 = plasma_client.list()
# Ref count has already been released
assert l2[v]["ref_count"] == 0
a = plasma_client.get(v)
l3 = plasma_client.list()
assert l3[v]["ref_count"] == 1
del a

# Test state
w, _, _ = create_object(plasma_client, 3, metadata_size=0, 
seal=False)
l4 = plasma_client.list()
assert l4[w]["state"] == "created"
plasma_client.seal(w)
l5 = plasma_client.list()
assert l5[w]["state"] == "sealed"

# Test timestamps
t1 = time.time()
x, _, _ = create_object(plasma_client, 3, metadata_size=0, 
seal=False)
t2 = time.time()
l6 = plasma_client.list()
>   assert math.floor(t1) <= l6[x]["create_time"] <= math.ceil(t2)
E   assert 1552568478 <= 1552568477
E+  where 1552568478 = (1552568478.0022461)
E+where  = math.floor

../../pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/tests/test_plasma.py:1070:
 AssertionError
- Captured stderr call -
I0314 13:01:17.901209 19953 store.cc:1093] Allowing the Plasma store to use up 
to 0.1GB of memory.
I0314 13:01:17.901417 19953 store.cc:1120] Starting object store with directory 
/dev/shm and huge page support disabled
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4876) Port MutableBuffer to csharp

2019-03-14 Thread Prashanth Govindarajan (JIRA)
Prashanth Govindarajan created ARROW-4876:
-

 Summary: Port MutableBuffer to csharp
 Key: ARROW-4876
 URL: https://issues.apache.org/jira/browse/ARROW-4876
 Project: Apache Arrow
  Issue Type: Task
  Components: C#
Reporter: Prashanth Govindarajan


C++ has a "MutableBuffer" that exposes the underlying T*. Port it to csharp. 

It's an easy port. ArrowBuffer at the moment is exposed as ReadOnlyMemory<byte>. The 
builder actually hands it a Memory<byte> object, so it ought to be a simple change.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Passing File Descriptors in the Low-Level API

2019-03-14 Thread Wes McKinney
hi Brian,

This is mostly an Arrow platform question so I'm copying the Arrow mailing list.

You can open a file using an existing file descriptor using ReadableFile::Open

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.h#L145

The documentation for this function says:

"The file descriptor becomes owned by the ReadableFile, and will be
closed on Close() or destruction."

If you want to do the equivalent thing, but using memory mapping, I
think you'll need to add a corresponding API to MemoryMappedFile. This
is more perilous because of the API requirements of mmap -- you need
to pass the right flags and they may need to be the same flags that
were passed when opening the file descriptor, see

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L378

and

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L476

- Wes

On Thu, Mar 14, 2019 at 1:47 PM Brian Bowman  wrote:
>
>  The ReadableFile class (arrow/io/file.cc) has utility methods where a 
> FileDescriptor is either passed in or returned, but I don’t see how this 
> surfaces through the API.
>
> Is there a way for application code to control the open lifetime of mmap()’d 
> Parquet files by passing an already open FileDescriptor to Parquet low-level 
> API open/close methods?
>
> Thanks,
>
> Brian
>


RE: Publishing C# NuGet package

2019-03-14 Thread Eric Erhardt
Thanks Wes. I have a PR up for this.  https://github.com/apache/arrow/pull/3891

How do I update the wiki page? Is this source controlled somewhere?  I assume 
we want to add a new section after 
https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingRubypackages
 for "Updating C# NuGet package".

I put the instructions for building and uploading the package in the 
csharp/README.md file in my PR. It should be as simple as:

1. Install the latest `.NET Core SDK` from 
https://dotnet.microsoft.com/download.
2. ~/git/arrow/csharp$ dotnet pack -c Release -p:VersionSuffix=''
3. upload the .nupkg and .snupkg files from ~/git/arrow/csharp/artifacts/ to 
https://www.nuget.org/packages/manage/upload

Eric

-Original Message-
From: Wes McKinney  
Sent: Tuesday, March 12, 2019 9:36 AM
To: dev@arrow.apache.org
Subject: Re: Publishing C# NuGet package

thanks Eric -- that sounds great. I think we're going to want to cut the 0.13 
release candidate around 2 weeks from now, so that gives some time to get the 
packaging things sorted out

- Wes

On Thu, Mar 7, 2019 at 4:46 PM Eric Erhardt 
 wrote:
>
> > Some changes may need to be made to the release scripts to update C# 
> > metadata files. The intent it to make it so that the code artifact can be 
> > pushed to a package manager using the official ASF release artifact. If we 
> > don't get it 100% right for 0.13 then > at least we can get a preliminary 
> > package up there and do things 100% by the books in 0.14.
>
> The way you build a NuGet package is you call `dotnet pack` on the `.csproj` 
> file. That will build the .NET assembly (.dll) and package it into a NuGet 
> package (.nupkg, which is a glorified .zip file). That `.nupkg` file is then 
> published to the nuget.org website.
>
> In order to publish it to nuget.org, an account will need to be made to 
> publish it under. Is that something a PMC member can/will do? The intention 
> is for the published package to be the official "Apache Arrow" nuget package.
>
> The .nupkg file can optionally be signed. See 
> https://docs.microsoft.com/en-us/nuget/create-packages/sign-a-package.
>
> I can create a JIRA to add all the appropriate NuGet metadata to the .csproj 
> in the repo. That way no file committed into the repo will need to change in 
> order to create the NuGet package. I can also add the instructions to create 
> the NuGet into the csharp README file in that PR.


[jira] [Created] (ARROW-4875) [C++] MSVC Boost warnings after CMake refactor on cmake 3.12

2019-03-14 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4875:
---

 Summary: [C++] MSVC Boost warnings after CMake refactor on cmake 
3.12
 Key: ARROW-4875
 URL: https://issues.apache.org/jira/browse/ARROW-4875
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.13.0


I haven't investigated if this was present before the refactor, but since we 
set {{Boost_ADDITIONAL_VERSIONS}} in theory this "scary" warning should not 
show up

{code}
CMake Warning at C:/Program Files (x86)/Microsoft Visual 
Studio/2017/Community/Common7/IDE/CommonExtensions/Microsoft/CMake/CMake/share/cmake-3.12/Modules/FindBoost.cmake:847
 (message):
  New Boost version may have incorrect or missing dependencies and imported
  targets
Call Stack (most recent call first):
  C:/Program Files (x86)/Microsoft Visual 
Studio/2017/Community/Common7/IDE/CommonExtensions/Microsoft/CMake/CMake/share/cmake-3.12/Modules/FindBoost.cmake:959
 (_Boost_COMPONENT_DEPENDENCIES)
  C:/Program Files (x86)/Microsoft Visual 
Studio/2017/Community/Common7/IDE/CommonExtensions/Microsoft/CMake/CMake/share/cmake-3.12/Modules/FindBoost.cmake:1618
 (_Boost_MISSING_DEPENDENCIES)
  cmake_modules/ThirdpartyToolchain.cmake:1893 (find_package)
  CMakeLists.txt:536 (include)


CMake Warning at C:/Program Files (x86)/Microsoft Visual 
Studio/2017/Community/Common7/IDE/CommonExtensions/Microsoft/CMake/CMake/share/cmake-3.12/Modules/FindBoost.cmake:847
 (message):
  New Boost version may have incorrect or missing dependencies and imported
  targets
Call Stack (most recent call first):
  C:/Program Files (x86)/Microsoft Visual 
Studio/2017/Community/Common7/IDE/CommonExtensions/Microsoft/CMake/CMake/share/cmake-3.12/Modules/FindBoost.cmake:959
 (_Boost_COMPONENT_DEPENDENCIES)
  C:/Program Files (x86)/Microsoft Visual 
Studio/2017/Community/Common7/IDE/CommonExtensions/Microsoft/CMake/CMake/share/cmake-3.12/Modules/FindBoost.cmake:1618
 (_Boost_MISSING_DEPENDENCIES)
  cmake_modules/ThirdpartyToolchain.cmake:1893 (find_package)
  CMakeLists.txt:536 (include)


CMake Warning at C:/Program Files (x86)/Microsoft Visual 
Studio/2017/Community/Common7/IDE/CommonExtensions/Microsoft/CMake/CMake/share/cmake-3.12/Modules/FindBoost.cmake:847
 (message):
  New Boost version may have incorrect or missing dependencies and imported
  targets
Call Stack (most recent call first):
  C:/Program Files (x86)/Microsoft Visual 
Studio/2017/Community/Common7/IDE/CommonExtensions/Microsoft/CMake/CMake/share/cmake-3.12/Modules/FindBoost.cmake:959
 (_Boost_COMPONENT_DEPENDENCIES)
  C:/Program Files (x86)/Microsoft Visual 
Studio/2017/Community/Common7/IDE/CommonExtensions/Microsoft/CMake/CMake/share/cmake-3.12/Modules/FindBoost.cmake:1618
 (_Boost_MISSING_DEPENDENCIES)
  cmake_modules/ThirdpartyToolchain.cmake:1893 (find_package)
  CMakeLists.txt:536 (include)


-- Boost version: 1.69.0
-- Found the following Boost libraries:
--   regex
--   system
--   filesystem
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4874) Cannot read parquet from encrypted hdfs

2019-03-14 Thread Jesse Lord (JIRA)
Jesse Lord created ARROW-4874:
-

 Summary: Cannot read parquet from encrypted hdfs
 Key: ARROW-4874
 URL: https://issues.apache.org/jira/browse/ARROW-4874
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.12.0
 Environment: cloudera yarn cluster, red hat enterprise 7
Reporter: Jesse Lord


Using pyarrow 0.12 I was able to read parquet at first and then the admins 
added KMS servers and encrypted all of the files on the cluster. Now I get an 
error and the file system object can only read objects from the local file 
system of the edge node.

Reproducible example:

 

{code:python}
import pyarrow as pa

fs = pa.hdfs.connect()
with fs.open('/user/jlord/test_lots_of_parquet/', 'rb') as fil:
    _ = fil.read()
{code}

error:

 

{noformat}
19/03/14 10:29:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hdfsOpenFile(/user/jlord/test_lots_of_parquet/): FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream;) error:
FileNotFoundException: File /user/jlord/test_lots_of_parquet does not exist
java.io.FileNotFoundException: File /user/jlord/test_lots_of_parquet does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:598)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:432)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:344)
Traceback (most recent call last):
  File "local_hdfs.py", line 15, in <module>
    with fs.open(file, 'rb') as fil:
  File "pyarrow/io-hdfs.pxi", line 431, in pyarrow.lib.HadoopFileSystem.open
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS file does not exist: /user/jlord/test_lots_of_parquet/
{noformat}

 

If I specify a specific parquet file in that folder I get the following error:

 

{noformat}
19/03/14 10:07:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hdfsOpenFile(/user/jlord/test_lots_of_parquet/part-0-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet): FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream;) error:
FileNotFoundException: File /user/jlord/test_lots_of_parquet/part-0-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet does not exist
java.io.FileNotFoundException: File /user/jlord/test_lots_of_parquet/part-0-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:598)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:432)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:344)
Traceback (most recent call last):
  File "local_hdfs.py", line 15, in <module>
    with fs.open(file, 'rb') as fil:
  File "pyarrow/io-hdfs.pxi", line 431, in pyarrow.lib.HadoopFileSystem.open
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS file does not exist: /user/jlord/test_lots_of_parquet/part-0-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet
{noformat}

 

Not sure if this is relevant: spark can continue to read the parquet 
files, but it takes a cloudera specific version that can read the following KMS 
keys from the core-site.xml and hdfs-site.xml:

 

{noformat}
dfs.encryption.key.provider.uri
kms://ht...@server1.com;server2.com:16000/kms
{noformat}

 

Using the open source version of spark requires changing these xml values to:

 

{noformat}
dfs.encryption.key.provider.uri
kms://ht...@server1.com:16000/kms
kms://ht...@server2.com:16000/kms
{noformat}

 

Might need to point arrow to separate configuration xmls.
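
A sketch of one thing to try, assuming the libhdfs driver used by pyarrow picks 
up the standard Hadoop environment variables (the config directory path below 
is only illustrative):

{code:python}
import os
import subprocess
import pyarrow as pa

# Point libhdfs at a Hadoop config directory whose core-site.xml / hdfs-site.xml
# contain the KMS key provider settings, then connect as usual.
os.environ['HADOOP_CONF_DIR'] = '/etc/hadoop/conf.kms'
os.environ['CLASSPATH'] = subprocess.check_output(
    ['hadoop', 'classpath', '--glob']).decode().strip()

fs = pa.hdfs.connect()
path = ('/user/jlord/test_lots_of_parquet/'
        'part-0-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet')
with fs.open(path, 'rb') as fil:
    data = fil.read()
{code}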



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Timeline for 0.13 Arrow release

2019-03-14 Thread Krisztián Szűcs
Submitted the packaging builds:
https://github.com/kszucs/crossbow/branches/all?utf8=%E2%9C%93=build-452

On Thu, Mar 14, 2019 at 4:19 PM Wes McKinney  wrote:

> The CMake refactor is merged! Kudos to Uwe for 3+ weeks of hard labor on
> this.
>
> We should run all the packaging tasks and get a full accounting of
> what is broken so we aren't surprised during the release process
>
> On Wed, Mar 13, 2019 at 9:39 AM Krisztián Szűcs
>  wrote:
> >
> > The proof of the pudding is in the eating. You convinced me.
> >
> > On Wed, Mar 13, 2019 at 3:31 PM Wes McKinney 
> wrote:
> >
> > > Krisztian -- are you all right with proceeding with merging the CMake
> > > refactor? I'm pretty committed to helping fix the problems that come
> > > up. Since most consumers of the project don't test until _after_ a
> > > release, we won't find out about some problems until we merge it and
> > > release it. Thus, IMHO it doesn't make sense to wait another 8-10
> > > weeks since we'd be delaying feedback for that long. There are also a
> > > number of follow-on issues blocking on the refactor
> > >
> > > On Tue, Mar 12, 2019 at 11:39 AM Andy Grove 
> wrote:
> > > >
> > > > I've cleaned up my issues for Rust, moving most of them to 0.14.0.
> > > >
> > > > I have two PRs in progress that I would appreciate reviews on:
> > > >
> > > > https://github.com/apache/arrow/pull/3671 - [Rust] Table API (a.k.a
> > > > DataFrame)
> > > >
> > > > https://github.com/apache/arrow/pull/3851 - [Rust] Parquet data
> source
> > > in
> > > > DataFusion
> > > >
> > > > Once these are merged I have some small follow up PRs for 0.13.0
> that I
> > > can
> > > > get done this week.
> > > >
> > > > Thanks,
> > > >
> > > > Andy.
> > > >
> > > >
> > > > On Tue, Mar 12, 2019 at 8:21 AM Wes McKinney 
> > > wrote:
> > > >
> > > > > hi folks,
> > > > >
> > > > > I think we are on track to be able to release toward the end of
> this
> > > > > month. My proposed timeline:
> > > > >
> > > > > * This week (March 11-15): feature/improvement push mostly
> > > > > * Next week (March 18-22): shift to bug fixes, stabilization, empty
> > > > > backlog of feature/improvement JIRAs
> > > > > * Week of March 25: propose release candidate
> > > > >
> > > > > Does this seem reasonable? This puts us at about 9-10 weeks from
> 0.12.
> > > > >
> > > > > We need an RM for 0.13, any PMCs want to volunteer?
> > > > >
> > > > > Take a look at our release page:
> > > > >
> > > > >
> > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103091219
> > > > >
> > > > > Out of the open or in-progress issues, we have:
> > > > >
> > > > > * C#: 3 issues
> > > > > * C++ (all components): 51 issues
> > > > > * Java: 3 issues
> > > > > * Python: 38 issues
> > > > > * Rust (all components): 33 issues
> > > > >
> > > > > Please help curating the backlogs for each component. There's a
> > > > > smattering of issues in other categories. There are also 10 open
> > > > > issues with No Component (and 20 resolved issues), those need their
> > > > > metadata fixed.
> > > > >
> > > > > Thanks,
> > > > > Wes
> > > > >
> > > > > On Wed, Feb 27, 2019 at 1:49 PM Wes McKinney 
> > > wrote:
> > > > > >
> > > > > > The timeline for the 0.13 release is drawing closer. I would say
> we
> > > > > > should consider a release candidate either the week of March 18
> or
> > > > > > March 25, which gives us ~3 weeks to close out backlog items.
> > > > > >
> > > > > > There are around 220 issues open or in-progress in
> > > > > >
> > > > > >
> > > https://cwiki.apache.org/confluence/display/ARROW/Arrow+0.13.0+Release
> > > > > >
> > > > > > Please have a look. If issues are not assigned to someone as the
> next
> > > > > > couple of weeks pass by I'll begin moving at least C++ and Python
> > > > > > issues to 0.14 that don't seem like they're going to get done for
> > > > > > 0.13. If development stakeholders for C#, Java, Rust, Ruby, and
> other
> > > > > > components can review and curate the issues that would be
> helpful.
> > > > > >
> > > > > > You can help keep the JIRA issues tidy by making sure to add Fix
> > > > > > Version to issues and to make sure to add a Component so that
> issues
> > > > > > are properly categorized in the release notes.
> > > > > >
> > > > > > Thanks
> > > > > > Wes
> > > > > >
> > > > > > On Sat, Feb 9, 2019 at 10:39 AM Wes McKinney <
> wesmck...@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > See
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide
> > > > > > >
> > > > > > > The source release step is one of the places where problems
> occur.
> > > > > > >
> > > > > > > On Sat, Feb 9, 2019, 10:33 AM  > > > > > >>
> > > > > > >>
> > > > > > >> > On Feb 8, 2019, at 9:19 AM, Uwe L. Korn 
> > > wrote:
> > > > > > >> >
> > > > > > >> > We could dockerize some of the release steps to ensure that
> they
> > > > > run in the same environment.
> > > > > > >>
> > > > > > >> I may be able to help with said Dockerization. If not 

[jira] [Created] (ARROW-4873) [C++] ARROW_DEPENDENCY_SOURCE should not be overridden to CONDA if ARROW_PACKAGE_PREFIX is set by user

2019-03-14 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4873:
---

 Summary: [C++] ARROW_DEPENDENCY_SOURCE should not be overridden to 
CONDA if ARROW_PACKAGE_PREFIX is set by user
 Key: ARROW-4873
 URL: https://issues.apache.org/jira/browse/ARROW-4873
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 0.13.0


I use conda to manage Python dependencies but keep my C++ toolchain in a 
separate directory. This organizational scheme is incompatible with the new 
options after the CMake refactor

I think if you pass {{-DARROW_PREFIX_PATH=$MY_CPP_TOOLCHAIN}} then this should 
not be overridden with {{$CONDA_PREFIX}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4872) [Python] Keep backward compatibility for ParquetDatasetPiece

2019-03-14 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-4872:
--

 Summary: [Python] Keep backward compatibility for 
ParquetDatasetPiece
 Key: ARROW-4872
 URL: https://issues.apache.org/jira/browse/ARROW-4872
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Krisztian Szucs
 Fix For: 0.13.0


See 
https://github.com/apache/arrow/commit/f2fb02b82b60ba9a90c8bad6e5b11e37fc3ea9d3#r32722497

and 

https://github.com/dask/dask/pull/4587



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4870) ruby gemspec has wrong msys2 dependency listed

2019-03-14 Thread Dominic Sisneros (JIRA)
Dominic Sisneros created ARROW-4870:
---

 Summary:  ruby gemspec has wrong msys2 dependency listed
 Key: ARROW-4870
 URL: https://issues.apache.org/jira/browse/ARROW-4870
 Project: Apache Arrow
  Issue Type: Bug
  Components: Ruby
Affects Versions: 0.12.1
Reporter: Dominic Sisneros
 Fix For: 0.13.0


 ruby gemspec has wrong msys2 dependency listed

change msys2_mingw_dependencies to the correct package

pacman -Ss arrow
mingw32/mingw-w64-i686-arrow 0.11.1-1
Apache Arrow is a cross-language development platform for in-memory data 
(mingw-w64)
mingw64/mingw-w64-x86_64-arrow 0.11.1-1 [installed]
Apache Arrow is a cross-language development platform for in-memory data 
(mingw-w64)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Discuss][Java, Non-C++ generally] Support for 64-bit int array lengths?

2019-03-14 Thread Wes McKinney
hi Micah,

Given the constraints from Netty in Java, I would say that it makes
sense to raise an exception if encountering a Field length exceeding
2^31 - 1 in length (I think there are already some checks, but we can
add more checks during the IPC metadata read pass). With shared memory
/ zero copy in Java happening _eventually_
(https://issues.apache.org/jira/browse/ARROW-3191) this is becoming
more of a realistic issue, since someone may produce a massive dataset
and then try to read it in Java.

64-bit variable-size offsets (i.e. LargeList, LargeBinary /
LargeString) are a different matter. A list or varbinary vector could
have 64-bit offsets, that should not cause any issues. We need these
in C++ to unblock some real-world use cases with embedding large
objects in Arrow data structures and reading them from shared memory
with zero copy. If an implementation is unable to read such huge data
structures due to structural limitations we need only document this.

- Wes

On Thu, Mar 14, 2019 at 4:41 AM Ravindra Pindikura  wrote:
>
> @Jacques Nadeau  would have more background on this.
> Here's my understanding :
>
> On Thu, Mar 14, 2019 at 12:08 PM Micah Kornfield 
> wrote:
>
> > I was working on a proof of concept java implementation for LargeList  [1]
> > implementation (64-bit array offsets).  Our Java implementation doesn't
> > appear to support Vectors/Arrays larger than Integer.MAX_VALUE addressable
> > space.
> >
> > It looks like Message.fbs was updated quite a while ago to support 64-bit
> > lengths/offsets [2].  I had some questions:
> >
> > 1.  For Java:
> >   * Is my assessment accurate that it doesn't support 64-bit ranged sizes?
> >
>
> yes.
>
>
> >   * Is there a desire to support the 64 bit sizes? (I didn't come across
> > any JIRAs when I did a search)
> >
>
> no, afaik.
>
>
> >  *  Is there a technical blocker for doing so?
> >
>
> - big change
> - arrow uses the netty allocator. that also uses int (32-bit) for capacity.
>
> https://netty.io/4.0/xref/io/netty/buffer/ByteBufAllocator.html#84
>
>
>  * Any thoughts on approach for doing such a large change (I'm mostly
> > concerned with breaking existing consumers/performance regressions)?
> >- Given that the Java code base appears relatively stable, it might be
> > that forking and creating a version "2.0" is the best viable option.
> >
> > 2.  For other language implementations, is there support for 64-bit sizes
> > or only 32-bit?
> >
> > Thanks,
> > Micah
> >
> > P.S. It looks like our spec docs are out of date in regards to this issue,
> > they still list Int::MAX_VALUE as the largest possible array, it is on my
> > plate to update and consolidate them.
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-4810
> > [2]
> >
> > https://github.com/apache/arrow/commit/ced9d766d70e84c4d0542c6f5d9bd57faf10781d
> >
>
>
> --
> Thanks and regards,
> Ravindra.


[jira] [Created] (ARROW-4871) [Flight][Java] Handle large Flight messages

2019-03-14 Thread David Li (JIRA)
David Li created ARROW-4871:
---

 Summary: [Flight][Java] Handle large Flight messages
 Key: ARROW-4871
 URL: https://issues.apache.org/jira/browse/ARROW-4871
 Project: Apache Arrow
  Issue Type: Bug
  Components: FlightRPC, Java
Reporter: David Li
Assignee: David Li
 Fix For: 0.14.0


Similarly to ARROW-4421, Java/gRPC needs to be configured to allow large 
messages. The integration tests should also be updated to cover this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4869) [C++] Use of gmock fails in compute/kernels/util-internal-test.cc

2019-03-14 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4869:
---

 Summary: [C++] Use of gmock fails in 
compute/kernels/util-internal-test.cc 
 Key: ARROW-4869
 URL: https://issues.apache.org/jira/browse/ARROW-4869
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.13.0


Out of the box build with 

{code}
cmake .. -DARROW_BUILD_TESTS=ON -DARROW_BUILD_BENCHMARKS=ON -DARROW_PARQUET=ON 
-DARROW_GANDIVA=ON -DARROW_FLIGHT=ON -DARROW_BOOST_VENDORED=ON
{code}

{code}
[ 57%] Building CXX object 
src/arrow/ipc/CMakeFiles/arrow-ipc-json-simple-test.dir/json-simple-test.cc.o
/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/util-internal-test.cc:152:3:
 error: reference to non-static member function must be called; did you mean to 
call it with no arguments?
  
((mock).gmock_out_type).InternalExpectedAt("/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/util-internal-test.cc",
 152, "mock", "out_type").WillRepeatedly(Return(boolean()));
  ^~~
 ()
/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/util-internal-test.cc:173:3:
 error: reference to non-static member function must be called; did you mean to 
call it with no arguments?
  
((mock).gmock_out_type).InternalExpectedAt("/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/util-internal-test.cc",
 173, "mock", "out_type").WillRepeatedly(Return(int32()));
  ^~~
 ()
/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/util-internal-test.cc:192:3:
 error: reference to non-static member function must be called; did you mean to 
call it with no arguments?
  
((mock).gmock_out_type).InternalExpectedAt("/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/util-internal-test.cc",
 192, "mock", "out_type").WillRepeatedly(Return(boolean()));
  ^~~
 ()
/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/util-internal-test.cc:213:3:
 error: reference to non-static member function must be called; did you mean to 
call it with no arguments?
  
((mock).gmock_out_type).InternalExpectedAt("/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/util-internal-test.cc",
 213, "mock", "out_type").WillRepeatedly(Return(int32()));
  ^~~
 ()
4 errors generated.
make[2]: *** 
[src/arrow/compute/kernels/CMakeFiles/arrow-compute-util-internal-test.dir/util-internal-test.cc.o]
 Error 1
make[1]: *** 
[src/arrow/compute/kernels/CMakeFiles/arrow-compute-util-internal-test.dir/all] 
Error 2
make[1]: *** Waiting for unfinished jobs
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4868) [C++][Gandiva] Build fails with system Boost on Ubuntu Trusty 14.04

2019-03-14 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4868:
---

 Summary: [C++][Gandiva] Build fails with system Boost on Ubuntu 
Trusty 14.04
 Key: ARROW-4868
 URL: https://issues.apache.org/jira/browse/ARROW-4868
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva
Reporter: Wes McKinney
 Fix For: 0.14.0


It would be nice for things to work out of the box, but maybe not worth it. I 
can use vendored Boost for now

{code}
/usr/include/boost/functional/hash/extensions.hpp:269:20: error: no matching 
function for call to 'hash_value'
return hash_value(val);
   ^~
/usr/include/boost/functional/hash/hash.hpp:249:17: note: in instantiation of 
member function 'boost::hash 
>::operator()' requested here
seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
^
/home/wesm/code/arrow/cpp/src/gandiva/filter_cache_key.h:40:12: note: in 
instantiation of function template specialization 
'boost::hash_combine >' requested here
boost::hash_combine(result, configuration);
   ^
/usr/include/boost/functional/hash/extensions.hpp:70:17: note: candidate 
template ignored: could not match 'pair' against 'shared_ptr'
std::size_t hash_value(std::pair const& v)
^
/usr/include/boost/functional/hash/extensions.hpp:79:17: note: candidate 
template ignored: could not match 'vector' against 'shared_ptr'
std::size_t hash_value(std::vector const& v)
^
/usr/include/boost/functional/hash/extensions.hpp:85:17: note: candidate 
template ignored: could not match 'list' against 'shared_ptr'
std::size_t hash_value(std::list const& v)
^
/usr/include/boost/functional/hash/extensions.hpp:91:17: note: candidate 
template ignored: could not match 'deque' against 'shared_ptr'
std::size_t hash_value(std::deque const& v)
^
/usr/include/boost/functional/hash/extensions.hpp:97:17: note: candidate 
template ignored: could not match 'set' against 'shared_ptr'
std::size_t hash_value(std::set const& v)
^
/usr/include/boost/functional/hash/extensions.hpp:103:17: note: candidate 
template ignored: could not match 'multiset' against 'shared_ptr'
std::size_t hash_value(std::multiset const& v)
^
/usr/include/boost/functional/hash/extensions.hpp:109:17: note: candidate 
template ignored: could not match 'map' against 'shared_ptr'
std::size_t hash_value(std::map const& v)
^
/usr/include/boost/functional/hash/extensions.hpp:115:17: note: candidate 
template ignored: could not match 'multimap' against 'shared_ptr'
std::size_t hash_value(std::multimap const& v)
^
/usr/include/boost/functional/hash/extensions.hpp:121:17: note: candidate 
template ignored: could not match 'complex' against 'shared_ptr'
std::size_t hash_value(std::complex const& v)
^
/usr/include/boost/functional/hash/hash.hpp:187:57: note: candidate template 
ignored: substitution failure [with T = 
std::shared_ptr]: no type named 'type' in 
'boost::hash_detail::basic_numbers >'
typename boost::hash_detail::basic_numbers::type hash_value(T v)
    ^
/usr/include/boost/functional/hash/hash.hpp:193:56: note: candidate template 
ignored: substitution failure [with T = 
std::shared_ptr]: no type named 'type' in 
'boost::hash_detail::long_numbers >'
typename boost::hash_detail::long_numbers::type hash_value(T v)
   ^
/usr/include/boost/functional/hash/hash.hpp:199:57: note: candidate template 
ignored: substitution failure [with T = 
std::shared_ptr]: no type named 'type' in 
'boost::hash_detail::ulong_numbers >'
typename boost::hash_detail::ulong_numbers::type hash_value(T v)
    ^
/usr/include/boost/functional/hash/hash.hpp:205:31: note: candidate template 
ignored: disabled by 'enable_if' [with T = 
std::shared_ptr]
typename boost::enable_if, std::size_t>::type
  ^
/usr/include/boost/functional/hash/hash.hpp:213:36: note: candidate template 
ignored: could not match 'T *const' against 'const 
std::shared_ptr'
template  std::size_t hash_value(T* const& v)
   ^
/usr/include/boost/functional/hash/hash.hpp:306:24: note: candidate template 
ignored: could not match 'const T [N]' against 'const 
std::shared_ptr'
inline std::size_t hash_value(const T ()[N])
   ^
/usr/include/boost/functional/hash/hash.hpp:312:24: note: candidate template 
ignored: could not match 'T [N]' against 'const 
std::shared_ptr'
inline std::size_t hash_value(T ()[N])
   ^
/usr/include/boost/functional/hash/hash.hpp:319:24: note: candidate template 
ignored: could not match 'basic_string' 

[jira] [Created] (ARROW-4866) [C++] zstd ExternalProject failing on Windows

2019-03-14 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4866:
--

 Summary: [C++] zstd ExternalProject failing on Windows
 Key: ARROW-4866
 URL: https://issues.apache.org/jira/browse/ARROW-4866
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Packaging
Reporter: Uwe L. Korn
 Fix For: 0.13.0


After [https://github.com/apache/arrow/pull/3885], the zstd ExternalProject is 
failing in the Windows builds, see 
[https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/23063072/job/bd0gom16atlkddtx]

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Call for presentations - ApacheCon North America

2019-03-14 Thread Zoran Regvart
Hi Apache developers,
(apologies if you're receiving this e-mail multiple times on different
dev@ lists)

I'd like to draw your attention to call for presentations that is now
open for ApacheCon North America 2019 -- marking the 20 year
anniversary of ASF; that will be held in Las Vegas this September.

I'm sure that you're aware of the ever increasing need for integrating
different software systems, be it cloud/SaaS offerings or in-house
custom applications and databases. I find the field of system
integration, and the open source projects at ASF that help with this
issue, a great topic of interest. So I've selected, to the best of my
ability, your project as one that is applicable in this area.

I'm chairing a whole day track Integration track focusing on system
integration projects at ASF and I would like to invite you to talk
about your project. I think this is a great venue to present and get
in touch with the wider ASF community. I'm specifically looking
forward to talks dealing with the state of the project and customer
success stories.

So please submit your proposals at:

https://apachecon.com/acna19/index.html (click on the IS NOW OPEN!).

The submission deadline is Monday, May 13.

zoran
-- 
Zoran Regvart


Re: [Discuss][Java, Non-C++ generally] Support for 64-bit int array lengths?

2019-03-14 Thread Ravindra Pindikura
@Jacques Nadeau  would have more background on this.
Here's my understanding :

On Thu, Mar 14, 2019 at 12:08 PM Micah Kornfield 
wrote:

> I was working on a proof of concept java implementation for LargeList  [1]
> implementation (64-bit array offsets).  Our Java implementation doesn't
> appear to support Vectors/Arrays larger than Integer.MAX_VALUE addressable
> space.
>
> It looks like Message.fbs was updated quite a while ago to support 64-bit
> lengths/offsets [2].  I had some questions:
>
> 1.  For Java:
>   * Is my assessment accurate that it doesn't support 64-bit ranged sizes?
>

yes.


>   * Is there a desire to support the 64 bit sizes? (I didn't come across
> any JIRAs when I did a search)
>

no, afaik.


>  *  Is there a technical blocker for doing so?
>

- big change
- arrow uses the netty allocator. that also uses int (32-bit) for capacity.

https://netty.io/4.0/xref/io/netty/buffer/ByteBufAllocator.html#84


 * Any thoughts on approach for doing such a large change (I'm mostly
> concerned with breaking existing consumers/performance regressions)?
>- Given that the Java code base appears relatively stable, it might be
> that forking and creating a version "2.0" is the best viable option.
>
> 2.  For other language implementations, is there support for 64-bit sizes
> or only 32-bit?
>
> Thanks,
> Micah
>
> P.S. It looks like our spec docs are out of date in regards to this issue,
> they still list Int::MAX_VALUE as the largest possible array, it is on my
> plate to update and consolidate them.
>
> [1] https://issues.apache.org/jira/browse/ARROW-4810
> [2]
>
> https://github.com/apache/arrow/commit/ced9d766d70e84c4d0542c6f5d9bd57faf10781d
>


-- 
Thanks and regards,
Ravindra.


[jira] [Created] (ARROW-4865) [Rust] Support casting lists and primitives to lists

2019-03-14 Thread Neville Dipale (JIRA)
Neville Dipale created ARROW-4865:
-

 Summary: [Rust] Support casting lists and primitives to lists
 Key: ARROW-4865
 URL: https://issues.apache.org/jira/browse/ARROW-4865
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Affects Versions: 0.12.0
Reporter: Neville Dipale


This adds support for casting between list arrays and from primitive arrays to 
single-value list arrays



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[Discuss][Java, Non-C++ generally] Support for 64-bit int array lengths?

2019-03-14 Thread Micah Kornfield
I was working on a proof of concept java implementation for LargeList  [1]
implementation (64-bit array offsets).  Our Java implementation doesn't
appear to support Vectors/Arrays larger than Integer.MAX_VALUE addressable
space.

It looks like Message.fbs was updated quite a while ago to support 64-bit
lengths/offsets [2].  I had some questions:

1.  For Java:
  * Is my assessment accurate that it doesn't support 64-bit ranged sizes?
  * Is there a desire to support the 64 bit sizes? (I didn't come across
any JIRAs when I did a search)
 *  Is there a technical blocker for doing so?
 * Any thoughts on approach for doing such a large change (I'm mostly
concerned with breaking existing consumers/performance regressions)?
   - Given that the Java code base appears relatively stable, it might be
that forking and creating a version "2.0" is the best viable option.

2.  For other language implementations, is there support for 64-bit sizes
or only 32-bit?

Thanks,
Micah

P.S. It looks like our spec docs are out of date in regards to this issue,
they still list Int::MAX_VALUE as the largest possible array, it is on my
plate to update and consolidate them.

[1] https://issues.apache.org/jira/browse/ARROW-4810
[2]
https://github.com/apache/arrow/commit/ced9d766d70e84c4d0542c6f5d9bd57faf10781d