[jira] [Created] (ARROW-9033) [Python] Add tests to verify that one can build a C++ extension against the manylinux1 wheels
Wes McKinney created ARROW-9033: --- Summary: [Python] Add tests to verify that one can build a C++ extension against the manylinux1 wheels Key: ARROW-9033 URL: https://issues.apache.org/jira/browse/ARROW-9033 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Some projects want to be able to use the Python wheels to build other Python packages with C++ extensions that need to link against libarrow.so. It would be great if someone would add automated tests to ensure that our wheel builds can be used successfully in this fashion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9032) [C++] Split arrow/util/bit_util.h into multiple header files
Wes McKinney created ARROW-9032: --- Summary: [C++] Split arrow/util/bit_util.h into multiple header files Key: ARROW-9032 URL: https://issues.apache.org/jira/browse/ARROW-9032 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 This header has grown quite large and any given compilation unit's use of it is likely limited to only a couple of functions or classes. I suspect it would improve compilation time to split up this header into a few headers organized by frequency of code use.
[jira] [Created] (ARROW-9031) [R] Implement conversion from Type::UINT64 to R vector
Wes McKinney created ARROW-9031: --- Summary: [R] Implement conversion from Type::UINT64 to R vector Key: ARROW-9031 URL: https://issues.apache.org/jira/browse/ARROW-9031 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Wes McKinney Fix For: 1.0.0 This case is not handled in array_to_vector.cpp
[jira] [Created] (ARROW-9030) [Python] Clean up some usages of pyarrow.compat, move some common functions/symbols to lib.pyx
Wes McKinney created ARROW-9030: --- Summary: [Python] Clean up some usages of pyarrow.compat, move some common functions/symbols to lib.pyx Key: ARROW-9030 URL: https://issues.apache.org/jira/browse/ARROW-9030 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney I started doing this while looking into ARROW-4633
Re: [DISCUSS] Add kernel integer overflow handling
On Wed, Jun 3, 2020 at 11:25 AM Krisztián Szűcs wrote: > > On Wed, Jun 3, 2020 at 6:16 PM Krisztián Szűcs > wrote: > > > > On Wed, Jun 3, 2020 at 5:52 PM Wes McKinney wrote: > > > > > > On Wed, Jun 3, 2020 at 10:49 AM Krisztián Szűcs > > > wrote: > > > > > > > > From the user perspective I find the following pretty confusing: > > > > > > > > In [1]: np.array([-128, 127], dtype=np.int8()) * 2 > > > > Out[1]: array([ 0, -2], dtype=int8) > > > > > > > > In [2]: np.array([-128, 127], dtype=np.int16()) * 2 > > > > Out[2]: array([-256, 254], dtype=int16) > > > > > > > > In my opinion somewhere (on a higher level maybe) we should provide > > > > the correct results promoted to a wider type implicitly. > > > > > > Yes, I agree with you, but I agree that the best place to address this > > > is at a higher level rather than having this logic implemented at the > > > lowest level (kernels) -- I think database systems handle this during > > > logical->physical planning. > > > > It raises another question: where to incorporate the implicit promotions? > > // correct me if I'm wrong but these implicit promotions are operation > > // dependent and distinct from kernel dispatching issue [1] > > > > The numpy example above can be roughly translated to: > > >>> a = pa.array([-128, 127]) > > >>> pa.compute.add(a, a) > > array([ 0, -2]) > > > > Which is rather surprising from the user's perspective. > > Would it be enough to document the exact behavior and advise the user > to place casts until we have logical -> physical machinery? I think it's enough to clearly document the behavior and assume that the "user" will act according to what semantics are desired for their use cases. Per my comments in my last e-mail I don't think the users of these functions need to be handled with "kid gloves". > I'm updating my PR as discussed. > > > > [1] https://issues.apache.org/jira/browse/ARROW-8919 > > > > > > > Clickhouse for example does the type promotion.
> > > > > > > > On Wed, Jun 3, 2020 at 5:29 PM Antoine Pitrou > > > > wrote: > > > > > > > > > > On Wed, 3 Jun 2020 10:47:38 -0400 > > > > > Ben Kietzman wrote: > > > > > > https://github.com/apache/arrow/pull/7341#issuecomment-638241193 > > > > > > > > > > > > How should arithmetic kernels handle integer overflow? > > > > > > > > > > > > The approach currently taken in the linked PR is to promote such > > > > > > that > > > > > > overflow will not occur, for example `(int8, int8)->int16` and > > > > > > `(uint16, > > > > > > uint16)->uint32`. > > > > > > > > > > > > I'm not sure that's desirable. For one thing this leads to > > > > > > inconsistent > > > > > > handling of 64 bit integer types, which are currently allowed to > > > > > > overflow > > > > > > since we cannot promote further (NB: that means this kernel includes > > > > > > undefined behavior for int64). > > > > > > > > > > I agree with you. I would strongly advise against implicit promotion > > > > > accross arithmetic operations. We initially did that in Numba and it > > > > > quickly became a can of worms. > > > > > > > > > > The most desirable behaviour IMHO is to keep the original type, so: > > > > > - (int8, int8) -> int8 > > > > > - (uint16, uint16) -> uint16 > > > > > > > > > > Then the question is what happens when the actual overflow occurs. I > > > > > think this should be directed by a kernel option. By default an error > > > > > should probably be raised (letting errors pass and silently produce > > > > > erroneous data is wrong), but we might want to allow people to bypass > > > > > overflow checks for speed. > > > > > > > > > > Note that even if overflow detection is enabled, it *should* be > > > > > possible > > > > > to enable vectorization, e.g. by making overflow detection a separate > > > > > pass (itself vectorizable). > > > > > > > > > > Regards > > > > > > > > > > Antoine. > > > > > > > > > >
Re: [DISCUSS] Add kernel integer overflow handling
On Wed, Jun 3, 2020 at 11:16 AM Krisztián Szűcs wrote: > > On Wed, Jun 3, 2020 at 5:52 PM Wes McKinney wrote: > > > > On Wed, Jun 3, 2020 at 10:49 AM Krisztián Szűcs > > wrote: > > > > > > From the user perspective I find the following pretty confusing: > > > > > > In [1]: np.array([-128, 127], dtype=np.int8()) * 2 > > > Out[1]: array([ 0, -2], dtype=int8) > > > > > > In [2]: np.array([-128, 127], dtype=np.int16()) * 2 > > > Out[2]: array([-256, 254], dtype=int16) > > > > > > In my opinion somewhere (on a higher level maybe) we should provide > > > the correct results promoted to a wider type implicitly. > > > > Yes, I agree with you, but I agree that the best place to address this > > is at a higher level rather than having this logic implemented at the > > lowest level (kernels) -- I think database systems handle this during > > logical->physical planning. > > It raises another question: where to incorporate the implicit promotions? > // correct me if I'm wrong but these implicit promotions are operation > // dependent and distinct from kernel dispatching issue [1] Implicit promotions would be handled when translating a logical expression (like "Add(x, y)") to a physical bound expression (something like "add_kernel[int16, int16](cast(x, int16), cast(y, int16))", where x and y are int8). > > The numpy example above can be roughly translated to: > >>> a = pa.array([-128, 127]) > >>> pa.compute.add(a, a) > array([ 0, -2] > > Which is rather surprising from the user's perspective. I guess we have to qualify what is a "user". The direct "user" of the kernels is not the same as a user who would write a SQL query or use a data frame library. I believe the contents of pyarrow.compute (and most of pyarrow, FWIW) are intended for system developers not end users (e.g. "people who use pandas"). > > [1] https://issues.apache.org/jira/browse/ARROW-8919 > > > > > Clickhouse for example does the type promotion. 
> > > > > > On Wed, Jun 3, 2020 at 5:29 PM Antoine Pitrou wrote: > > > > > > > > On Wed, 3 Jun 2020 10:47:38 -0400 > > > > Ben Kietzman wrote: > > > > > https://github.com/apache/arrow/pull/7341#issuecomment-638241193 > > > > > > > > > > How should arithmetic kernels handle integer overflow? > > > > > > > > > > The approach currently taken in the linked PR is to promote such that > > > > > overflow will not occur, for example `(int8, int8)->int16` and > > > > > `(uint16, > > > > > uint16)->uint32`. > > > > > > > > > > I'm not sure that's desirable. For one thing this leads to > > > > > inconsistent > > > > > handling of 64 bit integer types, which are currently allowed to > > > > > overflow > > > > > since we cannot promote further (NB: that means this kernel includes > > > > > undefined behavior for int64). > > > > > > > > I agree with you. I would strongly advise against implicit promotion > > > > accross arithmetic operations. We initially did that in Numba and it > > > > quickly became a can of worms. > > > > > > > > The most desirable behaviour IMHO is to keep the original type, so: > > > > - (int8, int8) -> int8 > > > > - (uint16, uint16) -> uint16 > > > > > > > > Then the question is what happens when the actual overflow occurs. I > > > > think this should be directed by a kernel option. By default an error > > > > should probably be raised (letting errors pass and silently produce > > > > erroneous data is wrong), but we might want to allow people to bypass > > > > overflow checks for speed. > > > > > > > > Note that even if overflow detection is enabled, it *should* be possible > > > > to enable vectorization, e.g. by making overflow detection a separate > > > > pass (itself vectorizable). > > > > > > > > Regards > > > > > > > > Antoine. > > > > > > > >
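[Editor's note] The logical-to-physical binding Wes describes above can be sketched in a few lines of Python. This is purely illustrative: the names `promote`, `bind_add`, and `WIDER` are hypothetical, not Arrow APIs; the sketch only shows where a planner would insert the implicit casts before dispatching to a concrete kernel.

```python
# Widening order for signed integer types (illustrative subset).
WIDER = {"int8": "int16", "int16": "int32", "int32": "int64"}

def promote(lhs: str, rhs: str) -> str:
    """Pick an output type one step wider than the widest input, so the
    addition cannot overflow (int64 cannot widen further)."""
    widest = max(lhs, rhs, key=lambda t: int(t.replace("int", "")))
    return WIDER.get(widest, widest)

def bind_add(lhs: str, rhs: str) -> str:
    """Translate logical Add(x, y) into a bound physical expression,
    with the casts inserted by the planner rather than the kernel."""
    out = promote(lhs, rhs)
    return f"add_kernel[{out}, {out}](cast(x, {out}), cast(y, {out}))"

print(bind_add("int8", "int8"))
# add_kernel[int16, int16](cast(x, int16), cast(y, int16))
```

With this split, the kernels themselves stay type-preserving and the promotion policy lives entirely in the planning layer, which is the position both Wes and Antoine take in this thread.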
Re: [DISCUSS] Add kernel integer overflow handling
On Wed, Jun 3, 2020 at 6:16 PM Krisztián Szűcs wrote: > > On Wed, Jun 3, 2020 at 5:52 PM Wes McKinney wrote: > > > > On Wed, Jun 3, 2020 at 10:49 AM Krisztián Szűcs > > wrote: > > > > > > From the user perspective I find the following pretty confusing: > > > > > > In [1]: np.array([-128, 127], dtype=np.int8()) * 2 > > > Out[1]: array([ 0, -2], dtype=int8) > > > > > > In [2]: np.array([-128, 127], dtype=np.int16()) * 2 > > > Out[2]: array([-256, 254], dtype=int16) > > > > > > In my opinion somewhere (on a higher level maybe) we should provide > > > the correct results promoted to a wider type implicitly. > > > > Yes, I agree with you, but I agree that the best place to address this > > is at a higher level rather than having this logic implemented at the > > lowest level (kernels) -- I think database systems handle this during > > logical->physical planning. > > It raises another question: where to incorporate the implicit promotions? > // correct me if I'm wrong but these implicit promotions are operation > // dependent and distinct from kernel dispatching issue [1] > > The numpy example above can be roughly translated to: > >>> a = pa.array([-128, 127]) > >>> pa.compute.add(a, a) > array([ 0, -2]) > > Which is rather surprising from the user's perspective. Would it be enough to document the exact behavior and advise the user to place casts until we have logical -> physical machinery? I'm updating my PR as discussed. > > [1] https://issues.apache.org/jira/browse/ARROW-8919 > > > > > Clickhouse for example does the type promotion. > > > > > > On Wed, Jun 3, 2020 at 5:29 PM Antoine Pitrou wrote: > > > > > > > > On Wed, 3 Jun 2020 10:47:38 -0400 > > > > Ben Kietzman wrote: > > > > > https://github.com/apache/arrow/pull/7341#issuecomment-638241193 > > > > > > > > > > How should arithmetic kernels handle integer overflow?
> > > > > > > > > > The approach currently taken in the linked PR is to promote such that > > > > > overflow will not occur, for example `(int8, int8)->int16` and > > > > > `(uint16, > > > > > uint16)->uint32`. > > > > > > > > > > I'm not sure that's desirable. For one thing this leads to > > > > > inconsistent > > > > > handling of 64 bit integer types, which are currently allowed to > > > > > overflow > > > > > since we cannot promote further (NB: that means this kernel includes > > > > > undefined behavior for int64). > > > > > > > > I agree with you. I would strongly advise against implicit promotion > > > > accross arithmetic operations. We initially did that in Numba and it > > > > quickly became a can of worms. > > > > > > > > The most desirable behaviour IMHO is to keep the original type, so: > > > > - (int8, int8) -> int8 > > > > - (uint16, uint16) -> uint16 > > > > > > > > Then the question is what happens when the actual overflow occurs. I > > > > think this should be directed by a kernel option. By default an error > > > > should probably be raised (letting errors pass and silently produce > > > > erroneous data is wrong), but we might want to allow people to bypass > > > > overflow checks for speed. > > > > > > > > Note that even if overflow detection is enabled, it *should* be possible > > > > to enable vectorization, e.g. by making overflow detection a separate > > > > pass (itself vectorizable). > > > > > > > > Regards > > > > > > > > Antoine. > > > > > > > >
Re: [DISCUSS] Add kernel integer overflow handling
On Wed, Jun 3, 2020 at 5:52 PM Wes McKinney wrote: > > On Wed, Jun 3, 2020 at 10:49 AM Krisztián Szűcs > wrote: > > > > From the user perspective I find the following pretty confusing: > > > > In [1]: np.array([-128, 127], dtype=np.int8()) * 2 > > Out[1]: array([ 0, -2], dtype=int8) > > > > In [2]: np.array([-128, 127], dtype=np.int16()) * 2 > > Out[2]: array([-256, 254], dtype=int16) > > > > In my opinion somewhere (on a higher level maybe) we should provide > > the correct results promoted to a wider type implicitly. > > Yes, I agree with you, but I agree that the best place to address this > is at a higher level rather than having this logic implemented at the > lowest level (kernels) -- I think database systems handle this during > logical->physical planning. It raises another question: where to incorporate the implicit promotions? // correct me if I'm wrong but these implicit promotions are operation // dependent and distinct from kernel dispatching issue [1] The numpy example above can be roughly translated to: >>> a = pa.array([-128, 127]) >>> pa.compute.add(a, a) array([ 0, -2] Which is rather surprising from the user's perspective. [1] https://issues.apache.org/jira/browse/ARROW-8919 > > > Clickhouse for example does the type promotion. > > > > On Wed, Jun 3, 2020 at 5:29 PM Antoine Pitrou wrote: > > > > > > On Wed, 3 Jun 2020 10:47:38 -0400 > > > Ben Kietzman wrote: > > > > https://github.com/apache/arrow/pull/7341#issuecomment-638241193 > > > > > > > > How should arithmetic kernels handle integer overflow? > > > > > > > > The approach currently taken in the linked PR is to promote such that > > > > overflow will not occur, for example `(int8, int8)->int16` and `(uint16, > > > > uint16)->uint32`. > > > > > > > > I'm not sure that's desirable. 
For one thing this leads to inconsistent > > > > handling of 64 bit integer types, which are currently allowed to > > > > overflow > > > > since we cannot promote further (NB: that means this kernel includes > > > > undefined behavior for int64). > > > > > > I agree with you. I would strongly advise against implicit promotion > > > accross arithmetic operations. We initially did that in Numba and it > > > quickly became a can of worms. > > > > > > The most desirable behaviour IMHO is to keep the original type, so: > > > - (int8, int8) -> int8 > > > - (uint16, uint16) -> uint16 > > > > > > Then the question is what happens when the actual overflow occurs. I > > > think this should be directed by a kernel option. By default an error > > > should probably be raised (letting errors pass and silently produce > > > erroneous data is wrong), but we might want to allow people to bypass > > > overflow checks for speed. > > > > > > Note that even if overflow detection is enabled, it *should* be possible > > > to enable vectorization, e.g. by making overflow detection a separate > > > pass (itself vectorizable). > > > > > > Regards > > > > > > Antoine. > > > > > >
[jira] [Created] (ARROW-9029) [C++] Implement BitmapScanner interface to accelerate processing of mostly-not-null data
Wes McKinney created ARROW-9029: --- Summary: [C++] Implement BitmapScanner interface to accelerate processing of mostly-not-null data Key: ARROW-9029 URL: https://issues.apache.org/jira/browse/ARROW-9029 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 1.0.0 In analytics, it is common for data to be all not-null or mostly not-null. Data with > 50% nulls tends to be more exceptional. In this light, our {{BitmapReader}} class, which allows iteration over each bit in a bitmap, can be wasteful for mostly-set validity bitmaps. I propose instead a new interface for use in kernel implementations, for lack of a better term {{BitmapScanner}}. This works as follows: * Uses popcount to accumulate consecutive 64-bit words from a bitmap where all values are set, up to some limit (e.g. anywhere from 8 to 128 words -- we can use benchmarks to determine what is a good limit). The length of this "all-on" run is returned to the caller in a single function call, so that this "run" of data can be processed without any bit-by-bit bitmap checking * If a word containing unset bits is encountered, the scanner will similarly accumulate non-full words until the next full word is encountered or a limit is hit. The length of this "has nulls" run is returned to the caller, which then proceeds bit-by-bit to process the data For data with a lot of nulls, this may degrade performance somewhat, but probably not that much empirically. However, data that is mostly not-null should benefit from this. This BitmapScanner utility can probably also be used to accelerate the implementation of Filter for mostly-not-null data
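[Editor's note] The two kinds of runs the issue describes can be modeled in a few lines of Python to make the proposal concrete. This is only an illustrative sketch (the real implementation would be C++, use popcount, and cap run lengths at a tuned limit); `scan_runs` is a hypothetical name, and `word == ALL_SET` stands in for `popcount(word) == 64`.

```python
ALL_SET = (1 << 64) - 1  # a 64-bit word whose bits are all set

def scan_runs(words):
    """Yield (all_valid, run_length_in_words) pairs over a validity
    bitmap viewed as 64-bit words. "all_valid" runs can be processed
    without per-bit checks; "has nulls" runs fall back to bit-by-bit."""
    i = 0
    n = len(words)
    while i < n:
        full = words[i] == ALL_SET
        j = i
        while j < n and (words[j] == ALL_SET) == full:
            j += 1
        yield full, j - i
        i = j

words = [ALL_SET, ALL_SET, 0x0FFF, ALL_SET]
print(list(scan_runs(words)))
# [(True, 2), (False, 1), (True, 1)]
```

For a bitmap that is entirely valid, the caller gets a single run covering all words, which is exactly the mostly-not-null fast path the issue targets.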
Re: [DISCUSS] Add kernel integer overflow handling
On Wed, Jun 3, 2020 at 10:49 AM Krisztián Szűcs wrote: > > From the user perspective I find the following pretty confusing: > > In [1]: np.array([-128, 127], dtype=np.int8()) * 2 > Out[1]: array([ 0, -2], dtype=int8) > > In [2]: np.array([-128, 127], dtype=np.int16()) * 2 > Out[2]: array([-256, 254], dtype=int16) > > In my opinion somewhere (on a higher level maybe) we should provide > the correct results promoted to a wider type implicitly. Yes, I agree with you, but I agree that the best place to address this is at a higher level rather than having this logic implemented at the lowest level (kernels) -- I think database systems handle this during logical->physical planning. > Clickhouse for example does the type promotion. > > On Wed, Jun 3, 2020 at 5:29 PM Antoine Pitrou wrote: > > > > On Wed, 3 Jun 2020 10:47:38 -0400 > > Ben Kietzman wrote: > > > https://github.com/apache/arrow/pull/7341#issuecomment-638241193 > > > > > > How should arithmetic kernels handle integer overflow? > > > > > > The approach currently taken in the linked PR is to promote such that > > > overflow will not occur, for example `(int8, int8)->int16` and `(uint16, > > > uint16)->uint32`. > > > > > > I'm not sure that's desirable. For one thing this leads to inconsistent > > > handling of 64 bit integer types, which are currently allowed to overflow > > > since we cannot promote further (NB: that means this kernel includes > > > undefined behavior for int64). > > > > I agree with you. I would strongly advise against implicit promotion > > accross arithmetic operations. We initially did that in Numba and it > > quickly became a can of worms. > > > > The most desirable behaviour IMHO is to keep the original type, so: > > - (int8, int8) -> int8 > > - (uint16, uint16) -> uint16 > > > > Then the question is what happens when the actual overflow occurs. I > > think this should be directed by a kernel option. 
By default an error > > should probably be raised (letting errors pass and silently produce > > erroneous data is wrong), but we might want to allow people to bypass > > overflow checks for speed. > > > > Note that even if overflow detection is enabled, it *should* be possible > > to enable vectorization, e.g. by making overflow detection a separate > > pass (itself vectorizable). > > > > Regards > > > > Antoine. > > > >
Re: [DISCUSS] Add kernel integer overflow handling
From the user perspective I find the following pretty confusing: In [1]: np.array([-128, 127], dtype=np.int8()) * 2 Out[1]: array([ 0, -2], dtype=int8) In [2]: np.array([-128, 127], dtype=np.int16()) * 2 Out[2]: array([-256, 254], dtype=int16) In my opinion somewhere (on a higher level maybe) we should provide the correct results promoted to a wider type implicitly. Clickhouse for example does the type promotion. On Wed, Jun 3, 2020 at 5:29 PM Antoine Pitrou wrote: > > On Wed, 3 Jun 2020 10:47:38 -0400 > Ben Kietzman wrote: > > https://github.com/apache/arrow/pull/7341#issuecomment-638241193 > > > > How should arithmetic kernels handle integer overflow? > > > > The approach currently taken in the linked PR is to promote such that > > overflow will not occur, for example `(int8, int8)->int16` and `(uint16, > > uint16)->uint32`. > > > > I'm not sure that's desirable. For one thing this leads to inconsistent > > handling of 64 bit integer types, which are currently allowed to overflow > > since we cannot promote further (NB: that means this kernel includes > > undefined behavior for int64). > > I agree with you. I would strongly advise against implicit promotion > across arithmetic operations. We initially did that in Numba and it > quickly became a can of worms. > > The most desirable behaviour IMHO is to keep the original type, so: > - (int8, int8) -> int8 > - (uint16, uint16) -> uint16 > > Then the question is what happens when the actual overflow occurs. I > think this should be directed by a kernel option. By default an error > should probably be raised (letting errors pass and silently produce > erroneous data is wrong), but we might want to allow people to bypass > overflow checks for speed. > > Note that even if overflow detection is enabled, it *should* be possible > to enable vectorization, e.g. by making overflow detection a separate > pass (itself vectorizable). > > Regards > > Antoine. > >
Re: [DISCUSS] Add kernel integer overflow handling
On Wed, Jun 3, 2020 at 10:44 AM Wes McKinney wrote: > > > By default an error should probably be raised > > I would very strongly recommend keeping the behavior consistent with > that of analytic DBMSes. I don't think that most analytic DBMS error > on overflows because it's too computationally expensive to check. > NumPy doesn't error (by default at least) when you have overflows > either: > > >>> arr = np.array([128, 128, 128], dtype='i1') > >>> arr + arr > array([0, 0, 0], dtype=int8) *facepalm* >>> arr = np.array([127, 127, 127], dtype='i1') >>> arr + arr array([-2, -2, -2], dtype=int8) >>> > > On Wed, Jun 3, 2020 at 10:29 AM Antoine Pitrou wrote: > > > > On Wed, 3 Jun 2020 10:47:38 -0400 > > Ben Kietzman wrote: > > > https://github.com/apache/arrow/pull/7341#issuecomment-638241193 > > > > > > How should arithmetic kernels handle integer overflow? > > > > > > The approach currently taken in the linked PR is to promote such that > > > overflow will not occur, for example `(int8, int8)->int16` and `(uint16, > > > uint16)->uint32`. > > > > > > I'm not sure that's desirable. For one thing this leads to inconsistent > > > handling of 64 bit integer types, which are currently allowed to overflow > > > since we cannot promote further (NB: that means this kernel includes > > > undefined behavior for int64). > > > > I agree with you. I would strongly advise against implicit promotion > > accross arithmetic operations. We initially did that in Numba and it > > quickly became a can of worms. > > > > The most desirable behaviour IMHO is to keep the original type, so: > > - (int8, int8) -> int8 > > - (uint16, uint16) -> uint16 > > > > Then the question is what happens when the actual overflow occurs. I > > think this should be directed by a kernel option. By default an error > > should probably be raised (letting errors pass and silently produce > > erroneous data is wrong), but we might want to allow people to bypass > > overflow checks for speed. 
> > > > Note that even if overflow detection is enabled, it *should* be possible > > to enable vectorization, e.g. by making overflow detection a separate > > pass (itself vectorizable). > > > > Regards > > > > Antoine. > > > >
Re: [DISCUSS] Add kernel integer overflow handling
> By default an error should probably be raised I would very strongly recommend keeping the behavior consistent with that of analytic DBMSes. I don't think that most analytic DBMSes error on overflows because it's too computationally expensive to check. NumPy doesn't error (by default at least) when you have overflows either: >>> arr = np.array([128, 128, 128], dtype='i1') >>> arr + arr array([0, 0, 0], dtype=int8) On Wed, Jun 3, 2020 at 10:29 AM Antoine Pitrou wrote: > > On Wed, 3 Jun 2020 10:47:38 -0400 > Ben Kietzman wrote: > > https://github.com/apache/arrow/pull/7341#issuecomment-638241193 > > > > How should arithmetic kernels handle integer overflow? > > > > The approach currently taken in the linked PR is to promote such that > > overflow will not occur, for example `(int8, int8)->int16` and `(uint16, > > uint16)->uint32`. > > > > I'm not sure that's desirable. For one thing this leads to inconsistent > > handling of 64 bit integer types, which are currently allowed to overflow > > since we cannot promote further (NB: that means this kernel includes > > undefined behavior for int64). > > I agree with you. I would strongly advise against implicit promotion > across arithmetic operations. We initially did that in Numba and it > quickly became a can of worms. > > The most desirable behaviour IMHO is to keep the original type, so: > - (int8, int8) -> int8 > - (uint16, uint16) -> uint16 > > Then the question is what happens when the actual overflow occurs. I > think this should be directed by a kernel option. By default an error > should probably be raised (letting errors pass and silently produce > erroneous data is wrong), but we might want to allow people to bypass > overflow checks for speed. > > Note that even if overflow detection is enabled, it *should* be possible > to enable vectorization, e.g. by making overflow detection a separate > pass (itself vectorizable). > > Regards > > Antoine. > >
Re: [DISCUSS] Add kernel integer overflow handling
On Wed, 3 Jun 2020 10:47:38 -0400 Ben Kietzman wrote: > https://github.com/apache/arrow/pull/7341#issuecomment-638241193 > > How should arithmetic kernels handle integer overflow? > > The approach currently taken in the linked PR is to promote such that > overflow will not occur, for example `(int8, int8)->int16` and `(uint16, > uint16)->uint32`. > > I'm not sure that's desirable. For one thing this leads to inconsistent > handling of 64 bit integer types, which are currently allowed to overflow > since we cannot promote further (NB: that means this kernel includes > undefined behavior for int64). I agree with you. I would strongly advise against implicit promotion across arithmetic operations. We initially did that in Numba and it quickly became a can of worms. The most desirable behaviour IMHO is to keep the original type, so: - (int8, int8) -> int8 - (uint16, uint16) -> uint16 Then the question is what happens when the actual overflow occurs. I think this should be directed by a kernel option. By default an error should probably be raised (letting errors pass and silently produce erroneous data is wrong), but we might want to allow people to bypass overflow checks for speed. Note that even if overflow detection is enabled, it *should* be possible to enable vectorization, e.g. by making overflow detection a separate pass (itself vectorizable). Regards Antoine.
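[Editor's note] Antoine's proposal, keep the input type and make overflow handling a kernel option that defaults to raising, might look roughly like the following Python sketch. The names `add_int8` and `check_overflow` are illustrative, not Arrow's API.

```python
INT8_MIN, INT8_MAX = -128, 127

def wrap_int8(x):
    """Truncate to int8 with two's-complement wraparound."""
    return (x + 128) % 256 - 128

def add_int8(a, b, check_overflow=True):
    """Elementwise int8 + int8 -> int8. With check_overflow=True an
    out-of-range sum raises; with False it silently wraps around."""
    out = []
    for x, y in zip(a, b):
        s = x + y
        if check_overflow and not (INT8_MIN <= s <= INT8_MAX):
            raise OverflowError(f"int8 overflow: {x} + {y}")
        out.append(wrap_int8(s))
    return out

print(add_int8([1, 2], [3, 4]))                       # [4, 6]
print(add_int8([127], [127], check_overflow=False))   # [-2]
```

As Antoine notes, the check need not defeat vectorization: a real implementation could run a separate (itself vectorizable) overflow-detection pass and only then compute the unchecked sums.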
Re: [DISCUSS] Add kernel integer overflow handling
What do open source analytic database systems do? I don't think we should deviate from the behavior of these systems. For example, you can see that Apache Impala uses unsigned arithmetic on signed integers https://github.com/apache/impala/blob/5c69e7ba583297dc886652ac5952816882b928af/be/src/exprs/operators-ir.cc#L38 On Wed, Jun 3, 2020 at 9:47 AM Ben Kietzman wrote: > > https://github.com/apache/arrow/pull/7341#issuecomment-638241193 > > How should arithmetic kernels handle integer overflow? > > The approach currently taken in the linked PR is to promote such that > overflow will not occur, for example `(int8, int8)->int16` and `(uint16, > uint16)->uint32`. > > I'm not sure that's desirable. For one thing this leads to inconsistent > handling of 64 bit integer types, which are currently allowed to overflow > since we cannot promote further (NB: that means this kernel includes > undefined behavior for int64). > > There are a few other approaches we could take (ordered by personal > preference): > >- define explicit overflow behavior for signed integer operands (for >example if we declared that add(i8(a), i8(b)) will always be equivalent >to i8(i16(a) + i16(b)) then we could instantiate only unsigned addition >kernels) >- raise an error on signed overflow >- provide ArithmeticOptions::overflow_behavior and allow users to choose >between these >- require users to pass arguments which will not overflow
Re: Writing Parquet datasets using pyarrow.parquet.ParquetWriter
Hi Palak, The ParquetWriter class is meant to write a single parquet file (so the fact that you see only a single parquet file being written with the shown code is expected). If you want to write multiple files, you can either manually create multiple ParquetWriter instances (each with a different parquet file name), or you can use the `pq.write_to_dataset()` function, which can automatically partition your data into multiple files based on a column. But this function requires the full dataset in memory as a pandas dataframe or pyarrow table (so it is not compatible with the chunked csv reading). If you want to do it in chunks, it might be easier to use a higher-level package such as dask. Dask can read a csv file in chunks and write to parquet using pyarrow automatically (see e.g. https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv). It would look like:

import dask.dataframe as dd
df = dd.read_csv(..)
df.to_parquet(.., partition_on=['col'], engine="pyarrow")

Dask can also write a (common) metadata file for you. If you want to do this manually using pyarrow, you can take a look at the `parquet.write_metadata` function ( https://github.com/apache/arrow/blob/494e7a9c5714f3ed9e5590aeef8362114d5a3a46/python/pyarrow/parquet.py#L1748-L1783). This needs to be better documented (covered by https://issues.apache.org/jira/browse/ARROW-3154). Best, Joris On Sat, 30 May 2020 at 16:52, Palak Harwani wrote: > Hi, > I had a few questions regarding pyarrow.parquet. I want to write a Parquet > dataset which is partitioned according to one column.
I have a large csv file and I'm reading it in chunks using the following code:
>
> # csv_to_parquet.py
>
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> csv_file = '/path/to/my.tsv'
> parquet_file = '/path/to/my.parquet'
> chunksize = 100_000
>
> csv_stream = pd.read_csv(csv_file, sep='\t', chunksize=chunksize,
>                          low_memory=False)
> for i, chunk in enumerate(csv_stream):
>     print("Chunk", i)
>     if i == 0:
>         # Guess the schema of the CSV file from the first chunk
>         parquet_schema = pa.Table.from_pandas(df=chunk).schema
>         # Open a Parquet file for writing
>         parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema,
>                                           compression='snappy')
>     # Write CSV chunk to the parquet file
>     table = pa.Table.from_pandas(chunk, schema=parquet_schema)
>     parquet_writer.write_table(table)
>
> parquet_writer.close()
>
> But this code writes a single parquet file and I don't see any method in
> ParquetWriter to write to a dataset; it just has the write_table method.
> Is there a way to do this?
>
> Also how do I write the metadata file in the example mentioned above and
> the common metadata file, as well as the metadata files in case of a
> partitioned dataset?
>
> Thanks in advance.
>
> --
> *Regards,*
> *Palak Harwani*
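[Editor's note] The "multiple ParquetWriter instances" route Joris mentions amounts to keeping one lazily created writer per partition value while streaming chunks. Below is a stdlib-only sketch of that routing pattern, with csv writers and in-memory buffers standing in for `pq.ParquetWriter` and parquet files so it runs without pyarrow; `write_partitioned` is a hypothetical helper, not a pyarrow function.

```python
import csv
import io

def write_partitioned(chunks, partition_col):
    """Stream chunks of row dicts, routing each row to a writer that is
    created lazily the first time its partition value is seen. In real
    code the buffers would be parquet files and the writers would be
    pq.ParquetWriter instances sharing one schema."""
    writers = {}  # partition value -> csv.writer
    buffers = {}  # partition value -> in-memory "file"
    for chunk in chunks:
        for row in chunk:
            key = row[partition_col]
            if key not in writers:
                buffers[key] = io.StringIO()
                writers[key] = csv.writer(buffers[key])
            writers[key].writerow([v for k, v in sorted(row.items())])
    return {k: b.getvalue() for k, b in buffers.items()}

chunks = [
    [{"col": "a", "x": 1}, {"col": "b", "x": 2}],
    [{"col": "a", "x": 3}],
]
out = write_partitioned(chunks, "col")
print(sorted(out))  # ['a', 'b']
```

The key property is that no chunk is ever held beyond the current iteration, so memory use stays bounded by the chunk size rather than the dataset size, which is exactly what the chunked csv reading in the question needs.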
[DISCUSS] Add kernel integer overflow handling
https://github.com/apache/arrow/pull/7341#issuecomment-638241193

How should arithmetic kernels handle integer overflow? The approach currently taken in the linked PR is to promote such that overflow will not occur, for example `(int8, int8)->int16` and `(uint16, uint16)->uint32`. I'm not sure that's desirable. For one thing, this leads to inconsistent handling of the 64-bit integer types, which are currently allowed to overflow since we cannot promote further (NB: that means this kernel includes undefined behavior for int64). There are a few other approaches we could take (ordered by personal preference):

- define explicit overflow behavior for signed integer operands (for example, if we declared that add(i8(a), i8(b)) will always be equivalent to i8(i16(a) + i16(b)), then we could instantiate only unsigned addition kernels)
- raise an error on signed overflow
- provide ArithmeticOptions::overflow_behavior and allow users to choose between these
- require users to pass arguments which will not overflow
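The first option above, defining add(i8(a), i8(b)) as i8(i16(a) + i16(b)), amounts to two's-complement wraparound. A minimal sketch of that semantics in plain Python (the helper name `wrap_i8` is made up for illustration):

```python
def wrap_i8(x):
    """Wrap an arbitrary integer into int8 range, two's-complement style."""
    return (x + 128) % 256 - 128

# Compute in a wider type (plain Python ints never overflow), then wrap:
a, b = 120, 20
wide = a + b           # 140: representable in int16, but not in int8
print(wrap_i8(wide))   # wraps around to a negative int8 value
```

This gives every input pair a well-defined result, at the cost of silently wrapping sums that exceed the int8 range.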
Re: [DISCUSS] Adding "byteWidth" field to Decimal Flatbuffers type for forward compatibility
Sounds good to me. Regards Antoine. On Mon, 1 Jun 2020 17:47:38 -0500 Wes McKinney wrote: > I mentioned this on the recent sync call and opened > > https://issues.apache.org/jira/browse/ARROW-8985 > > I believe at some point that Arrow may need to be used to transport > decimal widths different from 128 bits. For example, systems like > Apache Kudu have 32-bit and 64-bit decimals. Computational code may > immediately promote small decimals, but it's valuable to be able to > transfer and represent the data as is rather than forcing an > up-promotion even for low-precision decimal data. > > In order to allow for this work to possibly happen in the future > without requiring a new value be added to the "Type" Flatbuffers > union, I propose to add a "byteWidth" field with default value 16 to > the existing Decimal type. Here is a patch with this change: > > https://github.com/apache/arrow/pull/7321 > > To make the forward compatibility issue clear: if this field is not > added now, then current library versions would not be able to perceive the > absence of the field, thus making it unsafe for future library > versions to annotate anything other than 16-byte decimals with this > metadata. > > As part of adopting this change, we would want to add assertions to > the existing libraries to check that the byteWidth is indeed 16, and > either throw an exception or pass through the data as > FixedSizeBinary otherwise. > > Thanks, > Wes >
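A hedged sketch of the forward-compatibility assertion described in the email above, in Python (the function name and error message are illustrative, not actual Arrow API):

```python
def check_decimal_byte_width(byte_width):
    """Accept only the 16-byte decimals that current readers support.

    An implementation could instead choose the pass-through behavior
    mentioned above, treating the values as FixedSizeBinary of this width.
    """
    if byte_width != 16:
        raise ValueError(
            "unsupported Decimal byteWidth: %d (only 16 is supported)" % byte_width
        )
    return byte_width

check_decimal_byte_width(16)  # the proposed default value is accepted
```

With the default of 16 baked into the schema, old data (which omits the field) and new 16-byte data read identically; only wider or narrower decimals trip the check.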
[jira] [Created] (ARROW-9028) [R] Should be able to convert an empty table
Francois Saint-Jacques created ARROW-9028: - Summary: [R] Should be able to convert an empty table Key: ARROW-9028 URL: https://issues.apache.org/jira/browse/ARROW-9028 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9027) [Python] Split in multiple files + clean-up pyarrow.parquet tests
Joris Van den Bossche created ARROW-9027: Summary: [Python] Split in multiple files + clean-up pyarrow.parquet tests Key: ARROW-9027 URL: https://issues.apache.org/jira/browse/ARROW-9027 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche The current {{test_parquet.py}} file is already above 4000 lines of code, and it is becoming a bit unwieldy to work with. Better structuring it, and maybe splitting it in multiple files, would help (separate test files could cover tests for basic reading/writing, tests for metadata/statistics objects, tests for multi-file datasets) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9026) [C++/Python] Force package removal from arrow-nightlies conda repository
Uwe Korn created ARROW-9026: --- Summary: [C++/Python] Force package removal from arrow-nightlies conda repository Key: ARROW-9026 URL: https://issues.apache.org/jira/browse/ARROW-9026 Project: Apache Arrow Issue Type: Bug Components: Packaging Reporter: Uwe Korn Assignee: Uwe Korn -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9025) Apache arrow fails to build with recent version of protobuf?
Keith Hughitt created ARROW-9025: Summary: Apache arrow fails to build with recent version of protobuf? Key: ARROW-9025 URL: https://issues.apache.org/jira/browse/ARROW-9025 Project: Apache Arrow Issue Type: Bug Affects Versions: 0.17.0, 0.2.0 Environment: - Arch Linux 5.6.15 (64-bit) - Protobuf 3.12.0 - cmake 3.17.3 Reporter: Keith Hughitt Greetings! I am running into errors when attempting to compile recent versions of Arrow (0.17.0 and nightly for 2020-06-02). I believe the issue may be related to a recent update to the protobuf library (version 3.12.0). Error: ``` {{... Scanning dependencies of target plasma-serialization-tests [ 41%] Building CXX object src/plasma/CMakeFiles/plasma-serialization-tests.dir/test/serialization_tests.cc.o In file included from /usr/include/string.h:495, from /home/keith/software/arrow-git/src/build/googletest_ep-prefix/include/gtest/internal/gtest-port.h:270, from /home/keith/software/arrow-git/src/build/googletest_ep-prefix/include/gtest/internal/gtest-internal.h:40, from /home/keith/software/arrow-git/src/build/googletest_ep-prefix/include/gtest/gtest.h:59, from /home/keith/software/arrow-git/src/arrow/cpp/src/plasma/test/serialization_tests.cc:23: In function ‘char* strncpy(char*, const char*, size_t)’, inlined from ‘int plasma::TestPlasmaSerialization::CreateTemporaryFile()’ at /home/keith/software/arrow-git/src/arrow/cpp/src/plasma/test/serialization_tests.cc:85:12: /usr/include/bits/string_fortified.h:106:34: warning: ‘char* __builtin_strncpy(char*, const char*, long unsigned int)’ specified bound 1024 equals destination size [-Wstringop-truncation] 106 | return __builtin___strncpy_chk (__dest, __src, __len, __bos (__dest)); | ^~ [ 41%] Linking CXX executable ../../release/plasma-serialization-tests /usr/bin/ld: ../../release/libarrow.so.18.0.0: undefined reference to `google::protobuf::internal::ComputeUnknownFieldsSize(google::protobuf::internal::InternalMetadataWithArena const&, unsigned long, 
google::protobuf::internal::CachedSize*)' /usr/bin/ld: ../../release/libarrow.so.18.0.0: undefined reference to `google::protobuf::internal::UnknownFieldParse(unsigned int, google::protobuf::internal::InternalMetadataWithArena*, char const*, google::protobuf::internal::ParseContext*)' /usr/bin/ld: ../../release/libarrow.so.18.0.0: undefined reference to `google::protobuf::internal::AssignDescriptors(google::protobuf::internal::DescriptorTable const*)' collect2: error: ld returned 1 exit status make[2]: *** [src/plasma/CMakeFiles/plasma-serialization-tests.dir/build.make:123: release/plasma-serialization-tests] Error 1 make[1]: *** [CMakeFiles/Makefile2:1516: src/plasma/CMakeFiles/plasma-serialization-tests.dir/all] Error 2 make: *** [Makefile:161: all] Error 2}} ``` The result is similar for the most recent nightly version of Arrow (2020-06-02). Just let me know if there is anything you would like me to try / any other information I can help provide. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9024) [C++/Python] Install anaconda-client in conda-clean job
Uwe Korn created ARROW-9024: --- Summary: [C++/Python] Install anaconda-client in conda-clean job Key: ARROW-9024 URL: https://issues.apache.org/jira/browse/ARROW-9024 Project: Apache Arrow Issue Type: Bug Components: Packaging Reporter: Uwe Korn Assignee: Uwe Korn -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9023) [C++] Use mimalloc conda package
Uwe Korn created ARROW-9023: --- Summary: [C++] Use mimalloc conda package Key: ARROW-9023 URL: https://issues.apache.org/jira/browse/ARROW-9023 Project: Apache Arrow Issue Type: Improvement Components: C++, Packaging Reporter: Uwe Korn Assignee: Uwe Korn -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9022) [C++][Compute] Make Add function safe for numeric limits
Krisztian Szucs created ARROW-9022: -- Summary: [C++][Compute] Make Add function safe for numeric limits Key: ARROW-9022 URL: https://issues.apache.org/jira/browse/ARROW-9022 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 1.0.0 Currently the output type of the Add function is identical to the argument types, which makes it unsafe to add values near the numeric limits, so instead of using the {{(int8, int8) -> int8}} signature we should use {{(int8, int8) -> int16}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
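The hazard this ticket describes can be demonstrated with a small Python sketch, using the stdlib ctypes module to emulate fixed-width integer storage (the values are illustrative):

```python
import ctypes

a, b = 120, 20        # both fit in int8, but their sum (140) does not

# Storing the sum back into int8 wraps around (two's complement):
wrapped = ctypes.c_int8(a + b).value

# Widening the output type to int16, as the ticket proposes, keeps the
# true sum representable:
widened = ctypes.c_int16(a + b).value

print(wrapped, widened)  # the int8 result is negative; the int16 one is 140
```

The widened signature trades extra memory in the output array for results that are always exact near the type's limits.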
[NIGHTLY] Arrow Build Report for Job nightly-2020-06-03-0
Arrow Build Report for Job nightly-2020-06-03-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0 Failed Tasks: - centos-8-aarch64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-travis-centos-8-aarch64 - conda-clean: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-azure-conda-clean - conda-linux-gcc-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-azure-conda-linux-gcc-py36 - conda-linux-gcc-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-azure-conda-linux-gcc-py37 - conda-linux-gcc-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-azure-conda-linux-gcc-py38 - conda-osx-clang-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-azure-conda-osx-clang-py36 - conda-osx-clang-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-azure-conda-osx-clang-py37 - conda-osx-clang-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-azure-conda-osx-clang-py38 - conda-win-vs2015-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-azure-conda-win-vs2015-py36 - conda-win-vs2015-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-azure-conda-win-vs2015-py37 - conda-win-vs2015-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-azure-conda-win-vs2015-py38 - homebrew-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-travis-homebrew-cpp - homebrew-r-autobrew: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-travis-homebrew-r-autobrew - test-conda-cpp-valgrind: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-github-test-conda-cpp-valgrind 
- test-conda-python-3.7-dask-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-github-test-conda-python-3.7-dask-latest - test-conda-python-3.7-spark-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-github-test-conda-python-3.7-spark-master - test-conda-python-3.8-dask-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-github-test-conda-python-3.8-dask-master - test-conda-python-3.8-jpype: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-github-test-conda-python-3.8-jpype - wheel-osx-cp38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-travis-wheel-osx-cp38 Succeeded Tasks: - centos-6-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-github-centos-6-amd64 - centos-7-aarch64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-travis-centos-7-aarch64 - centos-7-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-github-centos-7-amd64 - centos-8-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-github-centos-8-amd64 - debian-buster-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-github-debian-buster-amd64 - debian-buster-arm64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-travis-debian-buster-arm64 - debian-stretch-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-github-debian-stretch-amd64 - debian-stretch-arm64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-travis-debian-stretch-arm64 - gandiva-jar-osx: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-travis-gandiva-jar-osx - gandiva-jar-xenial: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-travis-gandiva-jar-xenial - nuget: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-github-nuget - test-conda-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-github-test-conda-cpp - test-conda-python-3.6-pandas-0.23: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-github-test-conda-python-3.6-pandas-0.23 - test-conda-python-3.6: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-github-test-conda-python-3.6 - test-conda-python-3.7-hdfs-2.9.2: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-github-test-conda-python-3.7-hdfs-2.9.2 - test-conda-python-3.7-kartothek-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-03-0-github-test-conda
[jira] [Created] (ARROW-9021) [Python] The filesystem keyword in parquet.read_table is not documented
Joris Van den Bossche created ARROW-9021: Summary: [Python] The filesystem keyword in parquet.read_table is not documented Key: ARROW-9021 URL: https://issues.apache.org/jira/browse/ARROW-9021 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche -- This message was sent by Atlassian Jira (v8.3.4#803005)