[jira] [Commented] (ARROW-2434) [Rust] Add windows support
[ https://issues.apache.org/jira/browse/ARROW-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431703#comment-16431703 ]

ASF GitHub Bot commented on ARROW-2434:
---------------------------------------

andygrove commented on issue #1873: ARROW-2434: [Rust] Add windows support
URL: https://github.com/apache/arrow/pull/1873#issuecomment-379973257

Hi @paddyhoran I tried to assign this to you in JIRA but couldn't find your username on there. I think you need to create yourself a JIRA account first and then you should be able to self-assign.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> [Rust] Add windows support
> --------------------------
>
> Key: ARROW-2434
> URL: https://issues.apache.org/jira/browse/ARROW-2434
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust
> Reporter: Paddy Horan
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.10.0
>
> Currently `cargo test` fails on Windows.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2434) [Rust] Add windows support
[ https://issues.apache.org/jira/browse/ARROW-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431640#comment-16431640 ]

ASF GitHub Bot commented on ARROW-2434:
---------------------------------------

paddyhoran commented on issue #1873: ARROW-2434: [Rust] Add windows support
URL: https://github.com/apache/arrow/pull/1873#issuecomment-379958446

ARROW-2436 will add CI for Windows.
[jira] [Created] (ARROW-2435) [Rust] Add memory pool abstraction.
Renjie Liu created ARROW-2435:
------------------------------

Summary: [Rust] Add memory pool abstraction.
Key: ARROW-2435
URL: https://issues.apache.org/jira/browse/ARROW-2435
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Affects Versions: 0.9.0
Reporter: Renjie Liu

Add a memory pool abstraction, matching the C++ API.
[jira] [Updated] (ARROW-2434) [Rust] Add windows support
[ https://issues.apache.org/jira/browse/ARROW-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2434:
----------------------------------
Labels: pull-request-available (was: )
[jira] [Commented] (ARROW-2434) [Rust] Add windows support
[ https://issues.apache.org/jira/browse/ARROW-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431621#comment-16431621 ]

ASF GitHub Bot commented on ARROW-2434:
---------------------------------------

paddyhoran opened a new pull request #1873: ARROW-2434: [Rust] Add windows support
URL: https://github.com/apache/arrow/pull/1873
[jira] [Created] (ARROW-2434) [Rust] Add windows support
Paddy Horan created ARROW-2434:
-------------------------------

Summary: [Rust] Add windows support
Key: ARROW-2434
URL: https://issues.apache.org/jira/browse/ARROW-2434
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Paddy Horan
Fix For: 0.10.0

Currently `cargo test` fails on Windows.
[jira] [Commented] (ARROW-2423) [Python] PyArrow datatypes raise ValueError on equality checks against non-PyArrow objects
[ https://issues.apache.org/jira/browse/ARROW-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431613#comment-16431613 ]

ASF GitHub Bot commented on ARROW-2423:
---------------------------------------

paddyhoran commented on issue #1871: ARROW-2423: [Rust] Add Builder.push_slice(&[T])
URL: https://github.com/apache/arrow/pull/1871#issuecomment-379952111

@andygrove just noticed that the JIRA for this one is 2433, not 2423.

> [Python] PyArrow datatypes raise ValueError on equality checks against
> non-PyArrow objects
> ----------------------------------------------------------------------
>
> Key: ARROW-2423
> URL: https://issues.apache.org/jira/browse/ARROW-2423
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Environment: Mac OS High Sierra
> PyArrow 0.9.0 (py36_1)
> Python 3.6.3
> Reporter: Dave Challis
> Priority: Minor
> Labels: pull-request-available
>
> Checking a PyArrow datatype object for equality with non-PyArrow datatypes
> causes a `ValueError` to be raised, rather than either returning a True/False
> value, or returning
> [NotImplemented|https://docs.python.org/3/library/constants.html#NotImplemented]
> if the comparison isn't implemented.
>
> E.g. attempting to call:
> {code:java}
> import pyarrow
> pyarrow.int32() == 'foo'
> {code}
> results in:
> {code:java}
> Traceback (most recent call last):
>   File "types.pxi", line 1221, in pyarrow.lib.type_for_alias
> KeyError: 'foo'
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>   File "t.py", line 2, in <module>
>     pyarrow.int32() == 'foo'
>   File "types.pxi", line 90, in pyarrow.lib.DataType.__richcmp__
>   File "types.pxi", line 113, in pyarrow.lib.DataType.equals
>   File "types.pxi", line 1223, in pyarrow.lib.type_for_alias
> ValueError: No type alias for foo
> {code}
> The expected outcome for the above would be for the comparison to return
> `False`, as that's the general behaviour for comparisons between objects of
> different types (e.g. `1 == 'foo'` or `object() == 12.4` both return `False`).
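The behaviour the reporter expects follows Python's standard rich-comparison protocol: return NotImplemented for operands of foreign types and let the interpreter fall back to the default comparison, which yields False. A minimal pure-Python sketch of that protocol (the DummyType class is hypothetical, not part of pyarrow):

```python
# Sketch of the expected comparison behaviour using a plain Python class.
# Returning NotImplemented instead of raising lets Python fall back to the
# default comparison, so `==` against a foreign object yields False rather
# than propagating an exception.
class DummyType:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if not isinstance(other, DummyType):
            # Defer to the other operand; if it also defers,
            # Python returns False instead of raising.
            return NotImplemented
        return self.name == other.name

    def __hash__(self):
        return hash(self.name)


print(DummyType("int32") == DummyType("int32"))  # True
print(DummyType("int32") == "foo")               # False, no ValueError
```

The same pattern applies to `!=`, which Python 3 derives from `__eq__` automatically.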
[jira] [Updated] (ARROW-2426) [CI] glib build failure
[ https://issues.apache.org/jira/browse/ARROW-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2426:
----------------------------------
Labels: pull-request-available (was: )

> [CI] glib build failure
> -----------------------
>
> Key: ARROW-2426
> URL: https://issues.apache.org/jira/browse/ARROW-2426
> Project: Apache Arrow
> Issue Type: Bug
> Components: Continuous Integration
> Reporter: Antoine Pitrou
> Priority: Major
> Labels: pull-request-available
>
> The glib build on Travis-CI fails:
> https://travis-ci.org/apache/arrow/jobs/364123364#L6840
> {code}
> ==> Installing gobject-introspection
> ==> Downloading https://homebrew.bintray.com/bottles/gobject-introspection-1.56.0_1.sierra.bottle.tar.gz
> ==> Pouring gobject-introspection-1.56.0_1.sierra.bottle.tar.gz
> /usr/local/Cellar/gobject-introspection/1.56.0_1: 173 files, 9.8MB
> Installing gobject-introspection has failed!
> {code}
[jira] [Commented] (ARROW-2433) [Rust] Add Builder.push_slice(&[T])
[ https://issues.apache.org/jira/browse/ARROW-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431510#comment-16431510 ]

Andy Grove commented on ARROW-2433:
-----------------------------------

PR: https://github.com/apache/arrow/pull/1871

> [Rust] Add Builder.push_slice(&[T])
> -----------------------------------
>
> Key: ARROW-2433
> URL: https://issues.apache.org/jira/browse/ARROW-2433
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Major
> Fix For: 0.10.0
>
> When populating a Builder with Utf8 data it is more efficient to push
> whole strings as &[u8] rather than one byte at a time.
> The same optimization works for all other types too.
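The idea behind push_slice can be sketched in Python, since the Rust Builder code itself isn't shown in this thread (both helper functions below are hypothetical illustrations, not Arrow APIs): extending a buffer with a whole slice replaces a per-byte loop with a single bulk copy.

```python
# Illustration of the batching optimization described in ARROW-2433:
# appending a whole encoded string at once vs. one byte at a time.
def push_bytes_one_at_a_time(buf: bytearray, s: str) -> None:
    for b in s.encode("utf-8"):
        buf.append(b)          # one call (and bounds check) per byte

def push_slice(buf: bytearray, s: str) -> None:
    buf.extend(s.encode("utf-8"))  # one bulk copy per string

# Both produce the same buffer contents; the slice version does far
# fewer calls, which is the claimed efficiency win for Utf8 data.
a, b = bytearray(), bytearray()
for word in ["hello", "arrow"]:
    push_bytes_one_at_a_time(a, word)
    push_slice(b, word)
assert a == b == bytearray(b"helloarrow")
```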
[jira] [Commented] (ARROW-2423) [Python] PyArrow datatypes raise ValueError on equality checks against non-PyArrow objects
[ https://issues.apache.org/jira/browse/ARROW-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431509#comment-16431509 ]

ASF GitHub Bot commented on ARROW-2423:
---------------------------------------

andygrove opened a new pull request #1871: ARROW-2423: [Rust] Add Builder.push_slice(&[T])
URL: https://github.com/apache/arrow/pull/1871

This PR also fixes another instance of memory not being released.
[jira] [Updated] (ARROW-2423) [Python] PyArrow datatypes raise ValueError on equality checks against non-PyArrow objects
[ https://issues.apache.org/jira/browse/ARROW-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2423:
----------------------------------
Labels: pull-request-available (was: )
[jira] [Created] (ARROW-2433) [Rust] Add Builder.push_slice(&[T])
Andy Grove created ARROW-2433:
------------------------------

Summary: [Rust] Add Builder.push_slice(&[T])
Key: ARROW-2433
URL: https://issues.apache.org/jira/browse/ARROW-2433
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
Fix For: 0.10.0

When populating a Builder with Utf8 data it is more efficient to push whole strings as &[u8] rather than one byte at a time. The same optimization works for all other types too.
[jira] [Commented] (ARROW-2387) negative decimal values get spurious rescaling error
[ https://issues.apache.org/jira/browse/ARROW-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431508#comment-16431508 ]

ASF GitHub Bot commented on ARROW-2387:
---------------------------------------

cpcloud commented on issue #1832: ARROW-2387: flip test for rescale loss if value < 0
URL: https://github.com/apache/arrow/pull/1832#issuecomment-379927866

@bwo Looks like this is failing for unrelated reasons. Can you rebase on top of master and push again? Then we can merge.

> negative decimal values get spurious rescaling error
> ----------------------------------------------------
>
> Key: ARROW-2387
> URL: https://issues.apache.org/jira/browse/ARROW-2387
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 0.9.0
> Reporter: ben w
> Assignee: Phillip Cloud
> Priority: Major
> Labels: pull-request-available
>
> {code:java}
> $ python
> Python 2.7.12 (default, Nov 20 2017, 18:23:56)
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow as pa, decimal
> >>> one = decimal.Decimal('1.00')
> >>> neg_one = decimal.Decimal('-1.00')
> >>> pa.array([one], pa.decimal128(24, 12))
> [
>   Decimal('1.')
> ]
> >>> pa.array([neg_one], pa.decimal128(24, 12))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "array.pxi", line 181, in pyarrow.lib.array
>   File "array.pxi", line 36, in pyarrow.lib._sequence_to_array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Rescaling decimal value -100.00 from
> original scale of 6 to new scale of 12 would cause data loss
> >>> pa.__version__
> '0.9.0'
> {code}
> Not only is the error spurious, the decimal value has been multiplied by one
> million (i.e. 10 ** 6, and 6 is the difference in scales, but this is still
> pretty strange to me).
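The PR title ("flip test for rescale loss if value < 0") suggests the loss check mishandled negative values. A sign-safe version of such a check can be sketched in pure Python on the integer representation of a decimal (this is an illustration of the idea, not Arrow's C++ implementation):

```python
# Sketch of a sign-safe "would rescaling lose data?" check on the integer
# representation of a decimal value. Increasing the scale multiplies by a
# power of ten (always lossless); decreasing it divides, and only a
# nonzero remainder means data loss. Using abs() keeps the remainder test
# correct for negative values such as -1.00.
def rescale_would_lose_data(value: int, old_scale: int, new_scale: int) -> bool:
    if new_scale >= old_scale:
        return False                      # multiplying by 10**k is lossless
    divisor = 10 ** (old_scale - new_scale)
    return abs(value) % divisor != 0      # abs() avoids sign pitfalls

# -1.00 at scale 2, rescaled to scale 12, is lossless (contrary to the
# spurious error in the report):
assert rescale_would_lose_data(-100, 2, 12) is False
# 3.14 at scale 2 cannot be represented at scale 1 without loss:
assert rescale_would_lose_data(314, 2, 1) is True
```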
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431450#comment-16431450 ]

Phillip Cloud commented on ARROW-2432:
--------------------------------------

[~bryanc] Awesome, thanks.

> [Python] from_pandas fails when converting decimals if have None values
> -----------------------------------------------------------------------
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Bryan Cutler
> Assignee: Bryan Cutler
> Priority: Major
>
> Using from_pandas to convert decimals fails if it encounters a value of
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
>    ...: import pandas as pd
>    ...: from decimal import Decimal
>    ...:
>
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
>
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> <module>()
> ----> 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
>
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
>
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying the decimal type. When no type is
> specified, a seg fault happens.
> This previously worked in 0.8.0.
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431438#comment-16431438 ]

Bryan Cutler commented on ARROW-2432:
-------------------------------------

It should be possible to share code paths when converting objects, right? I'd like to keep this to the minimum fix; let's look at possible refactoring after.

Thanks [~cpcloud], I already made the fix, just going to add tests.
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431398#comment-16431398 ]

Phillip Cloud commented on ARROW-2432:
--------------------------------------

[~pitrou] FWIW, the code conversion paths are not specific to decimal types and have been around since before decimals existed.

[~bryanc] If you're not already working on this, then I can probably get it fixed up pretty quickly.
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431341#comment-16431341 ]

Bryan Cutler commented on ARROW-2432:
-------------------------------------

We really need to get the integration testing running regularly, or at least before a release.
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431331#comment-16431331 ]

Antoine Pitrou commented on ARROW-2432:
---------------------------------------

Ow. For some reason it seems we have various code conversion paths depending on which API is called :-/

{code:python}
>>> data = [decimal.Decimal('3.14'), None]
>>> pa.array(data, type=pa.decimal128(12, 4))
[
  Decimal('3.1400'),
  NA
]
>>> pa.array(data, type=pa.decimal128(12, 4), from_pandas=True)
[
  Decimal('3.1400'),
  NA
]
>>> pa.Array.from_pandas(data, type=pa.decimal128(12, 4))
[
  Decimal('3.1400'),
  NA
]
>>> pa.Array.from_pandas(pd.Series(data), type=pa.decimal128(12, 4))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    pa.Array.from_pandas(pd.Series(data), type=pa.decimal128(12, 4))
  File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
  File "array.pxi", line 177, in pyarrow.lib.array
  File "error.pxi", line 77, in pyarrow.lib.check_status
  File "error.pxi", line 77, in pyarrow.lib.check_status
ArrowInvalid: /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1702 code: converter.Convert()
Error converting from Python objects to Decimal: Got Python object of type NoneType but can only handle these types: decimal.Decimal
{code}
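The minimal fix Bryan describes amounts to accepting None as a null before type-checking the remaining values, rather than rejecting NoneType outright as the failing path does. A pure-Python sketch of that idea (convert_decimals is a hypothetical helper, not pyarrow code):

```python
from decimal import Decimal

# Hypothetical sketch of the missing null check: map None to a null
# marker first, and only type-check the concrete values. The failing
# pyarrow 0.9.0 path instead raised on NoneType before considering nulls.
def convert_decimals(values):
    out = []
    for v in values:
        if v is None:
            out.append(None)              # nulls pass through unchanged
        elif isinstance(v, Decimal):
            out.append(v)
        else:
            raise TypeError(
                f"can only handle decimal.Decimal, got {type(v).__name__}"
            )
    return out

assert convert_decimals([Decimal("3.14"), None]) == [Decimal("3.14"), None]
```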
[jira] [Updated] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Cutler updated ARROW-2432:
--------------------------------
Description:
Using from_pandas to convert decimals fails if it encounters a value of {{None}}. For example:

{code:java}
In [1]: import pyarrow as pa
   ...: import pandas as pd
   ...: from decimal import Decimal
   ...:

In [2]: s_dec = pd.Series([Decimal('3.14'), None])

In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<module>()
----> 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))

array.pxi in pyarrow.lib.Array.from_pandas()
array.pxi in pyarrow.lib.array()
error.pxi in pyarrow.lib.check_status()
error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Error converting from Python objects to Decimal: Got Python object of type NoneType but can only handle these types: decimal.Decimal
{code}

The above error is raised when specifying the decimal type. When no type is specified, a seg fault happens.

This previously worked in 0.8.0.

was: (the same description, with the example additionally showing the series contents)

{code:java}
In [4]: s_dec
Out[4]:
0    3.14
1    None
dtype: object
{code}
[jira] [Updated] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Cutler updated ARROW-2432:
--------------------------------
Summary: [Python] from_pandas fails when converting decimals if have None values (was: [Python] from_pandas fails when converting decimals if contain None)
[jira] [Created] (ARROW-2432) [Python] from_pandas fails when converting decimals if contain None
Bryan Cutler created ARROW-2432:
--------------------------------

Summary: [Python] from_pandas fails when converting decimals if contain None
Key: ARROW-2432
URL: https://issues.apache.org/jira/browse/ARROW-2432
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.9.0
Reporter: Bryan Cutler

Using from_pandas to convert decimals fails if it encounters a value of {{None}}. For example:

{code:java}
In [1]: import pyarrow as pa
   ...: import pandas as pd
   ...: from decimal import Decimal
   ...:

In [2]: s_dec = pd.Series([Decimal('3.14'), None])

In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<module>()
----> 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))

array.pxi in pyarrow.lib.Array.from_pandas()
array.pxi in pyarrow.lib.array()
error.pxi in pyarrow.lib.check_status()
error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Error converting from Python objects to Decimal: Got Python object of type NoneType but can only handle these types: decimal.Decimal

In [4]: s_dec
Out[4]:
0    3.14
1    None
dtype: object
{code}

The above error is raised when specifying the decimal type. When no type is specified, a seg fault happens.

This previously worked in 0.8.0.
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if contain None
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431276#comment-16431276 ] Bryan Cutler commented on ARROW-2432:
I can work on this

> [Python] from_pandas fails when converting decimals if contain None
> -------------------------------------------------------------------
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Bryan Cutler
> Priority: Major
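The eventual converter has to treat {{None}} as a null slot instead of handing it to the Decimal handler. A stdlib-only sketch of the values-plus-validity split that Arrow-style converters perform (the function name is hypothetical, not pyarrow API):

```python
from decimal import Decimal

def split_values_and_mask(seq, placeholder=Decimal(0)):
    # Nulls keep a throwaway placeholder value and a cleared validity bit;
    # only entries whose validity bit is set are ever interpreted.
    values, validity = [], []
    for item in seq:
        if item is None:
            values.append(placeholder)
            validity.append(False)
        else:
            values.append(item)
            validity.append(True)
    return values, validity

values, validity = split_values_and_mask([Decimal("3.14"), None])
# values   -> [Decimal('3.14'), Decimal('0')]
# validity -> [True, False]
```

This mirrors how a fixed converter can accept the series above without feeding NoneType to the decimal path.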
[jira] [Assigned] (ARROW-1938) [Python] Error writing to partitioned Parquet dataset
[ https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Phillip Cloud reassigned ARROW-1938:
Assignee: (was: Phillip Cloud)

> [Python] Error writing to partitioned Parquet dataset
> -----------------------------------------------------
>
> Key: ARROW-1938
> URL: https://issues.apache.org/jira/browse/ARROW-1938
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.8.0
> Environment: Linux (Ubuntu 16.04)
> Reporter: Robert Dailey
> Priority: Major
> Fix For: 0.10.0
>
> Attachments: ARROW-1938-test-data.csv.gz, ARROW-1938.py, pyarrow_dataset_error.png
>
> I receive the following error after upgrading to pyarrow 0.8.0 when writing to a dataset:
> * ArrowIOError: Column 3 had 187374 while previous column had 1
> The command was:
> write_table_values = {'row_group_size': 1}
> pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=True), '/logs/parsed/test', partition_cols=['Product', 'year', 'month', 'day', 'hour'], **write_table_values)
> I've also tried write_table_values = {'chunk_size': 1} and received the same error.
> This same command works in version 0.7.1. I am trying to troubleshoot the problem but wanted to submit a ticket.
[jira] [Commented] (ARROW-2391) [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
[ https://issues.apache.org/jira/browse/ARROW-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431175#comment-16431175 ] ASF GitHub Bot commented on ARROW-2391:
kszucs commented on issue #1859: ARROW-2391: [C++/Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
URL: https://github.com/apache/arrow/pull/1859#issuecomment-379883273
My pleasure!

> [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
> ----------------------------------------------------------------------------------------------
>
> Key: ARROW-2391
> URL: https://issues.apache.org/jira/browse/ARROW-2391
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Environment: Mac OS High Sierra, Python 3.6
> Reporter: Dave Challis
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.10.0
>
> When trying to call `pyarrow.Table.from_pandas` with a `pandas.DataFrame` and a `pyarrow.Schema` provided, the function call results in a segmentation fault if a Pandas `datetime64[ns]` column is converted to a `pyarrow.date64` type.
> A minimal example which shows this is:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'created': ['2018-05-10T10:24:01']})
> df['created'] = pd.to_datetime(df['created'])
> schema = pa.schema([pa.field('created', pa.date64())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
> Executing the above causes the python interpreter to exit with "Segmentation fault: 11".
> Attempting to convert into various other datatypes (by specifying different schemas) either succeeds, or raises an exception if the conversion is invalid.
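For context on why the `10:24:01` timestamp matters: `date64` stores milliseconds since the Unix epoch and only midnight-aligned values are legal, so any intraday time must be dropped or rejected before the cast. A stdlib-only illustration (helper name hypothetical):

```python
from datetime import date

MS_PER_DAY = 86_400_000
EPOCH = date(1970, 1, 1)

def date64_millis(d):
    # Milliseconds since the epoch for midnight of day `d`,
    # the only kind of value a date64 column may legally hold.
    return (d - EPOCH).days * MS_PER_DAY

print(date64_millis(date(2018, 5, 10)))  # 1525910400000, divisible by MS_PER_DAY
```

A timestamp like 2018-05-10T10:24:01 is not a multiple of `MS_PER_DAY`, which is exactly the "non-zero intraday milliseconds" condition the cast kernel checks for.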
[jira] [Commented] (ARROW-2391) [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
[ https://issues.apache.org/jira/browse/ARROW-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431167#comment-16431167 ] ASF GitHub Bot commented on ARROW-2391:
pitrou commented on issue #1859: ARROW-2391: [C++/Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
URL: https://github.com/apache/arrow/pull/1859#issuecomment-379882116
Thank you @kszucs !

> [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
> ----------------------------------------------------------------------------------------------
>
> Key: ARROW-2391
> URL: https://issues.apache.org/jira/browse/ARROW-2391
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Dave Challis
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.10.0
[jira] [Comment Edited] (ARROW-2430) MVP for branch based packaging automation
[ https://issues.apache.org/jira/browse/ARROW-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431124#comment-16431124 ] Krisztian Szucs edited comment on ARROW-2430 at 4/9/18 8:00 PM:
Additional TODO notes:
- write readme
- create a docker container with the dependencies pre-installed
- note about turning off the auto-cancellation feature of CI servers
- set up deployments + conda deploy script
- consult about flattening the builds (remove build matrices)
- format commit message

was (Author: kszucs):
Additional TODO notes:
- write readme
- create a docker container with the dependencies pre-installed
- note about turning off the auto-cancellation feature of CI servers
- set up deployments + conda deploy script
- consult about flattening the builds (remove build matrices)

> MVP for branch based packaging automation
> -----------------------------------------
>
> Key: ARROW-2430
> URL: https://issues.apache.org/jira/browse/ARROW-2430
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Packaging
> Reporter: Krisztian Szucs
> Priority: Major
> Labels: pull-request-available
>
> Described in https://docs.google.com/document/d/1IyhbQpiElxTsI8HbMZ-g9EGPOtcFdtMBzEyDJv48BKc/edit
[jira] [Commented] (ARROW-2391) [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
[ https://issues.apache.org/jira/browse/ARROW-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431128#comment-16431128 ] ASF GitHub Bot commented on ARROW-2391:
pitrou closed pull request #1859: ARROW-2391: [C++/Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
URL: https://github.com/apache/arrow/pull/1859
This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/cpp/src/arrow/compute/kernels/cast.cc b/cpp/src/arrow/compute/kernels/cast.cc
index eaebd7cef..bfd519d18 100644
--- a/cpp/src/arrow/compute/kernels/cast.cc
+++ b/cpp/src/arrow/compute/kernels/cast.cc
@@ -396,21 +396,34 @@ struct CastFunctor {
   ShiftTime(ctx, options, conversion.first, conversion.second, input, output);

-  internal::BitmapReader bit_reader(input.buffers[0]->data(), input.offset,
-                                    input.length);
-
   // Ensure that intraday milliseconds have been zeroed out
   auto out_data = GetMutableValues(output, 1);
-  for (int64_t i = 0; i < input.length; ++i) {
-    const int64_t remainder = out_data[i] % kMillisecondsInDay;
-    if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && bit_reader.IsSet() &&
-                            remainder > 0)) {
-      ctx->SetStatus(
-          Status::Invalid("Timestamp value had non-zero intraday milliseconds"));
-      break;
+
+  if (input.null_count != 0) {
+    internal::BitmapReader bit_reader(input.buffers[0]->data(), input.offset,
+                                      input.length);
+
+    for (int64_t i = 0; i < input.length; ++i) {
+      const int64_t remainder = out_data[i] % kMillisecondsInDay;
+      if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && bit_reader.IsSet() &&
+                              remainder > 0)) {
+        ctx->SetStatus(
+            Status::Invalid("Timestamp value had non-zero intraday milliseconds"));
+        break;
+      }
+      out_data[i] -= remainder;
+      bit_reader.Next();
+    }
+  } else {
+    for (int64_t i = 0; i < input.length; ++i) {
+      const int64_t remainder = out_data[i] % kMillisecondsInDay;
+      if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && remainder > 0)) {
+        ctx->SetStatus(
+            Status::Invalid("Timestamp value had non-zero intraday milliseconds"));
+        break;
+      }
+      out_data[i] -= remainder;
     }
-    out_data[i] -= remainder;
-    bit_reader.Next();
   }
 }
};
diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py
index c6e2b75be..de6120176 100644
--- a/python/pyarrow/tests/test_convert_pandas.py
+++ b/python/pyarrow/tests/test_convert_pandas.py
@@ -807,6 +807,44 @@ def test_datetime64_to_date32(self):
         assert arr2.equals(arr.cast('date32'))

+    @pytest.mark.parametrize('mask', [
+        None,
+        np.ones(3),
+        np.array([True, False, False]),
+    ])
+    def test_pandas_datetime_to_date64(self, mask):
+        s = pd.to_datetime([
+            '2018-05-10T00:00:00',
+            '2018-05-11T00:00:00',
+            '2018-05-12T00:00:00',
+        ])
+        arr = pa.Array.from_pandas(s, type=pa.date64(), mask=mask)
+
+        data = np.array([
+            date(2018, 5, 10),
+            date(2018, 5, 11),
+            date(2018, 5, 12)
+        ])
+        expected = pa.array(data, mask=mask, type=pa.date64())
+
+        assert arr.equals(expected)
+
+    @pytest.mark.parametrize('mask', [
+        None,
+        np.ones(3),
+        np.array([True, False, False])
+    ])
+    def test_pandas_datetime_to_date64_failures(self, mask):
+        s = pd.to_datetime([
+            '2018-05-10T10:24:01',
+            '2018-05-11T10:24:01',
+            '2018-05-12T10:24:01',
+        ])
+
+        expected_msg = 'Timestamp value had non-zero intraday milliseconds'
+        with pytest.raises(pa.ArrowInvalid, msg=expected_msg):
+            pa.Array.from_pandas(s, type=pa.date64(), mask=mask)
+
     def test_date_infer(self):
         df = pd.DataFrame({
             'date': [date(2000, 1, 1),

> [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
>
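Restated in Python for readability (a sketch of the kernel's logic, not part of the patch): the pre-fix code always read the validity bitmap, which does not exist when the array has no nulls; the fix only reads it when `null_count != 0`, and null slots no longer trigger the error:

```python
MS_PER_DAY = 86_400_000  # kMillisecondsInDay

def zero_intraday_millis(ms_values, valid=None, allow_time_truncate=False):
    # `valid` models the validity bitmap; None means the array has no nulls,
    # mirroring the null_count != 0 branch in the C++ kernel.
    out = []
    for i, ms in enumerate(ms_values):
        remainder = ms % MS_PER_DAY
        is_valid = valid[i] if valid is not None else True
        if is_valid and remainder and not allow_time_truncate:
            raise ValueError("Timestamp value had non-zero intraday milliseconds")
        out.append(ms - remainder)  # null slots are truncated too; their value is never read
    return out
```

With a cleared validity bit the intraday check is skipped, which is why the new tests pass a `mask` of `[True, False, False]`.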
[jira] [Resolved] (ARROW-2391) [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
[ https://issues.apache.org/jira/browse/ARROW-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-2391.
Resolution: Fixed
Fix Version/s: 0.10.0
Issue resolved by pull request 1859: [https://github.com/apache/arrow/pull/1859]

> [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
> ----------------------------------------------------------------------------------------------
>
> Key: ARROW-2391
> URL: https://issues.apache.org/jira/browse/ARROW-2391
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Dave Challis
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.10.0
[jira] [Commented] (ARROW-2430) MVP for branch based packaging automation
[ https://issues.apache.org/jira/browse/ARROW-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431126#comment-16431126 ] ASF GitHub Bot commented on ARROW-2430:
kszucs opened a new pull request #1869: ARROW-2430: [Packaging] MVP for branch based packaging automation
URL: https://github.com/apache/arrow/pull/1869

> MVP for branch based packaging automation
> -----------------------------------------
>
> Key: ARROW-2430
> URL: https://issues.apache.org/jira/browse/ARROW-2430
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Packaging
> Reporter: Krisztian Szucs
> Priority: Major
> Labels: pull-request-available
>
> Described in https://docs.google.com/document/d/1IyhbQpiElxTsI8HbMZ-g9EGPOtcFdtMBzEyDJv48BKc/edit
[jira] [Updated] (ARROW-2430) MVP for branch based packaging automation
[ https://issues.apache.org/jira/browse/ARROW-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2430: -- Labels: pull-request-available (was: ) > MVP for branch based packaging automation > - > > Key: ARROW-2430 > URL: https://issues.apache.org/jira/browse/ARROW-2430 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging >Reporter: Krisztian Szucs >Priority: Major > Labels: pull-request-available > > Described in > https://docs.google.com/document/d/1IyhbQpiElxTsI8HbMZ-g9EGPOtcFdtMBzEyDJv48BKc/edit -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2430) MVP for branch based packaging automation
[ https://issues.apache.org/jira/browse/ARROW-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431124#comment-16431124 ] Krisztian Szucs commented on ARROW-2430:
Additional TODOs:
- write readme
- create a docker container with the dependencies pre-installed
- note about turning off the auto-cancellation feature of CI servers

> MVP for branch based packaging automation
> -----------------------------------------
>
> Key: ARROW-2430
> URL: https://issues.apache.org/jira/browse/ARROW-2430
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Packaging
> Reporter: Krisztian Szucs
> Priority: Major
>
> Described in https://docs.google.com/document/d/1IyhbQpiElxTsI8HbMZ-g9EGPOtcFdtMBzEyDJv48BKc/edit
[jira] [Created] (ARROW-2431) [Rust] Schema fidelity
Maximilian Roos created ARROW-2431:
Summary: [Rust] Schema fidelity
Key: ARROW-2431
URL: https://issues.apache.org/jira/browse/ARROW-2431
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Maximilian Roos

ref [https://github.com/apache/arrow/pull/1829#discussion_r179248743]

Currently our traits are not faithful to [https://arrow.apache.org/docs/metadata.html]. For example, we nest `Field`s in the `DataType` (aka `type`) attribute of the parent `Field`, rather than having the type be `Struct` with a separate `Children` parameter.

Is this OK, assuming that we can read and write accurate schemas? Or should we move towards having the `Schema` trait be consistent with the metadata spec?
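To make the two layouts concrete, here is a language-neutral sketch using Python dicts (purely illustrative; neither form is actual Rust or pyarrow API). The current Rust code nests children inside the type itself, while the metadata spec keeps `type` flat and puts the children beside it:

```python
# Current Rust-style layout (roughly): child fields nested inside the DataType.
nested_form = {
    "name": "person",
    "type": {"Struct": [{"name": "age", "type": "Int32"}]},
}

# Metadata-spec layout: type is plain "Struct"; children are a sibling list.
spec_form = {
    "name": "person",
    "type": "Struct",
    "children": [{"name": "age", "type": "Int32", "children": []}],
}

def child_names(field):
    # Only the spec-faithful form exposes children uniformly at the field level.
    return [c["name"] for c in field.get("children", [])]
```

The spec-faithful form lets generic schema-walking code treat every field the same way, without pattern-matching on each type variant.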
[jira] [Created] (ARROW-2430) MVP for branch based packaging automation
Krisztian Szucs created ARROW-2430:
Summary: MVP for branch based packaging automation
Key: ARROW-2430
URL: https://issues.apache.org/jira/browse/ARROW-2430
Project: Apache Arrow
Issue Type: Improvement
Components: Packaging
Reporter: Krisztian Szucs

Described in https://docs.google.com/document/d/1IyhbQpiElxTsI8HbMZ-g9EGPOtcFdtMBzEyDJv48BKc/edit
[jira] [Commented] (ARROW-1780) JDBC Adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/ARROW-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431103#comment-16431103 ] ASF GitHub Bot commented on ARROW-1780: --- atuldambalkar commented on a change in pull request #1759: ARROW-1780 - [WIP] JDBC Adapter to convert Relational Data objects to Arrow Data Format Vector Objects URL: https://github.com/apache/arrow/pull/1759#discussion_r180205035 ## File path: java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrowUtils.java ## @@ -0,0 +1,343 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.arrow.adapter.jdbc; + +import org.apache.arrow.vector.*; +import org.apache.arrow.vector.types.DateUnit; +import org.apache.arrow.vector.types.TimeUnit; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.FieldType; +import org.apache.arrow.vector.types.pojo.Schema; + +import java.nio.charset.Charset; +import java.sql.*; +import java.util.ArrayList; +import java.util.List; + +import static org.apache.arrow.vector.types.FloatingPointPrecision.DOUBLE; +import static org.apache.arrow.vector.types.FloatingPointPrecision.SINGLE; + + +/** + * Class that does most of the work to convert JDBC ResultSet data into Arrow columnar format Vector objects. + * + * @since 0.10.0 + */ +public class JdbcToArrowUtils { + +private static final int DEFAULT_BUFFER_SIZE = 256; + +/** + * Create Arrow {@link Schema} object for the given JDBC {@link ResultSetMetaData}. + * + * This method currently performs following type mapping for JDBC SQL data types to corresponding Arrow data types. 
+ * + * CHAR--> ArrowType.Utf8 + * NCHAR --> ArrowType.Utf8 + * VARCHAR --> ArrowType.Utf8 + * NVARCHAR --> ArrowType.Utf8 + * LONGVARCHAR --> ArrowType.Utf8 + * LONGNVARCHAR --> ArrowType.Utf8 + * NUMERIC --> ArrowType.Decimal(precision, scale) + * DECIMAL --> ArrowType.Decimal(precision, scale) + * BIT --> ArrowType.Bool + * TINYINT --> ArrowType.Int(8, signed) + * SMALLINT --> ArrowType.Int(16, signed) + * INTEGER --> ArrowType.Int(32, signed) + * BIGINT --> ArrowType.Int(64, signed) + * REAL --> ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE) + * FLOAT --> ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE) + * DOUBLE --> ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE) + * BINARY --> ArrowType.Binary + * VARBINARY --> ArrowType.Binary + * LONGVARBINARY --> ArrowType.Binary + * DATE --> ArrowType.Date(DateUnit.MILLISECOND) + * TIME --> ArrowType.Time(TimeUnit.MILLISECOND, 32) + * TIMESTAMP --> ArrowType.Timestamp(TimeUnit.MILLISECOND, timezone=null) + * CLOB --> ArrowType.Utf8 + * BLOB --> ArrowType.Binary + * + * @param rsmd + * @return {@link Schema} + * @throws SQLException + */ +public static Schema jdbcToArrowSchema(ResultSetMetaData rsmd) throws SQLException { + +assert rsmd != null; + +//ImmutableList.Builder fields = ImmutableList.builder(); +List fields = new ArrayList<>(); +int columnCount = rsmd.getColumnCount(); +for (int i = 1; i <= columnCount; i++) { +String columnName = rsmd.getColumnName(i); +switch (rsmd.getColumnType(i)) { +case Types.BOOLEAN: +case Types.BIT: +fields.add(new Field(columnName, FieldType.nullable(new ArrowType.Bool()), null)); +break; +case Types.TINYINT: +fields.add(new Field(columnName, FieldType.nullable(new ArrowType.Int(8, true)), null)); +break; +case Types.SMALLINT: +fields.add(new Field(columnName, FieldType.nullable(new ArrowType.Int(16, true)), null)); +break; +case Types.INTEGER: +fields.add(new Field(columnName, FieldType.nullable(new ArrowType.Int(32, true)), null)); +break; +case
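The mapping listed in the Javadoc above can be restated as a lookup table. A Python sketch for illustration only (the string values name the Arrow types from the comment; this is neither the Java adapter's API nor pyarrow):

```python
# JDBC SQL type name -> Arrow type, as documented for jdbcToArrowSchema.
JDBC_TO_ARROW = {
    "CHAR": "Utf8", "NCHAR": "Utf8", "VARCHAR": "Utf8", "NVARCHAR": "Utf8",
    "LONGVARCHAR": "Utf8", "LONGNVARCHAR": "Utf8", "CLOB": "Utf8",
    "NUMERIC": "Decimal(precision, scale)", "DECIMAL": "Decimal(precision, scale)",
    "BIT": "Bool", "BOOLEAN": "Bool",
    "TINYINT": "Int(8, signed)", "SMALLINT": "Int(16, signed)",
    "INTEGER": "Int(32, signed)", "BIGINT": "Int(64, signed)",
    "REAL": "FloatingPoint(SINGLE)", "FLOAT": "FloatingPoint(SINGLE)",
    "DOUBLE": "FloatingPoint(DOUBLE)",
    "BINARY": "Binary", "VARBINARY": "Binary", "LONGVARBINARY": "Binary",
    "BLOB": "Binary",
    "DATE": "Date(MILLISECOND)",
    "TIME": "Time(MILLISECOND, 32)",
    "TIMESTAMP": "Timestamp(MILLISECOND, timezone=null)",
}

def arrow_type_for(jdbc_type):
    # Unmapped JDBC types (e.g. ARRAY, STRUCT) would need explicit handling.
    return JDBC_TO_ARROW.get(jdbc_type)
```

Driving the per-column `switch` from a table like this keeps the schema mapping in one place and makes gaps in coverage easy to spot.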
[jira] [Updated] (ARROW-2399) Builder should not provide a set() method
[ https://issues.apache.org/jira/browse/ARROW-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Roos updated ARROW-2399:
Description: Arrays should be immutable, but we have a `set` method on Buffer that should not be there. This is only used from the Bitmap struct. Perhaps Bitmap should maintain its own memory instead and not use Buffer?

> Builder should not provide a set() method
> -----------------------------------------
>
> Key: ARROW-2399
> URL: https://issues.apache.org/jira/browse/ARROW-2399
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust
> Reporter: Andy Grove
> Priority: Major
> Fix For: 0.10.0
>
> Arrays should be immutable, but we have a `set` method on Buffer that should not be there.
> This is only used from the Bitmap struct. Perhaps Bitmap should maintain its own memory instead and not use Buffer?
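The design point generalizes beyond Rust: mutation belongs on the builder, and the finished buffer should be frozen. A stdlib Python analogy (bytearray standing in for a Builder/Bitmap, bytes for the immutable Buffer):

```python
builder = bytearray()        # mutable while building, like a Builder/Bitmap
builder.extend(b"\x01\x02")
builder[0] = 0xFF            # a set() here is fine: we are still building

buffer = bytes(builder)      # freeze into the finished, immutable Buffer
try:
    buffer[0] = 0            # the frozen buffer rejects in-place mutation
except TypeError:
    mutation_rejected = True
```

Moving `set` off Buffer (or giving Bitmap its own memory) would enforce the same split at the type level.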
[jira] [Commented] (ARROW-2399) Builder should not provide a set() method
[ https://issues.apache.org/jira/browse/ARROW-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431021#comment-16431021 ] Antoine Pitrou commented on ARROW-2399: --- Could you also please prefix Rust issues with "[Rust]", so that the list of issues gives more information? Thanks :-) > Builder should not provide a set() method > > > Key: ARROW-2399 > URL: https://issues.apache.org/jira/browse/ARROW-2399 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Andy Grove >Priority: Major > Fix For: 0.10.0 > > > Arrays should be immutable, but we have a `set` method on Buffer that > should not be there. > This is only used from the Bitmap struct. Perhaps Bitmap should maintain its > own memory instead and not use Buffer? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1780) JDBC Adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/ARROW-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431015#comment-16431015 ] ASF GitHub Bot commented on ARROW-1780: --- atuldambalkar commented on a change in pull request #1759: ARROW-1780 - [WIP] JDBC Adapter to convert Relational Data objects to Arrow Data Format Vector Objects URL: https://github.com/apache/arrow/pull/1759#discussion_r180185358 ## File path: java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrowUtils.java ## @@ -0,0 +1,343 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.arrow.adapter.jdbc; + +import org.apache.arrow.vector.*; +import org.apache.arrow.vector.types.DateUnit; +import org.apache.arrow.vector.types.TimeUnit; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.FieldType; +import org.apache.arrow.vector.types.pojo.Schema; + +import java.nio.charset.Charset; +import java.sql.*; +import java.util.ArrayList; +import java.util.List; + +import static org.apache.arrow.vector.types.FloatingPointPrecision.DOUBLE; +import static org.apache.arrow.vector.types.FloatingPointPrecision.SINGLE; + + +/** + * Class that does most of the work to convert JDBC ResultSet data into Arrow columnar format Vector objects. + * + * @since 0.10.0 + */ +public class JdbcToArrowUtils { + +private static final int DEFAULT_BUFFER_SIZE = 256; + +/** + * Create Arrow {@link Schema} object for the given JDBC {@link ResultSetMetaData}. + * + * This method currently performs following type mapping for JDBC SQL data types to corresponding Arrow data types. 
+ * + * CHAR--> ArrowType.Utf8 + * NCHAR --> ArrowType.Utf8 + * VARCHAR --> ArrowType.Utf8 + * NVARCHAR --> ArrowType.Utf8 + * LONGVARCHAR --> ArrowType.Utf8 + * LONGNVARCHAR --> ArrowType.Utf8 + * NUMERIC --> ArrowType.Decimal(precision, scale) + * DECIMAL --> ArrowType.Decimal(precision, scale) + * BIT --> ArrowType.Bool + * TINYINT --> ArrowType.Int(8, signed) + * SMALLINT --> ArrowType.Int(16, signed) + * INTEGER --> ArrowType.Int(32, signed) + * BIGINT --> ArrowType.Int(64, signed) + * REAL --> ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE) + * FLOAT --> ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE) + * DOUBLE --> ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE) + * BINARY --> ArrowType.Binary + * VARBINARY --> ArrowType.Binary + * LONGVARBINARY --> ArrowType.Binary + * DATE --> ArrowType.Date(DateUnit.MILLISECOND) + * TIME --> ArrowType.Time(TimeUnit.MILLISECOND, 32) + * TIMESTAMP --> ArrowType.Timestamp(TimeUnit.MILLISECOND, timezone=null) + * CLOB --> ArrowType.Utf8 + * BLOB --> ArrowType.Binary + * + * @param rsmd + * @return {@link Schema} + * @throws SQLException + */ +public static Schema jdbcToArrowSchema(ResultSetMetaData rsmd) throws SQLException { + +assert rsmd != null; + +//ImmutableList.Builder fields = ImmutableList.builder(); +List fields = new ArrayList<>(); +int columnCount = rsmd.getColumnCount(); +for (int i = 1; i <= columnCount; i++) { +String columnName = rsmd.getColumnName(i); +switch (rsmd.getColumnType(i)) { +case Types.BOOLEAN: +case Types.BIT: +fields.add(new Field(columnName, FieldType.nullable(new ArrowType.Bool()), null)); +break; +case Types.TINYINT: +fields.add(new Field(columnName, FieldType.nullable(new ArrowType.Int(8, true)), null)); +break; +case Types.SMALLINT: +fields.add(new Field(columnName, FieldType.nullable(new ArrowType.Int(16, true)), null)); +break; +case Types.INTEGER: +fields.add(new Field(columnName, FieldType.nullable(new ArrowType.Int(32, true)), null)); +break; +case
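The Javadoc above documents the JDBC-to-Arrow type mapping as a table. As a compact summary, that mapping can be modeled as a lookup keyed by `java.sql.Types` names. The sketch below is purely illustrative (the adapter itself is Java, and the Arrow type labels here are informal pyarrow-style names, not the adapter's actual API):

```python
# Illustrative model of the JDBC -> Arrow type mapping documented in the
# Javadoc above. Keys mirror java.sql.Types constant names; values are
# informal Arrow type labels, not real pyarrow/Java objects.
JDBC_TO_ARROW = {
    "CHAR": "utf8", "NCHAR": "utf8", "VARCHAR": "utf8", "NVARCHAR": "utf8",
    "LONGVARCHAR": "utf8", "LONGNVARCHAR": "utf8", "CLOB": "utf8",
    "NUMERIC": "decimal(precision, scale)",
    "DECIMAL": "decimal(precision, scale)",
    "BIT": "bool", "BOOLEAN": "bool",
    "TINYINT": "int8", "SMALLINT": "int16",
    "INTEGER": "int32", "BIGINT": "int64",
    "REAL": "float32", "FLOAT": "float32", "DOUBLE": "float64",
    "BINARY": "binary", "VARBINARY": "binary",
    "LONGVARBINARY": "binary", "BLOB": "binary",
    "DATE": "date64[ms]", "TIME": "time32[ms]", "TIMESTAMP": "timestamp[ms]",
}

def jdbc_to_arrow_type(jdbc_type: str) -> str:
    """Return the Arrow type label for a JDBC SQL type name."""
    return JDBC_TO_ARROW[jdbc_type]
```

Note that every JDBC character type (including CLOB) collapses to `utf8`, and every raw-bytes type (including BLOB) collapses to `binary`; only the numeric, date, and time families keep width/unit distinctions.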
[jira] [Assigned] (ARROW-2328) Writing a slice with feather ignores the offset
[ https://issues.apache.org/jira/browse/ARROW-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn reassigned ARROW-2328: -- Assignee: Adrian > Writing a slice with feather ignores the offset > --- > > Key: ARROW-2328 > URL: https://issues.apache.org/jira/browse/ARROW-2328 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.8.0 >Reporter: Adrian >Assignee: Adrian >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > Writing a slice from row n of length m of an array to feather would write the > first m rows, instead of the rows starting at n. > The null bitmap also ends up misaligned. Also tested and fixed in the pull > request below. > I've created a pull request with tests and fix here: > [Pullrequest#1766|https://github.com/apache/arrow/pull/1766] > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2328) Writing a slice with feather ignores the offset
[ https://issues.apache.org/jira/browse/ARROW-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-2328: - Assignee: Antoine Pitrou > Writing a slice with feather ignores the offset > --- > > Key: ARROW-2328 > URL: https://issues.apache.org/jira/browse/ARROW-2328 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.8.0 >Reporter: Adrian >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > Writing a slice from row n of length m of an array to feather would write the > first m rows, instead of the rows starting at n. > The null bitmap also ends up misaligned. Also tested and fixed in the pull > request below. > I've created a pull request with tests and fix here: > [Pullrequest#1766|https://github.com/apache/arrow/pull/1766] > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2328) Writing a slice with feather ignores the offset
[ https://issues.apache.org/jira/browse/ARROW-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-2328: - Assignee: (was: Antoine Pitrou) > Writing a slice with feather ignores the offset > --- > > Key: ARROW-2328 > URL: https://issues.apache.org/jira/browse/ARROW-2328 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.8.0 >Reporter: Adrian >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > Writing a slice from row n of length m of an array to feather would write the > first m rows, instead of the rows starting at n. > The null bitmap also ends up misaligned. Also tested and fixed in the pull > request below. > I've created a pull request with tests and fix here: > [Pullrequest#1766|https://github.com/apache/arrow/pull/1766] > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2328) Writing a slice with feather ignores the offset
[ https://issues.apache.org/jira/browse/ARROW-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-2328. --- Resolution: Fixed Issue resolved by pull request 1784 [https://github.com/apache/arrow/pull/1784] > Writing a slice with feather ignores the offset > --- > > Key: ARROW-2328 > URL: https://issues.apache.org/jira/browse/ARROW-2328 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.8.0 >Reporter: Adrian >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > Writing a slice from row n of length m of an array to feather would write the > first m rows, instead of the rows starting at n. > The null bitmap also ends up misaligned. Also tested and fixed in the pull > request below. > I've created a pull request with tests and fix here: > [Pullrequest#1766|https://github.com/apache/arrow/pull/1766] > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
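The bug resolved above is easy to model outside of Arrow: a writer that drops the slice's starting offset emits the first `length` rows instead of the rows beginning at `offset`. The sketch below uses a plain Python list in place of an Arrow value buffer (names are illustrative, not Arrow's actual writer API):

```python
# Minimal model of ARROW-2328: writing a slice must honour the slice's
# starting offset, not just its length. `values` stands in for an Arrow
# array's value buffer.

def write_slice_buggy(values, offset, length):
    # Pre-fix behaviour: the offset is silently ignored.
    return values[:length]

def write_slice_fixed(values, offset, length):
    # Post-fix behaviour: start reading at `offset`.
    return values[offset:offset + length]
```

For `values = [0..9]`, `offset = 3`, `length = 4`, the buggy writer emits `[0, 1, 2, 3]` while the fixed one emits `[3, 4, 5, 6]`; the same off-by-offset error is what misaligned the null bitmap in the feather output.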
[jira] [Updated] (ARROW-2427) [C++] ReadAt implementations suboptimal
[ https://issues.apache.org/jira/browse/ARROW-2427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2427: -- Labels: pull-request-available (was: ) > [C++] ReadAt implementations suboptimal > --- > > Key: ARROW-2427 > URL: https://issues.apache.org/jira/browse/ARROW-2427 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.9.0 >Reporter: Antoine Pitrou >Priority: Major > Labels: pull-request-available > > The {{ReadAt}} implementations for at least {{OSFile}} and > {{MemoryMappedFile}} take the file lock and seek. They could instead read > directly from the given offset, allowing concurrent I/O from multiple threads. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2427) [C++] ReadAt implementations suboptimal
[ https://issues.apache.org/jira/browse/ARROW-2427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430953#comment-16430953 ] ASF GitHub Bot commented on ARROW-2427: --- pitrou opened a new pull request #1867: [WIP] ARROW-2427: [C++] Implement ReadAt properly URL: https://github.com/apache/arrow/pull/1867 Allow for concurrent I/O by avoiding locking and seeking. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] ReadAt implementations suboptimal > --- > > Key: ARROW-2427 > URL: https://issues.apache.org/jira/browse/ARROW-2427 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.9.0 >Reporter: Antoine Pitrou >Priority: Major > Labels: pull-request-available > > The {{ReadAt}} implementations for at least {{OSFile}} and > {{MemoryMappedFile}} take the file lock and seek. They could instead read > directly from the given offset, allowing concurrent I/O from multiple threads. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
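The idea behind the pull request above is to read at an absolute offset without taking the file lock or moving the shared file position, so multiple threads can issue reads concurrently. On POSIX systems this is what `pread(2)` provides; Python exposes it as `os.pread`, which the sketch below uses to illustrate the pattern (this is a model of the approach, not the C++ implementation, and the Windows side would need an equivalent positioned-read call):

```python
import os

# Sketch of a lock-free ReadAt in the spirit of ARROW-2427: os.pread reads
# `nbytes` at an absolute `offset` without changing the file descriptor's
# current position, so concurrent callers never race on a shared seek.

def read_at(fd, offset, nbytes):
    """Read nbytes starting at offset, leaving the file position untouched."""
    return os.pread(fd, nbytes, offset)
```

Because `read_at` never seeks, two threads calling it with different offsets on the same descriptor cannot corrupt each other's reads, which is exactly the concurrency the seek-based implementation had to serialize with a lock.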
[jira] [Commented] (ARROW-2391) [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
[ https://issues.apache.org/jira/browse/ARROW-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430843#comment-16430843 ] ASF GitHub Bot commented on ARROW-2391: --- kszucs commented on a change in pull request #1859: ARROW-2391: [C++/Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64 URL: https://github.com/apache/arrow/pull/1859#discussion_r180156054 ## File path: cpp/src/arrow/compute/kernels/cast.cc ## @@ -396,21 +396,34 @@ struct CastFunctor{ ShiftTime (ctx, options, conversion.first, conversion.second, input, output); -internal::BitmapReader bit_reader(input.buffers[0]->data(), input.offset, - input.length); +if (input.null_count != 0) { + internal::BitmapReader bit_reader(input.buffers[0]->data(), input.offset, +input.length); -// Ensure that intraday milliseconds have been zeroed out -auto out_data = GetMutableValues(output, 1); -for (int64_t i = 0; i < input.length; ++i) { - const int64_t remainder = out_data[i] % kMillisecondsInDay; - if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && bit_reader.IsSet() && - remainder > 0)) { -ctx->SetStatus( -Status::Invalid("Timestamp value had non-zero intraday milliseconds")); -break; + // Ensure that intraday milliseconds have been zeroed out + auto out_data = GetMutableValues(output, 1); + for (int64_t i = 0; i < input.length; ++i) { +const int64_t remainder = out_data[i] % kMillisecondsInDay; +if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && bit_reader.IsSet() && +remainder > 0)) { + ctx->SetStatus( + Status::Invalid("Timestamp value had non-zero intraday milliseconds")); + break; +} +out_data[i] -= remainder; +bit_reader.Next(); + } +} else { + auto out_data = GetMutableValues(output, 1); + for (int64_t i = 0; i < input.length; ++i) { +const int64_t remainder = out_data[i] % kMillisecondsInDay; +if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && remainder > 0)) { Review comment: No problem :) I'm still learning arrow. 
This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Segmentation fault from PyArrow when mapping Pandas datetime column > to pyarrow.date64 > -- > > Key: ARROW-2391 > URL: https://issues.apache.org/jira/browse/ARROW-2391 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 > Environment: Mac OS High Sierra > Python 3.6 >Reporter: Dave Challis >Priority: Major > Labels: pull-request-available > > When trying to call `pyarrow.Table.from_pandas` with a `pandas.DataFrame` and > a `pyarrow.Schema` provided, the function call results in a segmentation > fault if the Pandas `datetime64[ns]` column tries to be converted to a > `pyarrow.date64` type. > A minimal example which shows this is: > {code:python} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame({'created': ['2018-05-10T10:24:01']}) > df['created'] = pd.to_datetime(df['created']) > schema = pa.schema([pa.field('created', pa.date64())]) > pa.Table.from_pandas(df, schema=schema) > {code} > Executing the above causes the python interpreter to exit with "Segmentation > fault: 11". > Attempting to convert into various other datatypes (by specifying different > schemas) either succeeds, or raises an exception if the conversion is invalid. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
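The C++ change under review above splits the cast loop into a null-aware path (which walks the validity bitmap) and a faster null-free path (which skips bitmap reads entirely). A pure-Python model of the shared logic is below; `valid` stands in for the validity bitmap and all names are illustrative, not Arrow's kernel API:

```python
MS_PER_DAY = 86_400_000

# Model of the timestamp[ms] -> date64 cast kernel discussed above:
# intraday milliseconds must be zeroed, and unless truncation is allowed,
# a *valid* value with a non-zero remainder is an error. Null slots are
# never checked, but their storage is still normalised, mirroring the C++.

def zero_intraday_ms(values, valid=None, allow_truncate=False):
    out = []
    for i, v in enumerate(values):
        rem = v % MS_PER_DAY
        is_valid = valid[i] if valid is not None else True
        if not allow_truncate and is_valid and rem > 0:
            raise ValueError("Timestamp value had non-zero intraday milliseconds")
        out.append(v - rem)
    return out
```

When `valid is None` (the no-nulls fast path), the per-element bitmap lookup disappears, which is the whole point of the patch: the remainder arithmetic is identical on both paths.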
[jira] [Commented] (ARROW-2328) Writing a slice with feather ignores the offset
[ https://issues.apache.org/jira/browse/ARROW-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430786#comment-16430786 ] ASF GitHub Bot commented on ARROW-2328: --- pitrou commented on issue #1784: ARROW-2328: [C++] Fixed and unit tested feather writing with slice URL: https://github.com/apache/arrow/pull/1784#issuecomment-379806635 Thank you! I will merge once the AppVeyor build passes (the Travis-CI failures in the Rust and glib builds are unrelated). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Writing a slice with feather ignores the offset > --- > > Key: ARROW-2328 > URL: https://issues.apache.org/jira/browse/ARROW-2328 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.8.0 >Reporter: Adrian >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > Writing a slice from row n of length m of an array to feather would write the > first m rows, instead of the rows starting at n. > The null bitmap also ends up misaligned. Also tested and fixed in the pull > request below. > I've created a pull request with tests and fix here: > [Pullrequest#1766|https://github.com/apache/arrow/pull/1766] > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2391) [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
[ https://issues.apache.org/jira/browse/ARROW-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430775#comment-16430775 ] ASF GitHub Bot commented on ARROW-2391: --- pitrou commented on issue #1859: ARROW-2391: [C++/Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64 URL: https://github.com/apache/arrow/pull/1859#issuecomment-379803722 Waiting for the AppVeyor build before merging this. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Segmentation fault from PyArrow when mapping Pandas datetime column > to pyarrow.date64 > -- > > Key: ARROW-2391 > URL: https://issues.apache.org/jira/browse/ARROW-2391 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 > Environment: Mac OS High Sierra > Python 3.6 >Reporter: Dave Challis >Priority: Major > Labels: pull-request-available > > When trying to call `pyarrow.Table.from_pandas` with a `pandas.DataFrame` and > a `pyarrow.Schema` provided, the function call results in a segmentation > fault if the Pandas `datetime64[ns]` column tries to be converted to a > `pyarrow.date64` type. > A minimal example which shows this is: > {code:python} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame({'created': ['2018-05-10T10:24:01']}) > df['created'] = pd.to_datetime(df['created']) > schema = pa.schema([pa.field('created', pa.date64())]) > pa.Table.from_pandas(df, schema=schema) > {code} > Executing the above causes the python interpreter to exit with "Segmentation > fault: 11". > Attempting to convert into various other datatypes (by specifying different > schemas) either succeeds, or raises an exception if the conversion is invalid. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2429) [Python] Timestamp unit in schema changes when writing to Parquet file then reading back
[ https://issues.apache.org/jira/browse/ARROW-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Challis updated ARROW-2429: Description: When creating an Arrow table from a Pandas DataFrame, the table schema contains a field of type `timestamp[ns]`. When serialising that table to a parquet file and then immediately reading it back, the schema of the table read instead contains a field with type `timestamp[us]`. Minimal example: {code:python} #!/usr/bin/env python import pyarrow as pa import pyarrow.parquet as pq import pandas as pd # create DataFrame with a datetime column df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) df['created'] = pd.to_datetime(df['created']) # create Arrow table from DataFrame table = pa.Table.from_pandas(df, preserve_index=False) # write the table as a parquet file, then read it back again pq.write_table(table, 'foo.parquet') table2 = pq.read_table('foo.parquet') print(table.schema[0]) # pyarrow.Field (nanosecond units) print(table2.schema[0]) # pyarrow.Field (microsecond units) {code} was: When creating an Arrow table from a Pandas DataFrame, the table schema contains a field of type `timestamp[ns]`. When serialising that table to a parquet file and then immediately reading it back, the schema of the table read instead contains a field with type `timestamp[us]`. 
{code:python} #!/usr/bin/env python import pyarrow as pa import pyarrow.parquet as pq import pandas as pd # create DataFrame with a datetime column df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) df['created'] = pd.to_datetime(df['created']) # create Arrow table from DataFrame table = pa.Table.from_pandas(df, preserve_index=False) # write the table as a parquet file, then read it back again pq.write_table(table, 'foo.parquet') table2 = pq.read_table('foo.parquet') print(table.schema[0]) # pyarrow.Field (nanosecond units) print(table2.schema[0]) # pyarrow.Field (microsecond units) {code} > [Python] Timestamp unit in schema changes when writing to Parquet file then > reading back > > > Key: ARROW-2429 > URL: https://issues.apache.org/jira/browse/ARROW-2429 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 > Environment: Mac OS High Sierra > PyArrow 0.9.0 (py36_1) > Python >Reporter: Dave Challis >Priority: Minor > > When creating an Arrow table from a Pandas DataFrame, the table schema > contains a field of type `timestamp[ns]`. > When serialising that table to a parquet file and then immediately reading it > back, the schema of the table read instead contains a field with type > `timestamp[us]`. > Minimal example: > > {code:python} > #!/usr/bin/env python > import pyarrow as pa > import pyarrow.parquet as pq > import pandas as pd > # create DataFrame with a datetime column > df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) > df['created'] = pd.to_datetime(df['created']) > # create Arrow table from DataFrame > table = pa.Table.from_pandas(df, preserve_index=False) > # write the table as a parquet file, then read it back again > pq.write_table(table, 'foo.parquet') > table2 = pq.read_table('foo.parquet') > print(table.schema[0]) # pyarrow.Field (nanosecond > units) > print(table2.schema[0]) # pyarrow.Field (microsecond > units) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2429) [Python] Timestamp unit in schema changes when writing to Parquet file then reading back
Dave Challis created ARROW-2429: --- Summary: [Python] Timestamp unit in schema changes when writing to Parquet file then reading back Key: ARROW-2429 URL: https://issues.apache.org/jira/browse/ARROW-2429 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.9.0 Environment: Mac OS High Sierra PyArrow 0.9.0 (py36_1) Python Reporter: Dave Challis When creating an Arrow table from a Pandas DataFrame, the table schema contains a field of type `timestamp[ns]`. When serialising that table to a parquet file and then immediately reading it back, the schema of the table read instead contains a field with type `timestamp[us]`. {code:python} #!/usr/bin/env python import pyarrow as pa import pyarrow.parquet as pq import pandas as pd # create DataFrame with a datetime column df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) df['created'] = pd.to_datetime(df['created']) # create Arrow table from DataFrame table = pa.Table.from_pandas(df, preserve_index=False) # write the table as a parquet file, then read it back again pq.write_table(table, 'foo.parquet') table2 = pq.read_table('foo.parquet') print(table.schema[0]) # pyarrow.Field (nanosecond units) print(table2.schema[0]) # pyarrow.Field (microsecond units) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2429) [Python] Timestamp unit in schema changes when writing to Parquet file then reading back
[ https://issues.apache.org/jira/browse/ARROW-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Challis updated ARROW-2429: Description: When creating an Arrow table from a Pandas DataFrame, the table schema contains a field of type `timestamp[ns]`. When serialising that table to a parquet file and then immediately reading it back, the schema of the table read instead contains a field with type `timestamp[us]`. {code:python} #!/usr/bin/env python import pyarrow as pa import pyarrow.parquet as pq import pandas as pd # create DataFrame with a datetime column df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) df['created'] = pd.to_datetime(df['created']) # create Arrow table from DataFrame table = pa.Table.from_pandas(df, preserve_index=False) # write the table as a parquet file, then read it back again pq.write_table(table, 'foo.parquet') table2 = pq.read_table('foo.parquet') print(table.schema[0]) # pyarrow.Field (nanosecond units) print(table2.schema[0]) # pyarrow.Field (microsecond units) {code} was: When creating an Arrow table from a Pandas DataFrame, the table schema contains a field of type `timestamp[ns]`. When serialising that table to a parquet file and then immediately reading it back, the schema of the table read instead contains a field with type `timestamp[us]`. 
{code:python} #!/usr/bin/env python import pyarrow as pa import pyarrow.parquet as pq import pandas as pd # create DataFrame with a datetime column df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) df['created'] = pd.to_datetime(df['created']) # create Arrow table from DataFrame table = pa.Table.from_pandas(df, preserve_index=False) # write the table as a parquet file, then read it back again pq.write_table(table, 'foo.parquet') table2 = pq.read_table('foo.parquet') print(table.schema[0]) # pyarrow.Field (nanosecond units) print(table2.schema[0]) # pyarrow.Field (microsecond units) {code} > [Python] Timestamp unit in schema changes when writing to Parquet file then > reading back > > > Key: ARROW-2429 > URL: https://issues.apache.org/jira/browse/ARROW-2429 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 > Environment: Mac OS High Sierra > PyArrow 0.9.0 (py36_1) > Python >Reporter: Dave Challis >Priority: Minor > > When creating an Arrow table from a Pandas DataFrame, the table schema > contains a field of type `timestamp[ns]`. > When serialising that table to a parquet file and then immediately reading it > back, the schema of the table read instead contains a field with type > `timestamp[us]`. > > {code:python} > #!/usr/bin/env python > import pyarrow as pa > import pyarrow.parquet as pq > import pandas as pd > # create DataFrame with a datetime column > df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) > df['created'] = pd.to_datetime(df['created']) > # create Arrow table from DataFrame > table = pa.Table.from_pandas(df, preserve_index=False) > # write the table as a parquet file, then read it back again > pq.write_table(table, 'foo.parquet') > table2 = pq.read_table('foo.parquet') > print(table.schema[0]) # pyarrow.Field (nanosecond > units) > print(table2.schema[0]) # pyarrow.Field (microsecond > units) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
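The unit change reported in ARROW-2429 comes down to resolution: the Parquet write path stores the timestamps at microsecond resolution, so nanosecond values are truncated on write and come back tagged with the coarser unit. A pure-Python model of that round trip (illustrative only, not pyarrow's conversion code):

```python
# Model of the ARROW-2429 round trip: timestamp[ns] -> Parquet (us) -> read
# back as timestamp[us]. Sub-microsecond digits are lost on write and
# cannot be recovered on read.

def ns_to_us(ts_ns):
    # Write path: floor division drops the sub-microsecond remainder.
    return ts_ns // 1_000

def us_to_ns(ts_us):
    # Read path: scaling back up cannot restore the dropped digits.
    return ts_us * 1_000
```

So a nanosecond timestamp ending in `...456_789` comes back ending in `...456_000`, and the schema honestly reports `timestamp[us]` for the data that survived. (Separately, `pyarrow.parquet.write_table` exposes a `coerce_timestamps` option for controlling the written resolution, though whether it helps here depends on the pyarrow version in use.)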
[jira] [Assigned] (ARROW-2100) [Python] Drop Python 3.4 support
[ https://issues.apache.org/jira/browse/ARROW-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-2100: - Assignee: Antoine Pitrou > [Python] Drop Python 3.4 support > > > Key: ARROW-2100 > URL: https://issues.apache.org/jira/browse/ARROW-2100 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Uwe L. Korn >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > conda-forge has already dropped it, Pandas dropped it in 0.21, we should also > think of dropping support for it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2305) [Python] Cython 0.25.2 compilation failure
[ https://issues.apache.org/jira/browse/ARROW-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430753#comment-16430753 ] ASF GitHub Bot commented on ARROW-2305: --- pitrou closed pull request #1863: ARROW-2305: [Python] Bump Cython requirement to 0.27+ URL: https://github.com/apache/arrow/pull/1863 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/ci/msvc-build.bat b/ci/msvc-build.bat index 678e29d58..d3f540b2d 100644 --- a/ci/msvc-build.bat +++ b/ci/msvc-build.bat @@ -68,10 +68,8 @@ if "%JOB%" == "Build_Debug" ( exit /B 0 ) -@rem Note: avoid Cython 0.28.0 due to https://github.com/cython/cython/issues/2148 conda create -n arrow -q -y python=%PYTHON% ^ - six pytest setuptools numpy pandas ^ - cython=0.27.3 ^ + six pytest setuptools numpy pandas cython ^ thrift-cpp=0.11.0 call activate arrow diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index aa3c3154c..a776c4263 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -36,13 +36,12 @@ source activate $CONDA_ENV_DIR python --version which python -# Note: avoid Cython 0.28.0 due to https://github.com/cython/cython/issues/2148 conda install -y -q pip \ nomkl \ cloudpickle \ numpy=1.13.1 \ pandas \ - cython=0.27.3 + cython # ARROW-2093: PyTorch increases the size of our conda dependency stack # significantly, and so we have disabled these tests in Travis CI for now diff --git a/dev/release/verify-release-candidate.sh b/dev/release/verify-release-candidate.sh index 34aff209a..ef058d172 100755 --- a/dev/release/verify-release-candidate.sh +++ b/dev/release/verify-release-candidate.sh @@ -104,7 +104,7 @@ setup_miniconda() { numpy \ pandas \ six \ -cython=0.27.3 -c conda-forge +cython -c conda-forge source activate arrow-test } 
diff --git a/python/manylinux1/scripts/build_virtualenvs.sh b/python/manylinux1/scripts/build_virtualenvs.sh index 7e0d80cc7..a983721e9 100755 --- a/python/manylinux1/scripts/build_virtualenvs.sh +++ b/python/manylinux1/scripts/build_virtualenvs.sh @@ -34,7 +34,7 @@ for PYTHON_TUPLE in ${PYTHON_VERSIONS}; do echo "=== (${PYTHON}, ${U_WIDTH}) Installing build dependencies ===" $PIP install "numpy==1.10.4" -$PIP install "cython==0.27.3" +$PIP install "cython==0.28.1" $PIP install "pandas==0.20.3" $PIP install "virtualenv==15.1.0" diff --git a/python/setup.py b/python/setup.py index 7b0f17544..dd042c956 100644 --- a/python/setup.py +++ b/python/setup.py @@ -42,8 +42,8 @@ # Check if we're running 64-bit Python is_64_bit = sys.maxsize > 2**32 -if Cython.__version__ < '0.19.1': -raise Exception('Please upgrade to Cython 0.19.1 or newer') +if Cython.__version__ < '0.27': +raise Exception('Please upgrade to Cython 0.27 or newer') setup_dir = os.path.abspath(os.path.dirname(__file__)) @@ -491,7 +491,7 @@ def parse_version(root): ] }, use_scm_version={"root": "..", "relative_to": __file__, "parse": parse_version}, -setup_requires=['setuptools_scm', 'cython >= 0.23'] + setup_requires, +setup_requires=['setuptools_scm', 'cython >= 0.27'] + setup_requires, install_requires=install_requires, tests_require=['pytest', 'pandas'], description="Python library for Apache Arrow", This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Cython 0.25.2 compilation failure > --- > > Key: ARROW-2305 > URL: https://issues.apache.org/jira/browse/ARROW-2305 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Observed on master branch > {code} > Error compiling Cython file: > > ... 
> if hasattr(self, 'as_py'): > return repr(self.as_py()) > else: > return super(Scalar, self).__repr__() > def __eq__(self, other): >^ > > /home/wesm/code/arrow/python/pyarrow/scalar.pxi:67:4: Special method __eq__ > must be implemented via __richcmp__ > Error compiling Cython file: >
[jira] [Resolved] (ARROW-2305) [Python] Cython 0.25.2 compilation failure
[ https://issues.apache.org/jira/browse/ARROW-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-2305. --- Resolution: Fixed Issue resolved by pull request 1863 [https://github.com/apache/arrow/pull/1863] > [Python] Cython 0.25.2 compilation failure > --- > > Key: ARROW-2305 > URL: https://issues.apache.org/jira/browse/ARROW-2305 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Observed on master branch > {code} > Error compiling Cython file: > > ... > if hasattr(self, 'as_py'): > return repr(self.as_py()) > else: > return super(Scalar, self).__repr__() > def __eq__(self, other): >^ > > /home/wesm/code/arrow/python/pyarrow/scalar.pxi:67:4: Special method __eq__ > must be implemented via __richcmp__ > Error compiling Cython file: > > ... > Return true if the tensors contains exactly equal data > """ > self._validate() > return self.tp.Equals(deref(other.tp)) > def __eq__(self, other): >^ > > /home/wesm/code/arrow/python/pyarrow/array.pxi:571:4: Special method __eq__ > must be implemented via __richcmp__ > Error compiling Cython file: > > ... > cdef c_bool result = False > with nogil: > result = self.buffer.get().Equals(deref(other.buffer.get())) > return result > def __eq__(self, other): >^ > > /home/wesm/code/arrow/python/pyarrow/io.pxi:675:4: Special method __eq__ must > be implemented via __richcmp__ > {code} > Upgrading Cython made this go away. We should probably use {{__richcmp__}} > though -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2305) [Python] Cython 0.25.2 compilation failure
[ https://issues.apache.org/jira/browse/ARROW-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430748#comment-16430748 ] ASF GitHub Bot commented on ARROW-2305: --- pitrou commented on issue #1863: ARROW-2305: [Python] Bump Cython requirement to 0.27+ URL: https://github.com/apache/arrow/pull/1863#issuecomment-379798200 AppVeyor build at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.270 The Travis-CI failure is unrelated. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Cython 0.25.2 compilation failure > --- > > Key: ARROW-2305 > URL: https://issues.apache.org/jira/browse/ARROW-2305 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Observed on master branch > {code} > Error compiling Cython file: > > ... > if hasattr(self, 'as_py'): > return repr(self.as_py()) > else: > return super(Scalar, self).__repr__() > def __eq__(self, other): >^ > > /home/wesm/code/arrow/python/pyarrow/scalar.pxi:67:4: Special method __eq__ > must be implemented via __richcmp__ > Error compiling Cython file: > > ... > Return true if the tensors contains exactly equal data > """ > self._validate() > return self.tp.Equals(deref(other.tp)) > def __eq__(self, other): >^ > > /home/wesm/code/arrow/python/pyarrow/array.pxi:571:4: Special method __eq__ > must be implemented via __richcmp__ > Error compiling Cython file: > > ... > cdef c_bool result = False > with nogil: > result = self.buffer.get().Equals(deref(other.buffer.get())) > return result > def __eq__(self, other): >^ > > /home/wesm/code/arrow/python/pyarrow/io.pxi:675:4: Special method __eq__ must > be implemented via __richcmp__ > {code} > Upgrading Cython made this go away. 
We should probably use {{__richcmp__}} though. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
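The {{__richcmp__}} requirement above can be illustrated in plain Python (a sketch for illustration only, not pyarrow or Cython source): on cdef classes, older Cython routes all six comparison operators through a single entry point that receives an integer op code matching CPython's Py_LT..Py_GE constants, instead of separate __eq__/__ne__ methods.

```python
# Sketch (illustrative, not Arrow code) of the single rich-comparison
# entry point older Cython requires on cdef classes. Op codes match
# CPython's object.h: Py_LT=0, Py_LE=1, Py_EQ=2, Py_NE=3, Py_GT=4, Py_GE=5.
Py_LT, Py_LE, Py_EQ, Py_NE, Py_GT, Py_GE = range(6)

class Scalar:
    def __init__(self, value):
        self.value = value

    def richcmp(self, other, op):
        # In Cython this would be `def __richcmp__(self, other, int op)`;
        # defining a plain __eq__ instead is what triggered the build error.
        if op == Py_EQ:
            return self.value == other.value
        if op == Py_NE:
            return self.value != other.value
        return NotImplemented
```

Upgrading to Cython 0.27+ lifts the restriction, which is why the PR bumps the requirement rather than rewriting the `__eq__` methods.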
[jira] [Resolved] (ARROW-2100) [Python] Drop Python 3.4 support
[ https://issues.apache.org/jira/browse/ARROW-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-2100. --- Resolution: Fixed Issue resolved by pull request 1862 [https://github.com/apache/arrow/pull/1862] > [Python] Drop Python 3.4 support > > > Key: ARROW-2100 > URL: https://issues.apache.org/jira/browse/ARROW-2100 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > conda-forge has already dropped it, Pandas dropped it in 0.21, we should also > think of dropping support for it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2100) [Python] Drop Python 3.4 support
[ https://issues.apache.org/jira/browse/ARROW-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430746#comment-16430746 ] ASF GitHub Bot commented on ARROW-2100: --- pitrou closed pull request #1862: ARROW-2100: [Python] Drop Python 3.4 support URL: https://github.com/apache/arrow/pull/1862 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/python/manylinux1/build_arrow.sh b/python/manylinux1/build_arrow.sh index 6697733d0..9742da09f 100755 --- a/python/manylinux1/build_arrow.sh +++ b/python/manylinux1/build_arrow.sh @@ -26,7 +26,7 @@ # * Copyright (c) 2013-2016, Matt Terry and Matthew Brett (BSD 2-clause) # Build different python versions with various unicode widths -PYTHON_VERSIONS="${PYTHON_VERSIONS:-2.7,16 2.7,32 3.4,16 3.5,16 3.6,16}" +PYTHON_VERSIONS="${PYTHON_VERSIONS:-2.7,16 2.7,32 3.5,16 3.6,16}" source /multibuild/manylinux_utils.sh diff --git a/python/setup.py b/python/setup.py index 7b0f17544..d9a68846b 100644 --- a/python/setup.py +++ b/python/setup.py @@ -500,7 +500,6 @@ def parse_version(root): classifiers=[ 'License :: OSI Approved :: Apache Software License', 'Programming Language :: Python :: 2.7', -'Programming Language :: Python :: 3.4', 'Programming Language :: Python :: 3.5', 'Programming Language :: Python :: 3.6' ], This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Drop Python 3.4 support > > > Key: ARROW-2100 > URL: https://issues.apache.org/jira/browse/ARROW-2100 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Uwe L. 
Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > conda-forge has already dropped it, Pandas dropped it in 0.21, we should also > think of dropping support for it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2428) [Python] Support ExtensionArrays in to_pandas conversion
[ https://issues.apache.org/jira/browse/ARROW-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-2428: --- Labels: beginner (was: ) > [Python] Support ExtensionArrays in to_pandas conversion > > > Key: ARROW-2428 > URL: https://issues.apache.org/jira/browse/ARROW-2428 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Uwe L. Korn >Priority: Major > Labels: beginner > Fix For: 1.0.0 > > > With the next release of Pandas, it will be possible to define custom column > types that back a {{pandas.Series}}. Thus we will not be able to cover all > possible column types in the {{to_pandas}} conversion by default as we won't > be aware of all extension arrays. > To enable users to create {{ExtensionArray}} instances from Arrow columns in > the {{to_pandas}} conversion, we should provide a hook in the {{to_pandas}} > call where they can overload the default conversion routines with the ones > that produce their {{ExtensionArray}} instances. > This should avoid additional copies in the case where we would nowadays first > convert the Arrow column into a default Pandas column (probably of object > type) and the user would afterwards convert it to a more efficient > {{ExtensionArray}}. This hook here will be especially useful when you build > {{ExtensionArrays}} where the storage is backed by Arrow. > The meta-issue that tracks the implementation inside of Pandas is: > https://github.com/pandas-dev/pandas/issues/19696 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2428) [Python] Support ExtensionArrays in to_pandas conversion
Uwe L. Korn created ARROW-2428: -- Summary: [Python] Support ExtensionArrays in to_pandas conversion Key: ARROW-2428 URL: https://issues.apache.org/jira/browse/ARROW-2428 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Uwe L. Korn Fix For: 1.0.0 With the next release of Pandas, it will be possible to define custom column types that back a {{pandas.Series}}. Thus we will not be able to cover all possible column types in the {{to_pandas}} conversion by default as we won't be aware of all extension arrays. To enable users to create {{ExtensionArray}} instances from Arrow columns in the {{to_pandas}} conversion, we should provide a hook in the {{to_pandas}} call where they can overload the default conversion routines with the ones that produce their {{ExtensionArray}} instances. This should avoid additional copies in the case where we would nowadays first convert the Arrow column into a default Pandas column (probably of object type) and the user would afterwards convert it to a more efficient {{ExtensionArray}}. This hook here will be especially useful when you build {{ExtensionArrays}} where the storage is backed by Arrow. The meta-issue that tracks the implementation inside of Pandas is: https://github.com/pandas-dev/pandas/issues/19696 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
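The hook the issue proposes could look roughly like the following pure-Python sketch. Every name here is hypothetical (registry, function names, string type keys) and is not the eventual pyarrow API; it only shows the shape of the idea: a per-type conversion registry consulted before the default to_pandas path.

```python
# Hypothetical sketch of the proposed to_pandas conversion hook; the
# registry, function names, and type-name keys are all illustrative.
_conversion_hooks = {}

def register_conversion(arrow_type_name, converter):
    """Let users supply the function that builds their ExtensionArray."""
    _conversion_hooks[arrow_type_name] = converter

def convert_column(arrow_type_name, values):
    hook = _conversion_hooks.get(arrow_type_name)
    if hook is not None:
        # User-provided path: skips the intermediate object-dtype copy
        # the issue wants to avoid.
        return hook(values)
    # Default path: stand-in for the normal to_pandas conversion.
    return list(values)

# A user registers their own conversion for a type they care about.
register_conversion("decimal128", lambda vals: [str(v) for v in vals])
result = convert_column("decimal128", [1, 2, 3])
```

The payoff described in the issue is that a user-registered converter can build its ExtensionArray directly from the Arrow buffers, instead of Arrow first materializing an object-dtype column that the user then re-converts.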
[jira] [Commented] (ARROW-2424) [Rust] Missing import causing broken build
[ https://issues.apache.org/jira/browse/ARROW-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430737#comment-16430737 ] ASF GitHub Bot commented on ARROW-2424: --- pitrou closed pull request #1864: ARROW-2424: [Rust] Fix build - add missing import URL: https://github.com/apache/arrow/pull/1864 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/rust/src/builder.rs b/rust/src/builder.rs index 832b2a4a8..9915a8b52 100644 --- a/rust/src/builder.rs +++ b/rust/src/builder.rs @@ -18,6 +18,7 @@ use libc; use std::mem; use std::ptr; +use std::slice; use super::buffer::*; use super::memory::*; This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Missing import causing broken build > -- > > Key: ARROW-2424 > URL: https://issues.apache.org/jira/browse/ARROW-2424 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: JS-0.4.0 > > > Recent merges broke the build. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2424) [Rust] Missing import causing broken build
[ https://issues.apache.org/jira/browse/ARROW-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-2424. --- Resolution: Fixed Fix Version/s: (was: 0.10.0) JS-0.4.0 Issue resolved by pull request 1864 [https://github.com/apache/arrow/pull/1864] > [Rust] Missing import causing broken build > -- > > Key: ARROW-2424 > URL: https://issues.apache.org/jira/browse/ARROW-2424 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: JS-0.4.0 > > > Recent merges broke the build. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2391) [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
[ https://issues.apache.org/jira/browse/ARROW-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430734#comment-16430734 ] ASF GitHub Bot commented on ARROW-2391: --- pitrou commented on a change in pull request #1859: ARROW-2391: [C++/Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64 URL: https://github.com/apache/arrow/pull/1859#discussion_r180136727 ## File path: cpp/src/arrow/compute/kernels/cast.cc ## @@ -396,21 +396,34 @@ struct CastFunctor{ ShiftTime (ctx, options, conversion.first, conversion.second, input, output); -internal::BitmapReader bit_reader(input.buffers[0]->data(), input.offset, - input.length); +if (input.null_count != 0) { + internal::BitmapReader bit_reader(input.buffers[0]->data(), input.offset, +input.length); -// Ensure that intraday milliseconds have been zeroed out -auto out_data = GetMutableValues(output, 1); -for (int64_t i = 0; i < input.length; ++i) { - const int64_t remainder = out_data[i] % kMillisecondsInDay; - if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && bit_reader.IsSet() && - remainder > 0)) { -ctx->SetStatus( -Status::Invalid("Timestamp value had non-zero intraday milliseconds")); -break; + // Ensure that intraday milliseconds have been zeroed out + auto out_data = GetMutableValues(output, 1); + for (int64_t i = 0; i < input.length; ++i) { +const int64_t remainder = out_data[i] % kMillisecondsInDay; +if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && bit_reader.IsSet() && +remainder > 0)) { + ctx->SetStatus( + Status::Invalid("Timestamp value had non-zero intraday milliseconds")); + break; +} +out_data[i] -= remainder; +bit_reader.Next(); + } +} else { + auto out_data = GetMutableValues(output, 1); + for (int64_t i = 0; i < input.length; ++i) { +const int64_t remainder = out_data[i] % kMillisecondsInDay; +if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && remainder > 0)) { Review comment: Wow. 
Sorry, I had completely overlooked the `out_data[i] -= remainder;` line :-S This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Segmentation fault from PyArrow when mapping Pandas datetime column > to pyarrow.date64 > -- > > Key: ARROW-2391 > URL: https://issues.apache.org/jira/browse/ARROW-2391 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 > Environment: Mac OS High Sierra > Python 3.6 >Reporter: Dave Challis >Priority: Major > Labels: pull-request-available > > When trying to call `pyarrow.Table.from_pandas` with a `pandas.DataFrame` and > a `pyarrow.Schema` provided, the function call results in a segmentation > fault if the Pandas `datetime64[ns]` column tries to be converted to a > `pyarrow.date64` type. > A minimal example which shows this is: > {code:python} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame({'created': ['2018-05-10T10:24:01']}) > df['created'] = pd.to_datetime(df['created']) > schema = pa.schema([pa.field('created', pa.date64())]) > pa.Table.from_pandas(df, schema=schema) > {code} > Executing the above causes the python interpreter to exit with "Segmentation > fault: 11". > Attempting to convert into various other datatypes (by specifying different > schemas) either succeeds, or raises an exception if the conversion is invalid. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
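The kernel logic under review can be sketched in Python (an illustration of the C++ code above, not a replacement for it): each valid slot that carries non-zero intraday milliseconds is an error unless truncation is allowed, and every output value is floored to a midnight boundary.

```python
MILLISECONDS_IN_DAY = 86_400_000

def timestamp_ms_to_date64(values, valid, allow_time_truncate=False):
    # Mirrors the patched C++ cast kernel: when a slot is valid and has a
    # non-zero intraday remainder, raise unless truncation is allowed;
    # either way subtract the remainder so date64 values land on midnight.
    # `valid` plays the role of the null bitmap (True = not null).
    out = []
    for value, is_valid in zip(values, valid):
        remainder = value % MILLISECONDS_IN_DAY
        if not allow_time_truncate and is_valid and remainder > 0:
            raise ValueError("Timestamp value had non-zero intraday milliseconds")
        out.append(value - remainder)
    return out
```

The branch discussed in the review corresponds to `null_count == 0`: when there are no nulls, the bitmap (`valid`) check can be dropped entirely and the loop avoids the BitmapReader altogether.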
[jira] [Updated] (ARROW-1964) [Python] Expose Builder classes
[ https://issues.apache.org/jira/browse/ARROW-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-1964: --- Description: Having the builder classes available from Python would be very helpful. Currently a construction of an Arrow array always need to have a Python list or numpy array as intermediate. As the builder in combination with jemalloc are very efficient in building up non-chunked memory, it would be nice to directly use them in certain cases. The most useful builders are the [StringBuilder|https://github.com/apache/arrow/blob/5030e235047bdffabf6a900dd39b64eeeb96bdc8/cpp/src/arrow/builder.h#L714] and [DictionaryBuilder|https://github.com/apache/arrow/blob/5030e235047bdffabf6a900dd39b64eeeb96bdc8/cpp/src/arrow/builder.h#L872] as they provide functionality to create columns that are not easily constructed using NumPy methods in Python. The basic approach would be to wrap the C++ classes in https://github.com/apache/arrow/blob/master/python/pyarrow/includes/libarrow.pxd so that they can be used from Cython. Afterwards, we should start a new file {{python/pyarrow/builder.pxi}} where we have classes take typical Python objects like {{str}} and pass them on to the C++ classes. At the end, these classes should also return (Python accessible) {{pyarrow.Array}} instances. was:Having the builder classes available from Python would be very helpful. Currently a construction of an Arrow array always need to have a Python list or numpy array as intermediate. As the builder in combination with jemalloc are very efficient in building up non-chunked memory, it would be nice to directly use them in certain cases. > [Python] Expose Builder classes > --- > > Key: ARROW-1964 > URL: https://issues.apache.org/jira/browse/ARROW-1964 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Uwe L. 
Korn >Priority: Major > Labels: beginner > Fix For: 1.0.0 > > > Having the builder classes available from Python would be very helpful. > Currently a construction of an Arrow array always need to have a Python list > or numpy array as intermediate. As the builder in combination with jemalloc > are very efficient in building up non-chunked memory, it would be nice to > directly use them in certain cases. > The most useful builders are the > [StringBuilder|https://github.com/apache/arrow/blob/5030e235047bdffabf6a900dd39b64eeeb96bdc8/cpp/src/arrow/builder.h#L714] > and > [DictionaryBuilder|https://github.com/apache/arrow/blob/5030e235047bdffabf6a900dd39b64eeeb96bdc8/cpp/src/arrow/builder.h#L872] > as they provide functionality to create columns that are not easily > constructed using NumPy methods in Python. > The basic approach would be to wrap the C++ classes in > https://github.com/apache/arrow/blob/master/python/pyarrow/includes/libarrow.pxd > so that they can be used from Cython. Afterwards, we should start a new file > {{python/pyarrow/builder.pxi}} where we have classes take typical Python > objects like {{str}} and pass them on to the C++ classes. At the end, these > classes should also return (Python accessible) {{pyarrow.Array}} instances. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
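A pure-Python mock of what the proposed wrapper could feel like. Every name and behavior here is illustrative only; the real binding would wrap the C++ StringBuilder through the Cython declarations in libarrow.pxd and return a pyarrow.Array, not a list.

```python
# Hypothetical sketch of a Python-facing StringBuilder; names and
# semantics are illustrative, not the actual pyarrow API.
class StringBuilder:
    def __init__(self):
        self._values = []

    def append(self, value):
        # Accept str or None (a null slot), mirroring the C++ builder's
        # Append/AppendNull pair.
        if value is not None and not isinstance(value, str):
            raise TypeError("expected str or None")
        self._values.append(value)

    def finish(self):
        # The real implementation would hand back a pyarrow.Array backed
        # by contiguous builder-owned buffers; a list stands in here.
        values, self._values = self._values, []
        return values

builder = StringBuilder()
builder.append("a")
builder.append(None)
arr = builder.finish()
```

The point of exposing the real builders, as the issue says, is exactly what this mock cannot show: values go straight into non-chunked Arrow memory without a Python list or NumPy array as an intermediate.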
[jira] [Created] (ARROW-2427) [C++] ReadAt implementations suboptimal
Antoine Pitrou created ARROW-2427: - Summary: [C++] ReadAt implementations suboptimal Key: ARROW-2427 URL: https://issues.apache.org/jira/browse/ARROW-2427 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.9.0 Reporter: Antoine Pitrou The {{ReadAt}} implementations for at least {{OSFile}} and {{MemoryMappedFile}} take the file lock and seek. They could instead read directly from the given offset, allowing concurrent I/O from multiple threads. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
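The improvement the issue suggests maps onto POSIX positional reads: pread takes an explicit offset and never touches the shared file position, so no lock or seek is needed and multiple threads can read concurrently from one descriptor. A minimal Python illustration (os.pread is POSIX-only, so this sketch assumes a Unix-like system):

```python
import os
import tempfile

# Positional read: no seek and no shared file-position state, which is
# what lets ReadAt-style APIs serve concurrent readers without a lock.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"0123456789")
    chunk = os.pread(fd, 4, 3)  # read 4 bytes starting at byte offset 3
finally:
    os.close(fd)
    os.unlink(path)
```

A seek-then-read sequence, by contrast, is two operations on shared state and must be serialized, which is the lock the issue wants to avoid.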
[jira] [Commented] (ARROW-2326) cannot import pip installed pyarrow on OS X (10.9)
[ https://issues.apache.org/jira/browse/ARROW-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430679#comment-16430679 ] Uwe L. Korn commented on ARROW-2326: Yes it is. > cannot import pip installed pyarrow on OS X (10.9) > -- > > Key: ARROW-2326 > URL: https://issues.apache.org/jira/browse/ARROW-2326 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: OS X (10.9), Python 3.6 >Reporter: Paul Ivanov >Priority: Major > Fix For: 0.10.0 > > > {code:java} > $ pip3 install pyarrow --user > Collecting pyarrow > Using cached pyarrow-0.8.0-cp36-cp36m-macosx_10_6_intel.whl > Requirement already satisfied: six>=1.0.0 in > ./Library/Python/3.6/lib/python/site-packages (from pyarrow) > Collecting numpy>=1.10 (from pyarrow) > Using cached > numpy-1.14.2-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl > Installing collected packages: numpy, pyarrow > Successfully installed numpy-1.14.2 pyarrow-0.8.0 > $ python3 > Python 3.6.1 (v3.6.1:69c0db5050, Mar 21 2017, 01:21:04) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin > Type "help", "copyright", "credits" or "license" for more information. > >>> import pyarrow > Traceback (most recent call last): > File "", line 1, in > File > "/Users/pi/Library/Python/3.6/lib/python/site-packages/pyarrow/__init__.py", > line 32, in > from pyarrow.lib import cpu_count, set_cpu_count > ImportError: > dlopen(/Users/pi/Library/Python/3.6/lib/python/site-packages/pyarrow/lib.cpython-36m-darwin.so, > 2): Library not loaded: @rpath/libarrow.0.dylib > Referenced from: > /Users/pi/Library/Python/3.6/lib/python/site-packages/pyarrow/lib.cpython-36m-darwin.so > Reason: image not found > {code} > I dug into it a bit and found that in older versions of install.rst, Wes > mentioned that XCode 6 had trouble with rpath, so not sure if that's what's > going on here for me. 
I'm on 10.9, I know it's really old, so if these wheels > can't be made to run on my ancient OS, I just wanted to report this so the > wheels uploaded to PyPI can reflect this incompatibility, if that is indeed > the case. I might also try some otool / install_name_tool tomfoolery to see > if I can get a workaround for myself. > Thank you! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2326) cannot import pip installed pyarrow on OS X (10.9)
[ https://issues.apache.org/jira/browse/ARROW-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430674#comment-16430674 ] Phillip Cloud commented on ARROW-2326: -- [~xhochy] Is this fixed? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-564) [Python] Add methods to return vanilla NumPy arrays (plus boolean mask array if there are nulls)
[ https://issues.apache.org/jira/browse/ARROW-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-564: -- Labels: beginner (was: ) > [Python] Add methods to return vanilla NumPy arrays (plus boolean mask array > if there are nulls) > > > Key: ARROW-564 > URL: https://issues.apache.org/jira/browse/ARROW-564 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: beginner > Fix For: 1.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1964) [Python] Expose Builder classes
[ https://issues.apache.org/jira/browse/ARROW-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-1964: --- Labels: beginner (was: ) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1964) [Python] Expose Builder classes
[ https://issues.apache.org/jira/browse/ARROW-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-1964: --- Summary: [Python] Expose Builder classes (was: Python: Expose Builder classes) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2369) Large (>~20 GB) files written to Parquet via PyArrow are corrupted
[ https://issues.apache.org/jira/browse/ARROW-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430650#comment-16430650 ] ASF GitHub Bot commented on ARROW-2369: --- pitrou opened a new pull request #1866: ARROW-2369: [Python] Fix reading large Parquet files (> 4 GB) URL: https://github.com/apache/arrow/pull/1866 - Fix PythonFile.seek() for offsets > 4 GB - Avoid instantiating a PythonFile in ParquetFile, for efficiency This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Large (>~20 GB) files written to Parquet via PyArrow are corrupted > -- > > Key: ARROW-2369 > URL: https://issues.apache.org/jira/browse/ARROW-2369 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 > Environment: Reproduced on Ubuntu + Mac OSX >Reporter: Justin Tan >Assignee: Antoine Pitrou >Priority: Major > Labels: Parquet, bug, pandas, parquetWriter, > pull-request-available, pyarrow > Fix For: 0.10.0 > > Attachments: Screen Shot 2018-03-30 at 11.54.01 pm.png > > > When writing large Parquet files (above 10 GB or so) from Pandas to Parquet > via the command > {{pq.write_table(my_df, 'table.parquet')}} > The write succeeds, but when the parquet file is loaded, the error message > {{ArrowIOError: Invalid parquet file. Corrupt footer.}} > appears. This same error occurs when the parquet file is written chunkwise as > well. When the parquet files are small, say < 5 GB or so (drawn randomly from > the same dataset), everything proceeds as normal. I've also tried this with > Pandas df.to_parquet(), with the same results. > Update: Looks like any DataFrame with size above ~5GB (on disk) returns the > same error. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
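The seek fix referenced in the PR above addresses classic 32-bit offset truncation. A sketch of the failure mode (illustrative arithmetic, not the actual PyArrow code): a byte offset past 4 GiB does not survive a round-trip through a 32-bit integer, which is how a footer read in a large Parquet file can land at the wrong position.

```python
# Illustrative only: how a >4 GiB file offset is mangled when squeezed
# through a signed 32-bit integer, the class of bug behind seek failures
# on large files.
def to_int32(n):
    n &= 0xFFFFFFFF                            # keep only the low 32 bits
    return n - 0x1_0000_0000 if n >= 0x8000_0000 else n

footer_offset = 5 * 1024**3                    # an offset inside a 5 GiB file
truncated = to_int32(footer_offset)            # wraps to a much smaller offset
```

Keeping offsets in 64-bit integers end to end, as the fix does, makes the round-trip lossless for any realistic file size.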
[jira] [Commented] (ARROW-2391) [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
[ https://issues.apache.org/jira/browse/ARROW-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430645#comment-16430645 ] ASF GitHub Bot commented on ARROW-2391: --- kszucs commented on a change in pull request #1859: ARROW-2391: [C++/Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64 URL: https://github.com/apache/arrow/pull/1859#discussion_r180122730 ## File path: cpp/src/arrow/compute/kernels/cast.cc ## @@ -396,21 +396,34 @@ struct CastFunctor{ ShiftTime (ctx, options, conversion.first, conversion.second, input, output); -internal::BitmapReader bit_reader(input.buffers[0]->data(), input.offset, - input.length); +if (input.null_count != 0) { + internal::BitmapReader bit_reader(input.buffers[0]->data(), input.offset, +input.length); -// Ensure that intraday milliseconds have been zeroed out -auto out_data = GetMutableValues(output, 1); -for (int64_t i = 0; i < input.length; ++i) { - const int64_t remainder = out_data[i] % kMillisecondsInDay; - if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && bit_reader.IsSet() && - remainder > 0)) { -ctx->SetStatus( -Status::Invalid("Timestamp value had non-zero intraday milliseconds")); -break; + // Ensure that intraday milliseconds have been zeroed out + auto out_data = GetMutableValues(output, 1); + for (int64_t i = 0; i < input.length; ++i) { +const int64_t remainder = out_data[i] % kMillisecondsInDay; +if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && bit_reader.IsSet() && +remainder > 0)) { + ctx->SetStatus( + Status::Invalid("Timestamp value had non-zero intraday milliseconds")); + break; +} +out_data[i] -= remainder; +bit_reader.Next(); + } +} else { + auto out_data = GetMutableValues(output, 1); + for (int64_t i = 0; i < input.length; ++i) { +const int64_t remainder = out_data[i] % kMillisecondsInDay; +if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && remainder > 0)) { Review comment: Sure, but don't we need another branch then to 
handle when time truncation is allowed? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2424) [Rust] Missing import causing broken build
[ https://issues.apache.org/jira/browse/ARROW-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430631#comment-16430631 ] ASF GitHub Bot commented on ARROW-2424: --- andygrove commented on issue #1864: ARROW-2424: [Rust] Fix build - add missing import URL: https://github.com/apache/arrow/pull/1864#issuecomment-379777861 @pitrou I updated it as requested This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2424) [Rust] Missing import causing broken build
[ https://issues.apache.org/jira/browse/ARROW-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2424: -- Labels: pull-request-available (was: ) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2426) [CI] glib build failure
[ https://issues.apache.org/jira/browse/ARROW-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430630#comment-16430630 ] Antoine Pitrou commented on ARROW-2426: --- [~kou] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2426) [CI] glib build failure
Antoine Pitrou created ARROW-2426: - Summary: [CI] glib build failure Key: ARROW-2426 URL: https://issues.apache.org/jira/browse/ARROW-2426 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration Reporter: Antoine Pitrou The glib build on Travis-CI fails: [https://travis-ci.org/apache/arrow/jobs/364123364#L6840] {code} ==> Installing gobject-introspection ==> Downloading https://homebrew.bintray.com/bottles/gobject-introspection-1.56.0_1.sierra.bottle.tar.gz ==> Pouring gobject-introspection-1.56.0_1.sierra.bottle.tar.gz /usr/local/Cellar/gobject-introspection/1.56.0_1: 173 files, 9.8MB Installing gobject-introspection has failed! {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2422) Support more filter operators on Hive partitioned Parquet files
[ https://issues.apache.org/jira/browse/ARROW-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430628#comment-16430628 ] ASF GitHub Bot commented on ARROW-2422: --- xhochy commented on issue #1861: ARROW-2422 Support more operators for partition filtering URL: https://github.com/apache/arrow/pull/1861#issuecomment-379777120 Can you add unit tests for more than just integer as a type? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Support more filter operators on Hive partitioned Parquet files > --- > > Key: ARROW-2422 > URL: https://issues.apache.org/jira/browse/ARROW-2422 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Julius Neuffer >Priority: Minor > Labels: features, pull-request-available > > After implementing basic filters ('=', '!=') on Hive partitioned Parquet > files (ARROW-2401), I'll extend them ('>', '<', '<=', '>=') with a new PR on > Github. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2391) [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
[ https://issues.apache.org/jira/browse/ARROW-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430623#comment-16430623 ] ASF GitHub Bot commented on ARROW-2391: --- pitrou commented on a change in pull request #1859: ARROW-2391: [C++/Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64 URL: https://github.com/apache/arrow/pull/1859#discussion_r180118162 ## File path: cpp/src/arrow/compute/kernels/cast.cc ## @@ -396,21 +396,34 @@ struct CastFunctor{ ShiftTime (ctx, options, conversion.first, conversion.second, input, output); -internal::BitmapReader bit_reader(input.buffers[0]->data(), input.offset, - input.length); +if (input.null_count != 0) { + internal::BitmapReader bit_reader(input.buffers[0]->data(), input.offset, +input.length); -// Ensure that intraday milliseconds have been zeroed out -auto out_data = GetMutableValues(output, 1); -for (int64_t i = 0; i < input.length; ++i) { - const int64_t remainder = out_data[i] % kMillisecondsInDay; - if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && bit_reader.IsSet() && - remainder > 0)) { -ctx->SetStatus( -Status::Invalid("Timestamp value had non-zero intraday milliseconds")); -break; + // Ensure that intraday milliseconds have been zeroed out + auto out_data = GetMutableValues(output, 1); + for (int64_t i = 0; i < input.length; ++i) { +const int64_t remainder = out_data[i] % kMillisecondsInDay; +if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && bit_reader.IsSet() && +remainder > 0)) { + ctx->SetStatus( + Status::Invalid("Timestamp value had non-zero intraday milliseconds")); + break; +} +out_data[i] -= remainder; +bit_reader.Next(); + } +} else { + auto out_data = GetMutableValues(output, 1); + for (int64_t i = 0; i < input.length; ++i) { +const int64_t remainder = out_data[i] % kMillisecondsInDay; +if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && remainder > 0)) { Review comment: What I'm suggesting is: ```cpp if 
(!options.allow_time_truncate) {
  // Ensure that intraday milliseconds have been zeroed out
  auto out_data = GetMutableValues(output, 1);
  if (input.null_count != 0) {
    internal::BitmapReader bit_reader(input.buffers[0]->data(), input.offset,
                                      input.length);
    for (int64_t i = 0; i < input.length; ++i) {
      const int64_t remainder = out_data[i] % kMillisecondsInDay;
      if (ARROW_PREDICT_FALSE(remainder > 0 && bit_reader.IsSet())) {
        ctx->SetStatus(
            Status::Invalid("Timestamp value had non-zero intraday milliseconds"));
        break;
      }
      out_data[i] -= remainder;
      bit_reader.Next();
    }
  } else {
    for (int64_t i = 0; i < input.length; ++i) {
      const int64_t remainder = out_data[i] % kMillisecondsInDay;
      if (ARROW_PREDICT_FALSE(remainder > 0)) {
        ctx->SetStatus(
            Status::Invalid("Timestamp value had non-zero intraday milliseconds"));
        break;
      }
      out_data[i] -= remainder;
    }
  }
}
```
Does it make sense? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Segmentation fault from PyArrow when mapping Pandas datetime column > to pyarrow.date64 > -- > > Key: ARROW-2391 > URL: https://issues.apache.org/jira/browse/ARROW-2391 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 > Environment: Mac OS High Sierra > Python 3.6 >Reporter: Dave Challis >Priority: Major > Labels: pull-request-available > > When trying to call `pyarrow.Table.from_pandas` with a `pandas.DataFrame` and > a `pyarrow.Schema` provided, the function call results in a segmentation > fault if Pandas `datetime64[ns]` column tries to be converted to a > `pyarrow.date64` type. > A minimal example which shows this is: > {code:python} > import pandas as pd >
[jira] [Commented] (ARROW-2391) [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
[ https://issues.apache.org/jira/browse/ARROW-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430574#comment-16430574 ] ASF GitHub Bot commented on ARROW-2391: --- kszucs commented on a change in pull request #1859: ARROW-2391: [C++/Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64 URL: https://github.com/apache/arrow/pull/1859#discussion_r180106203 ## File path: cpp/src/arrow/compute/kernels/cast.cc ## @@ -396,21 +396,34 @@ struct CastFunctor{ ShiftTime (ctx, options, conversion.first, conversion.second, input, output); -internal::BitmapReader bit_reader(input.buffers[0]->data(), input.offset, - input.length); +if (input.null_count != 0) { + internal::BitmapReader bit_reader(input.buffers[0]->data(), input.offset, +input.length); -// Ensure that intraday milliseconds have been zeroed out -auto out_data = GetMutableValues(output, 1); -for (int64_t i = 0; i < input.length; ++i) { - const int64_t remainder = out_data[i] % kMillisecondsInDay; - if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && bit_reader.IsSet() && - remainder > 0)) { -ctx->SetStatus( -Status::Invalid("Timestamp value had non-zero intraday milliseconds")); -break; + // Ensure that intraday milliseconds have been zeroed out + auto out_data = GetMutableValues(output, 1); + for (int64_t i = 0; i < input.length; ++i) { +const int64_t remainder = out_data[i] % kMillisecondsInDay; +if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && bit_reader.IsSet() && +remainder > 0)) { + ctx->SetStatus( + Status::Invalid("Timestamp value had non-zero intraday milliseconds")); + break; +} +out_data[i] -= remainder; +bit_reader.Next(); + } +} else { + auto out_data = GetMutableValues(output, 1); + for (int64_t i = 0; i < input.length; ++i) { +const int64_t remainder = out_data[i] % kMillisecondsInDay; +if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && remainder > 0)) { Review comment: I might misunderstand, but: ```python # with 
allow_time_truncate
[
    '2018-05-10T00:00:00',
    '2018-05-11T00:00:00',
    '2018-05-12T10:24:01',
]  # OK

# without allow_time_truncate
[
    '2018-05-10T00:00:00',
    '2018-05-11T00:00:00',
    '2018-05-12T10:24:01',  # <- fails here
]

# with allow_time_truncate
[
    '2018-05-10T00:00:00',
    '2018-05-11T00:00:00',
    '2018-05-12T00:00:00',
]  # OK

# without allow_time_truncate
[
    '2018-05-10T00:00:00',
    '2018-05-11T00:00:00',
    '2018-05-12T00:00:00',
]  # OK - this would fail if I test outside the loop
```
This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Segmentation fault from PyArrow when mapping Pandas datetime column > to pyarrow.date64 > -- > > Key: ARROW-2391 > URL: https://issues.apache.org/jira/browse/ARROW-2391 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 > Environment: Mac OS High Sierra > Python 3.6 >Reporter: Dave Challis >Priority: Major > Labels: pull-request-available > > When trying to call `pyarrow.Table.from_pandas` with a `pandas.DataFrame` and > a `pyarrow.Schema` provided, the function call results in a segmentation > fault if Pandas `datetime64[ns]` column tries to be converted to a > `pyarrow.date64` type. > A minimal example which shows this is: > {code:python} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame({'created': ['2018-05-10T10:24:01']}) > df['created'] = pd.to_datetime(df['created'])}} > schema = pa.schema([pa.field('created', pa.date64())]) > pa.Table.from_pandas(df, schema=schema) > {code} > Executing the above causes the python interpreter to exit with "Segmentation > fault: 11". > Attempting to convert into various other datatypes (by specifying different > schemas) either succeeds, or raises an exception if the conversion is invalid. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2369) Large (>~20 GB) files written to Parquet via PyArrow are corrupted
[ https://issues.apache.org/jira/browse/ARROW-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430572#comment-16430572 ] Antoine Pitrou commented on ARROW-2369: --- Ok, there are two things going on: * when {{write_table()}} is called with a filepath string, it goes through {{PythonFile}}, which is probably inefficient * {{PythonFile.Seek}} doesn't handle seek offsets greater than 2**32 properly: {code:python} >>> f = open('/tmp/empty', 'wb') >>> f.truncate(1<<33 + 10) 8796093022208 >>> f.close() >>> f = open('/tmp/empty', 'rb') >>> paf = pa.PythonFile(f, 'rb') >>> paf.tell() 0 >>> paf.seek(5) 5 >>> paf.tell() 5 >>> paf.seek(1<<33 + 6) 0 >>> paf.tell() 0 >>> f.seek(1<<33 + 6) 549755813888 >>> f.tell() 549755813888 {code} > Large (>~20 GB) files written to Parquet via PyArrow are corrupted > -- > > Key: ARROW-2369 > URL: https://issues.apache.org/jira/browse/ARROW-2369 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 > Environment: Reproduced on Ubuntu + Mac OSX >Reporter: Justin Tan >Assignee: Antoine Pitrou >Priority: Major > Labels: Parquet, bug, pandas, parquetWriter, pyarrow > Fix For: 0.10.0 > > Attachments: Screen Shot 2018-03-30 at 11.54.01 pm.png > > > When writing large Parquet files (above 10 GB or so) from Pandas to Parquet > via the command > {{pq.write_table(my_df, 'table.parquet')}} > The write succeeds, but when the parquet file is loaded, the error message > {{ArrowIOError: Invalid parquet file. Corrupt footer.}} > appears. This same error occurs when the parquet file is written chunkwise as > well. When the parquet files are small, say < 5 GB or so (drawn randomly from > the same dataset), everything proceeds as normal. I've also tried this with > Pandas df.to_parquet(), with the same results. > Update: Looks like any DataFrame with size above ~5GB (on disk) returns the > same error. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
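The truncated seek results in the transcript above are consistent with the offset being narrowed to 32 bits somewhere between the Python call and the underlying file. The sketch below is a hypothesis, not the actual {{PythonFile}} implementation; note also that `1<<33 + 6` parses as `1 << (33 + 6)` in Python, which is why the expected value is 549755813888.

```python
# Hypothetical illustration of the suspected failure mode: a seek offset
# narrowed to 32 bits silently collapses for large files.
def seek_truncated_32bit(offset):
    return offset & 0xFFFFFFFF  # keep only the low 32 bits

# 1 << 33 + 6 parses as 1 << (33 + 6) == 2**39 due to operator precedence
offset = 1 << 33 + 6
print(offset)                        # 549755813888, what f.seek() reports
print(seek_truncated_32bit(offset))  # 0, what paf.seek() reports
```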
[jira] [Commented] (ARROW-2369) Large (>~20 GB) files written to Parquet via PyArrow are corrupted
[ https://issues.apache.org/jira/browse/ARROW-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430571#comment-16430571 ] Justin Tan commented on ARROW-2369: --- Looks like the file is readable by early pyarrow versions (0.5.0 - but created by v0.5.0 as well), so maybe something went wrong from 0.5.0 -> 0.9.0 > Large (>~20 GB) files written to Parquet via PyArrow are corrupted > -- > > Key: ARROW-2369 > URL: https://issues.apache.org/jira/browse/ARROW-2369 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 > Environment: Reproduced on Ubuntu + Mac OSX >Reporter: Justin Tan >Assignee: Antoine Pitrou >Priority: Major > Labels: Parquet, bug, pandas, parquetWriter, pyarrow > Fix For: 0.10.0 > > Attachments: Screen Shot 2018-03-30 at 11.54.01 pm.png > > > When writing large Parquet files (above 10 GB or so) from Pandas to Parquet > via the command > {{pq.write_table(my_df, 'table.parquet')}} > The write succeeds, but when the parquet file is loaded, the error message > {{ArrowIOError: Invalid parquet file. Corrupt footer.}} > appears. This same error occurs when the parquet file is written chunkwise as > well. When the parquet files are small, say < 5 GB or so (drawn randomly from > the same dataset), everything proceeds as normal. I've also tried this with > Pandas df.to_parquet(), with the same results. > Update: Looks like any DataFrame with size above ~5GB (on disk) returns the > same error. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2425) [Rust] Array::from missing mapping for u8 type
[ https://issues.apache.org/jira/browse/ARROW-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2425: -- Labels: pull-request-available (was: ) > [Rust] Array::from missing mapping for u8 type > -- > > Key: ARROW-2425 > URL: https://issues.apache.org/jira/browse/ARROW-2425 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Macros are used to support Array::from for each primitive type but u8 was > missing -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2425) [Rust] Array::from missing mapping for u8 type
Andy Grove created ARROW-2425: - Summary: [Rust] Array::from missing mapping for u8 type Key: ARROW-2425 URL: https://issues.apache.org/jira/browse/ARROW-2425 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andy Grove Assignee: Andy Grove Fix For: 0.10.0 Macros are used to support Array::from for each primitive type but u8 was missing -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2369) Large (>~20 GB) files written to Parquet via PyArrow are corrupted
[ https://issues.apache.org/jira/browse/ARROW-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-2369: - Assignee: Antoine Pitrou > Large (>~20 GB) files written to Parquet via PyArrow are corrupted > -- > > Key: ARROW-2369 > URL: https://issues.apache.org/jira/browse/ARROW-2369 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 > Environment: Reproduced on Ubuntu + Mac OSX >Reporter: Justin Tan >Assignee: Antoine Pitrou >Priority: Major > Labels: Parquet, bug, pandas, parquetWriter, pyarrow > Fix For: 0.10.0 > > Attachments: Screen Shot 2018-03-30 at 11.54.01 pm.png > > > When writing large Parquet files (above 10 GB or so) from Pandas to Parquet > via the command > {{pq.write_table(my_df, 'table.parquet')}} > The write succeeds, but when the parquet file is loaded, the error message > {{ArrowIOError: Invalid parquet file. Corrupt footer.}} > appears. This same error occurs when the parquet file is written chunkwise as > well. When the parquet files are small, say < 5 GB or so (drawn randomly from > the same dataset), everything proceeds as normal. I've also tried this with > Pandas df.to_parquet(), with the same results. > Update: Looks like any DataFrame with size above ~5GB (on disk) returns the > same error. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2391) [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
[ https://issues.apache.org/jira/browse/ARROW-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430566#comment-16430566 ] ASF GitHub Bot commented on ARROW-2391: --- pitrou commented on a change in pull request #1859: ARROW-2391: [C++/Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64 URL: https://github.com/apache/arrow/pull/1859#discussion_r180103071 ## File path: cpp/src/arrow/compute/kernels/cast.cc ## @@ -396,21 +396,34 @@ struct CastFunctor{ ShiftTime (ctx, options, conversion.first, conversion.second, input, output); -internal::BitmapReader bit_reader(input.buffers[0]->data(), input.offset, - input.length); +if (input.null_count != 0) { + internal::BitmapReader bit_reader(input.buffers[0]->data(), input.offset, +input.length); -// Ensure that intraday milliseconds have been zeroed out -auto out_data = GetMutableValues(output, 1); -for (int64_t i = 0; i < input.length; ++i) { - const int64_t remainder = out_data[i] % kMillisecondsInDay; - if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && bit_reader.IsSet() && - remainder > 0)) { -ctx->SetStatus( -Status::Invalid("Timestamp value had non-zero intraday milliseconds")); -break; + // Ensure that intraday milliseconds have been zeroed out + auto out_data = GetMutableValues(output, 1); + for (int64_t i = 0; i < input.length; ++i) { +const int64_t remainder = out_data[i] % kMillisecondsInDay; +if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && bit_reader.IsSet() && +remainder > 0)) { + ctx->SetStatus( + Status::Invalid("Timestamp value had non-zero intraday milliseconds")); + break; +} +out_data[i] -= remainder; +bit_reader.Next(); + } +} else { + auto out_data = GetMutableValues(output, 1); + for (int64_t i = 0; i < input.length; ++i) { +const int64_t remainder = out_data[i] % kMillisecondsInDay; +if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && remainder > 0)) { Review comment: What I mean is that you can skip the whole thing 
if `options.allow_time_truncate` is true (the compiler might do the optimization for us, but still). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Segmentation fault from PyArrow when mapping Pandas datetime column > to pyarrow.date64 > -- > > Key: ARROW-2391 > URL: https://issues.apache.org/jira/browse/ARROW-2391 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 > Environment: Mac OS High Sierra > Python 3.6 >Reporter: Dave Challis >Priority: Major > Labels: pull-request-available > > When trying to call `pyarrow.Table.from_pandas` with a `pandas.DataFrame` and > a `pyarrow.Schema` provided, the function call results in a segmentation > fault if Pandas `datetime64[ns]` column tries to be converted to a > `pyarrow.date64` type. > A minimal example which shows this is: > {code:python} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame({'created': ['2018-05-10T10:24:01']}) > df['created'] = pd.to_datetime(df['created'])}} > schema = pa.schema([pa.field('created', pa.date64())]) > pa.Table.from_pandas(df, schema=schema) > {code} > Executing the above causes the python interpreter to exit with "Segmentation > fault: 11". > Attempting to convert into various other datatypes (by specifying different > schemas) either succeeds, or raises an exception if the conversion is invalid. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2391) [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
[ https://issues.apache.org/jira/browse/ARROW-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430564#comment-16430564 ] ASF GitHub Bot commented on ARROW-2391: --- kszucs commented on a change in pull request #1859: ARROW-2391: [C++/Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64 URL: https://github.com/apache/arrow/pull/1859#discussion_r180102499 ## File path: cpp/src/arrow/compute/kernels/cast.cc ## @@ -396,21 +396,34 @@ struct CastFunctor{ ShiftTime (ctx, options, conversion.first, conversion.second, input, output); -internal::BitmapReader bit_reader(input.buffers[0]->data(), input.offset, - input.length); +if (input.null_count != 0) { + internal::BitmapReader bit_reader(input.buffers[0]->data(), input.offset, +input.length); -// Ensure that intraday milliseconds have been zeroed out -auto out_data = GetMutableValues(output, 1); -for (int64_t i = 0; i < input.length; ++i) { - const int64_t remainder = out_data[i] % kMillisecondsInDay; - if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && bit_reader.IsSet() && - remainder > 0)) { -ctx->SetStatus( -Status::Invalid("Timestamp value had non-zero intraday milliseconds")); -break; + // Ensure that intraday milliseconds have been zeroed out + auto out_data = GetMutableValues(output, 1); + for (int64_t i = 0; i < input.length; ++i) { +const int64_t remainder = out_data[i] % kMillisecondsInDay; +if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && bit_reader.IsSet() && +remainder > 0)) { + ctx->SetStatus( + Status::Invalid("Timestamp value had non-zero intraday milliseconds")); + break; +} +out_data[i] -= remainder; +bit_reader.Next(); + } +} else { + auto out_data = GetMutableValues(output, 1); + for (int64_t i = 0; i < input.length; ++i) { +const int64_t remainder = out_data[i] % kMillisecondsInDay; +if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && remainder > 0)) { Review comment: Doesn't the first value encountered with time part 
trigger the error - which has to be checked inside the loop? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Segmentation fault from PyArrow when mapping Pandas datetime column > to pyarrow.date64 > -- > > Key: ARROW-2391 > URL: https://issues.apache.org/jira/browse/ARROW-2391 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 > Environment: Mac OS High Sierra > Python 3.6 >Reporter: Dave Challis >Priority: Major > Labels: pull-request-available > > When trying to call `pyarrow.Table.from_pandas` with a `pandas.DataFrame` and > a `pyarrow.Schema` provided, the function call results in a segmentation > fault if Pandas `datetime64[ns]` column tries to be converted to a > `pyarrow.date64` type. > A minimal example which shows this is: > {code:python} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame({'created': ['2018-05-10T10:24:01']}) > df['created'] = pd.to_datetime(df['created'])}} > schema = pa.schema([pa.field('created', pa.date64())]) > pa.Table.from_pandas(df, schema=schema) > {code} > Executing the above causes the python interpreter to exit with "Segmentation > fault: 11". > Attempting to convert into various other datatypes (by specifying different > schemas) either succeeds, or raises an exception if the conversion is invalid. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2391) [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
[ https://issues.apache.org/jira/browse/ARROW-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430548#comment-16430548 ] ASF GitHub Bot commented on ARROW-2391: --- pitrou commented on a change in pull request #1859: ARROW-2391: [C++/Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64 URL: https://github.com/apache/arrow/pull/1859#discussion_r180100389 ## File path: cpp/src/arrow/compute/kernels/cast.cc ## @@ -396,21 +396,34 @@ struct CastFunctor{ ShiftTime (ctx, options, conversion.first, conversion.second, input, output); -internal::BitmapReader bit_reader(input.buffers[0]->data(), input.offset, - input.length); +if (input.null_count != 0) { + internal::BitmapReader bit_reader(input.buffers[0]->data(), input.offset, +input.length); -// Ensure that intraday milliseconds have been zeroed out -auto out_data = GetMutableValues(output, 1); -for (int64_t i = 0; i < input.length; ++i) { - const int64_t remainder = out_data[i] % kMillisecondsInDay; - if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && bit_reader.IsSet() && - remainder > 0)) { -ctx->SetStatus( -Status::Invalid("Timestamp value had non-zero intraday milliseconds")); -break; + // Ensure that intraday milliseconds have been zeroed out + auto out_data = GetMutableValues(output, 1); + for (int64_t i = 0; i < input.length; ++i) { +const int64_t remainder = out_data[i] % kMillisecondsInDay; +if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && bit_reader.IsSet() && +remainder > 0)) { + ctx->SetStatus( + Status::Invalid("Timestamp value had non-zero intraday milliseconds")); + break; +} +out_data[i] -= remainder; +bit_reader.Next(); + } +} else { + auto out_data = GetMutableValues(output, 1); + for (int64_t i = 0; i < input.length; ++i) { +const int64_t remainder = out_data[i] % kMillisecondsInDay; +if (ARROW_PREDICT_FALSE(!options.allow_time_truncate && remainder > 0)) { Review comment: `options.allow_time_truncate` is a constant 
across this whole piece of code, so just add a higher-level `if` statement around all this. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Segmentation fault from PyArrow when mapping Pandas datetime column > to pyarrow.date64 > -- > > Key: ARROW-2391 > URL: https://issues.apache.org/jira/browse/ARROW-2391 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 > Environment: Mac OS High Sierra > Python 3.6 >Reporter: Dave Challis >Priority: Major > Labels: pull-request-available > > When trying to call `pyarrow.Table.from_pandas` with a `pandas.DataFrame` and > a `pyarrow.Schema` provided, the function call results in a segmentation > fault if Pandas `datetime64[ns]` column tries to be converted to a > `pyarrow.date64` type. > A minimal example which shows this is: > {code:python} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame({'created': ['2018-05-10T10:24:01']}) > df['created'] = pd.to_datetime(df['created'])}} > schema = pa.schema([pa.field('created', pa.date64())]) > pa.Table.from_pandas(df, schema=schema) > {code} > Executing the above causes the python interpreter to exit with "Segmentation > fault: 11". > Attempting to convert into various other datatypes (by specifying different > schemas) either succeeds, or raises an exception if the conversion is invalid. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2424) [Rust] Missing import causing broken build
[ https://issues.apache.org/jira/browse/ARROW-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-2424: -- Component/s: Rust Summary: [Rust] Missing import causing broken build (was: Missing import causing broken build) > [Rust] Missing import causing broken build > -- > > Key: ARROW-2424 > URL: https://issues.apache.org/jira/browse/ARROW-2424 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 0.10.0 > > > Recent merges broke the build. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2424) Missing import causing broken build
Andy Grove created ARROW-2424: - Summary: Missing import causing broken build Key: ARROW-2424 URL: https://issues.apache.org/jira/browse/ARROW-2424 Project: Apache Arrow Issue Type: Bug Reporter: Andy Grove Assignee: Andy Grove Fix For: 0.10.0 Recent merges broke the build. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2424) Missing import causing broken build
[ https://issues.apache.org/jira/browse/ARROW-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430543#comment-16430543 ] Andy Grove commented on ARROW-2424: --- PR: https://github.com/apache/arrow/pull/1864 > Missing import causing broken build > --- > > Key: ARROW-2424 > URL: https://issues.apache.org/jira/browse/ARROW-2424 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 0.10.0 > > > Recent merges broke the build. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2423) [Python] PyArrow datatypes raise ValueError on equality checks against non-PyArrow objects
Dave Challis created ARROW-2423: --- Summary: [Python] PyArrow datatypes raise ValueError on equality checks against non-PyArrow objects Key: ARROW-2423 URL: https://issues.apache.org/jira/browse/ARROW-2423 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.9.0 Environment: Mac OS High Sierra PyArrow 0.9.0 (py36_1) Python 3.6.3 Reporter: Dave Challis Checking a PyArrow datatype object for equality with non-PyArrow datatypes causes a `ValueError` to be raised, rather than either returning a True/False value, or returning [NotImplemented|https://docs.python.org/3/library/constants.html#NotImplemented] if the comparison isn't implemented. E.g. attempting to call: {code:java} import pyarrow pyarrow.int32() == 'foo' {code} results in: {code:java} Traceback (most recent call last): File "types.pxi", line 1221, in pyarrow.lib.type_for_alias KeyError: 'foo' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "t.py", line 2, in pyarrow.int32() == 'foo' File "types.pxi", line 90, in pyarrow.lib.DataType.__richcmp__ File "types.pxi", line 113, in pyarrow.lib.DataType.equals File "types.pxi", line 1223, in pyarrow.lib.type_for_alias ValueError: No type alias for foo {code} The expected outcome for the above would be for the comparison to return `False`, as that's the general behaviour for comparisons between objects of different types (e.g. `1 == 'foo'` or `object() == 12.4` both return `False`). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
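The conventional fix for this class of bug is for `__eq__` to catch the failed lookup and return `NotImplemented`, letting Python fall back to the default comparison (which yields `False`). A simplified sketch, assuming illustrative stand-in names (`DataType`, `equals` here are not the real Cython implementation):

```python
# Simplified stand-in for the Cython DataType class; illustrative only.
class DataType:
    def __init__(self, name):
        self.name = name

    def equals(self, other):
        # Mirrors the reported behaviour: a non-DataType operand raises.
        if not isinstance(other, DataType):
            raise ValueError("No type alias for %r" % (other,))
        return self.name == other.name

    def __eq__(self, other):
        # The fix: turn the failed conversion into NotImplemented so that
        # Python falls back to the default comparison instead of raising.
        try:
            return self.equals(other)
        except (TypeError, ValueError):
            return NotImplemented

print(DataType('int32') == 'foo')              # False instead of raising
print(DataType('int32') == DataType('int32'))  # True
```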
[jira] [Commented] (ARROW-2328) Writing a slice with feather ignores the offset
[ https://issues.apache.org/jira/browse/ARROW-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430523#comment-16430523 ] ASF GitHub Bot commented on ARROW-2328: --- Adriandorr commented on a change in pull request #1784: ARROW-2328: [C++] Fixed and unit tested feather writing with slice URL: https://github.com/apache/arrow/pull/1784#discussion_r180093019 ## File path: cpp/src/arrow/ipc/test-common.h ## @@ -223,15 +223,17 @@ Status MakeRandomBinaryArray(int64_t length, bool include_nulls, MemoryPool* poo if (include_nulls && values_index == 0) { RETURN_NOT_OK(builder.AppendNull()); } else { - const std::string& value = values[values_index]; + const std::string value = + i < int64_t(values.size()) ? values[values_index] : std::to_string(i); Review comment: Not knowing the history of this particular function, I don't know what would be "better" in there. For my test I pretty much just want the consecutive numbers, otherwise it is very difficult to see what has gone wrong (I like the tests to give me the answer to that question if possible). I made a change to add a second function that implements that, but turns out this function is only used from MakeStringTypesRecordBatch, so we can't really have two implementations. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Writing a slice with feather ignores the offset > --- > > Key: ARROW-2328 > URL: https://issues.apache.org/jira/browse/ARROW-2328 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.8.0 >Reporter: Adrian >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > Writing a slice from row n of length m of an array to feather would write the > first m rows, instead of the rows starting at n. > The null bitmap also ends up misaligned. Also tested and fixed in the pull > request below. > I've created a pull request with tests and fix here: > [Pullrequest#1766|https://github.com/apache/arrow/pull/1766] > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2422) Support more filter operators on Hive partitioned Parquet files
[ https://issues.apache.org/jira/browse/ARROW-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430493#comment-16430493 ] ASF GitHub Bot commented on ARROW-2422: --- jneuff commented on issue #1861: ARROW-2422 Support more operators for partition filtering URL: https://github.com/apache/arrow/pull/1861#issuecomment-379742811 @pacman82 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Support more filter operators on Hive partitioned Parquet files > --- > > Key: ARROW-2422 > URL: https://issues.apache.org/jira/browse/ARROW-2422 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Julius Neuffer >Priority: Minor > Labels: features, pull-request-available > > After implementing basic filters ('=', '!=') on Hive partitioned Parquet > files (ARROW-2401), I'll extend them ('>', '<', '<=', '>=') with a new PR on > Github. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
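Partition filtering of this kind reduces to mapping each operator token onto a comparison function and testing it against the Hive partition key's value; a rough Python sketch of the idea (hypothetical helper names, not pyarrow's actual ParquetDataset code):

```python
import operator

# '=' and '!=' are the basic filters from ARROW-2401; the ordering
# operators are the extension described in this issue.
OPS = {
    '=': operator.eq, '!=': operator.ne,
    '>': operator.gt, '<': operator.lt,
    '>=': operator.ge, '<=': operator.le,
}

def keep_partition(partition_value, filters):
    """Return True if a partition key value passes all (op, operand) filters."""
    return all(OPS[op](partition_value, operand) for op, operand in filters)

# e.g. a Hive layout .../year=2016/... with filters year > 2015, year <= 2017
print(keep_partition(2016, [('>', 2015), ('<=', 2017)]))  # True
print(keep_partition(2015, [('>', 2015)]))                # False
```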
[jira] [Commented] (ARROW-2420) [Rust] Memory is never released
[ https://issues.apache.org/jira/browse/ARROW-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430490#comment-16430490 ] ASF GitHub Bot commented on ARROW-2420: --- pitrou closed pull request #1860: ARROW-2420: [Rust] Fix major memory bug and add benches URL: https://github.com/apache/arrow/pull/1860 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):
diff --git a/rust/Cargo.toml b/rust/Cargo.toml
index c3120cfdc..4d2476b0c 100644
--- a/rust/Cargo.toml
+++ b/rust/Cargo.toml
@@ -36,4 +36,15 @@ path = "src/lib.rs"
 [dependencies]
 bytes = "0.4"
 libc = "0.2"
-serde_json = "1.0.13"
\ No newline at end of file
+serde_json = "1.0.13"
+
+[dev-dependencies]
+criterion = "0.2"
+
+[[bench]]
+name = "array_from_vec"
+harness = false
+
+[[bench]]
+name = "array_from_builder"
+harness = false
\ No newline at end of file
diff --git a/rust/benches/array_from_builder.rs b/rust/benches/array_from_builder.rs
new file mode 100644
index 0..3d020030e
--- /dev/null
+++ b/rust/benches/array_from_builder.rs
@@ -0,0 +1,49 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#[macro_use]
+extern crate criterion;
+
+use criterion::Criterion;
+
+extern crate arrow;
+
+use arrow::array::*;
+use arrow::builder::*;
+
+fn array_from_builder(n: usize) {
+    let mut v: Builder<i32> = Builder::with_capacity(n);
+    for i in 0..n {
+        v.push(i as i32);
+    }
+    Array::from(v.finish());
+}
+
+fn criterion_benchmark(c: &mut Criterion) {
+    c.bench_function("array_from_builder 128", |b| {
+        b.iter(|| array_from_builder(128))
+    });
+    c.bench_function("array_from_builder 256", |b| {
+        b.iter(|| array_from_builder(256))
+    });
+    c.bench_function("array_from_builder 512", |b| {
+        b.iter(|| array_from_builder(512))
+    });
+}
+
+criterion_group!(benches, criterion_benchmark);
+criterion_main!(benches);
diff --git a/rust/benches/array_from_vec.rs b/rust/benches/array_from_vec.rs
new file mode 100644
index 0..0feb0de0b
--- /dev/null
+++ b/rust/benches/array_from_vec.rs
@@ -0,0 +1,42 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#[macro_use]
+extern crate criterion;
+
+use criterion::Criterion;
+
+extern crate arrow;
+
+use arrow::array::*;
+
+fn array_from_vec(n: usize) {
+    let mut v: Vec<i32> = Vec::with_capacity(n);
+    for i in 0..n {
+        v.push(i as i32);
+    }
+    Array::from(v);
+}
+
+fn criterion_benchmark(c: &mut Criterion) {
+    c.bench_function("array_from_vec 128", |b| b.iter(|| array_from_vec(128)));
+    c.bench_function("array_from_vec 256", |b| b.iter(|| array_from_vec(256)));
+    c.bench_function("array_from_vec 512", |b| b.iter(|| array_from_vec(512)));
+}
+
+criterion_group!(benches, criterion_benchmark);
+criterion_main!(benches);
diff --git a/rust/src/buffer.rs b/rust/src/buffer.rs
index 1f2ec6c8d..1cf004fb1 100644
--- a/rust/src/buffer.rs
+++ b/rust/src/buffer.rs
@@ -74,7 +74,10 @@ impl<T> Buffer<T> {
 
 impl<T> Drop for Buffer<T> {
     fn drop(&mut self) {
-        mem::drop(self.data)
+        unsafe {
+            let p = mem::transmute::<*const T, *mut libc::c_void>(self.data);
+            libc::free(p);
+        }
     }
 }
diff --git a/rust/src/builder.rs
[jira] [Commented] (ARROW-2420) [Rust] Memory is never released
[ https://issues.apache.org/jira/browse/ARROW-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430492#comment-16430492 ] ASF GitHub Bot commented on ARROW-2420: --- pitrou commented on issue #1860: ARROW-2420: [Rust] Fix major memory bug and add benches URL: https://github.com/apache/arrow/pull/1860#issuecomment-379742452 Thanks @andygrove ! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Memory is never released > --- > > Key: ARROW-2420 > URL: https://issues.apache.org/jira/browse/ARROW-2420 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Another embarrassing bug ... the code was calling the wrong method to release > memory and wasn't releasing memory. > I have added some benchmarks for testing performance of creating arrays (and > dropping them) and these are working well now after fixing the memory bug. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2420) [Rust] Memory is never released
[ https://issues.apache.org/jira/browse/ARROW-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-2420. --- Resolution: Fixed Issue resolved by pull request 1860 [https://github.com/apache/arrow/pull/1860] > [Rust] Memory is never released > --- > > Key: ARROW-2420 > URL: https://issues.apache.org/jira/browse/ARROW-2420 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Another embarrassing bug ... the code was calling the wrong method to release > memory and wasn't releasing memory. > I have added some benchmarks for testing performance of creating arrays (and > dropping them) and these are working well now after fixing the memory bug. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
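The contract behind this fix is that memory obtained from `libc::malloc` must be returned through `libc::free`; dropping the raw pointer (as the old `mem::drop(self.data)` did) releases nothing. The same malloc/free pairing can be illustrated from Python via ctypes (an analogue on POSIX systems, not the Arrow code itself):

```python
import ctypes

# On POSIX, CDLL(None) opens the running process, whose symbol table
# includes libc's malloc and free.
libc = ctypes.CDLL(None)
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

p = libc.malloc(64)                          # raw allocation, like Buffer's data
buf = (ctypes.c_int32 * 16).from_address(p)  # view it as sixteen int32 values
buf[0] = 42
value = buf[0]
libc.free(p)  # the matching release; merely dropping `p` would leak the block
print(value)  # 42
```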
[jira] [Commented] (ARROW-2408) [Rust] It should be possible to get a &mut [T] from Builder
[ https://issues.apache.org/jira/browse/ARROW-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430487#comment-16430487 ] ASF GitHub Bot commented on ARROW-2408: --- pitrou commented on issue #1846: ARROW-2408: [Rust] Ability to get `&mut [T]` from `Buffer` URL: https://github.com/apache/arrow/pull/1846#issuecomment-379741637 Great! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] It should be possible to get a &mut [T] from Builder > - > > Key: ARROW-2408 > URL: https://issues.apache.org/jira/browse/ARROW-2408 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > I am currently adding Arrow support to the parquet-rs crate and I found a > need to get a mutable slice from a Buffer to pass to the parquet column > reader methods. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2408) [Rust] It should be possible to get a &mut [T] from Builder
[ https://issues.apache.org/jira/browse/ARROW-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430486#comment-16430486 ] ASF GitHub Bot commented on ARROW-2408: --- pitrou closed pull request #1846: ARROW-2408: [Rust] Ability to get `&mut [T]` from `Buffer` URL: https://github.com/apache/arrow/pull/1846 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):
diff --git a/rust/examples/array_from_builder.rs b/rust/examples/array_from_builder.rs
index 3a273a64d..ea1ecec45 100644
--- a/rust/examples/array_from_builder.rs
+++ b/rust/examples/array_from_builder.rs
@@ -18,7 +18,6 @@
 extern crate arrow;
 
 use arrow::array::*;
-use arrow::buffer::*;
 use arrow::builder::*;
 
 fn main() {
diff --git a/rust/src/buffer.rs b/rust/src/buffer.rs
index ab90a5b08..1f2ec6c8d 100644
--- a/rust/src/buffer.rs
+++ b/rust/src/buffer.rs
@@ -18,7 +18,6 @@
 use bytes::Bytes;
 use libc;
 use std::mem;
-use std::ptr;
 use std::slice;
 
 use super::memory::*;
diff --git a/rust/src/builder.rs b/rust/src/builder.rs
index 1cc024042..ebdf3a942 100644
--- a/rust/src/builder.rs
+++ b/rust/src/builder.rs
@@ -15,7 +15,6 @@
 // specific language governing permissions and limitations
 // under the License.
 
-use bytes::Bytes;
 use libc;
 use std::mem;
 use std::ptr;
@@ -48,6 +47,21 @@ impl<T> Builder<T> {
         }
     }
 
+    /// Get the internal byte-aligned memory buffer as a mutable slice
+    pub fn slice_mut(&mut self, start: usize, end: usize) -> &mut [T] {
+        assert!(start <= end);
+        assert!(start < self.len as usize);
+        assert!(end <= self.len as usize);
+        unsafe {
+            slice::from_raw_parts_mut(self.data.offset(start as isize), (end - start) as usize)
+        }
+    }
+
+    /// Override the length
+    pub fn set_len(&mut self, len: usize) {
+        self.len = len;
+    }
+
     /// Push a value into the builder, growing the internal buffer as needed
     pub fn push(&mut self, v: T) {
         assert!(!self.data.is_null());
This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] It should be possible to get a &mut [T] from Builder > - > > Key: ARROW-2408 > URL: https://issues.apache.org/jira/browse/ARROW-2408 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > I am currently adding Arrow support to the parquet-rs crate and I found a > need to get a mutable slice from a Buffer to pass to the parquet column > reader methods. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
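The `slice_mut` addition hands out a mutable window over the builder's byte-aligned buffer, so a caller such as a parquet column reader can write into it in place. The idea can be sketched in Python with a `memoryview` over a `bytearray` (an analogue for illustration, not the Rust API):

```python
import struct

backing = bytearray(4 * 4)            # room for four int32 values
view = memoryview(backing).cast('i')  # mutable "slice" of i32 over the buffer

# Writes through the slice land directly in the underlying buffer,
# which is exactly what handing out &mut [T] enables in Rust.
for i in range(4):
    view[i] = i * 10

print(list(view))                    # [0, 10, 20, 30]
print(struct.unpack('4i', backing))  # (0, 10, 20, 30)
```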
[jira] [Commented] (ARROW-2415) [Rust] Fix using references in pattern matching
[ https://issues.apache.org/jira/browse/ARROW-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430484#comment-16430484 ] ASF GitHub Bot commented on ARROW-2415: --- pitrou commented on issue #1851: ARROW-2415: [Rust] Fix clippy ref-match-pats warnings. URL: https://github.com/apache/arrow/pull/1851#issuecomment-379741162 Ok, thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Fix using references in pattern matching > --- > > Key: ARROW-2415 > URL: https://issues.apache.org/jira/browse/ARROW-2415 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Bruce Mitchener >Assignee: Bruce Mitchener >Priority: Major > Labels: pull-request-available > > Clippy reports > [https://rust-lang-nursery.github.io/rust-clippy/v0.0.191/index.html#match_ref_pats] > warnings. -- This message was sent by Atlassian JIRA (v7.6.3#76005)