[
https://issues.apache.org/jira/browse/ARROW-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654572#comment-17654572
]
Apache Arrow JIRA Bot commented on ARROW-12666:
-----------------------------------------------
This issue was last updated over 90 days ago, which may indicate that it is
no longer being actively worked on. To better reflect the current state, the issue
is being unassigned per [project
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
Please feel free to re-take assignment of the issue if it is being actively
worked on, or if you plan to start that work soon.
> [Python] Array construction from numpy array is unclear about zero copy
> behaviour
> ---------------------------------------------------------------------------------
>
> Key: ARROW-12666
> URL: https://issues.apache.org/jira/browse/ARROW-12666
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 4.0.0
> Reporter: Alessandro Molina
> Assignee: Alessandro Molina
> Priority: Major
> Labels: good-second-issue
>
> When building an Arrow array from a numpy array, it is confusing from the
> user's point of view that the result is not always a new array.
> Under the hood, Arrow sometimes reuses the memory when no cast is needed:
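> {code:python}
> import numpy as np
> import pyarrow as pa
>
> npa = np.array([1, 2, 3]*3)
> arrow_array = pa.array(npa, type=pa.int64())
> # Mutating the numpy array is visible through the Arrow array
> npa[npa == 2] = 10
> print(arrow_array.to_pylist())
> # Prints: [1, 10, 3, 1, 10, 3, 1, 10, 3]
> {code}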
> and sometimes does not, when a cast is involved:
> {code:python}
> npa = np.array([1, 2, 3]*3)
> arrow_array = pa.array(npa, type=pa.int32())
> npa[npa == 2] = 10
> print(arrow_array.to_pylist())
> # Prints: [1, 2, 3, 1, 2, 3, 1, 2, 3]
> {code}
> For non-primitive types, on the other hand, it always copies:
> {code:python}
> npa = np.array(["a", "b", "c"]*3)
> arrow_array = pa.array(npa, type=pa.string())
> npa[npa == "b"] = "X"
> print(arrow_array.to_pylist())
> # Prints: ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']
> # Differs from the numpy array, which was modified
> {code}
> This behaviour demands close attention from the user and an understanding of
> what is going on under the hood, which makes pyarrow hard to use.
> A {{copy=True/False}} parameter should be added to {{pa.array}}, and the default
> should probably be {{copy=True}} so that by default you can always create an
> Arrow array out of a numpy one ({{copy=False}} would probably have to
> raise an exception in cases where zero copy cannot be guaranteed, such as
> when building from a Python list).