[jira] [Updated] (ARROW-7370) [C++] Old Protobuf with AUTO detection is failed
[ https://issues.apache.org/jira/browse/ARROW-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-7370:
----------------------------------
    Labels: pull-request-available  (was: )

> [C++] Old Protobuf with AUTO detection is failed
> ------------------------------------------------
>
>                 Key: ARROW-7370
>                 URL: https://issues.apache.org/jira/browse/ARROW-7370
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Kouhei Sutou
>            Assignee: Kouhei Sutou
>            Priority: Major
>              Labels: pull-request-available
>
> {noformat}
> -- Could NOT find Protobuf: Found unsuitable version "3.6.1", but required is
> at least "3.7.0" (found /usr/lib/x86_64-linux-gnu/libprotobuf.so;-pthread)
> Building Protocol Buffers from source
> CMake Error at cmake_modules/ThirdpartyToolchain.cmake:1179 (add_library):
>   add_library cannot create imported target "protobuf::libprotobuf" because
>   another target with the same name already exists.
> Call Stack (most recent call first):
>   cmake_modules/ThirdpartyToolchain.cmake:147 (build_protobuf)
>   cmake_modules/ThirdpartyToolchain.cmake:178 (build_dependency)
>   cmake_modules/ThirdpartyToolchain.cmake:1204 (resolve_dependency_with_version)
>   CMakeLists.txt:428 (include)
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-7370) [C++] Old Protobuf with AUTO detection is failed
Kouhei Sutou created ARROW-7370:
-----------------------------------

             Summary: [C++] Old Protobuf with AUTO detection is failed
                 Key: ARROW-7370
                 URL: https://issues.apache.org/jira/browse/ARROW-7370
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Kouhei Sutou
            Assignee: Kouhei Sutou

{noformat}
-- Could NOT find Protobuf: Found unsuitable version "3.6.1", but required is
at least "3.7.0" (found /usr/lib/x86_64-linux-gnu/libprotobuf.so;-pthread)
Building Protocol Buffers from source
CMake Error at cmake_modules/ThirdpartyToolchain.cmake:1179 (add_library):
  add_library cannot create imported target "protobuf::libprotobuf" because
  another target with the same name already exists.
Call Stack (most recent call first):
  cmake_modules/ThirdpartyToolchain.cmake:147 (build_protobuf)
  cmake_modules/ThirdpartyToolchain.cmake:178 (build_dependency)
  cmake_modules/ThirdpartyToolchain.cmake:1204 (resolve_dependency_with_version)
  CMakeLists.txt:428 (include)
{noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (ARROW-7369) [GLib] Add garrow_table_combine_chunks
[ https://issues.apache.org/jira/browse/ARROW-7369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7369: -- Labels: pull-request-available (was: ) > [GLib] Add garrow_table_combine_chunks > -- > > Key: ARROW-7369 > URL: https://issues.apache.org/jira/browse/ARROW-7369 > Project: Apache Arrow > Issue Type: New Feature > Components: GLib >Reporter: Kenta Murata >Assignee: Kenta Murata >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7369) [GLib] Add garrow_table_combine_chunks
Kenta Murata created ARROW-7369: --- Summary: [GLib] Add garrow_table_combine_chunks Key: ARROW-7369 URL: https://issues.apache.org/jira/browse/ARROW-7369 Project: Apache Arrow Issue Type: New Feature Components: GLib Reporter: Kenta Murata Assignee: Kenta Murata -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7272) [C++][Java] JNI bridge between RecordBatch and VectorSchemaRoot
[ https://issues.apache.org/jira/browse/ARROW-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16993151#comment-16993151 ]

Hongze Zhang commented on ARROW-7272:
-------------------------------------

Hi guys, would you suggest just using the existing *org.apache.arrow.vector.ipc.ArrowReader*? We already have a similar approach in the ORC adapter and it works fine. Since schemas in the Datasets API are always predefined, I think we don't have to convert the schema every time.

> [C++][Java] JNI bridge between RecordBatch and VectorSchemaRoot
> ---------------------------------------------------------------
>
>                 Key: ARROW-7272
>                 URL: https://issues.apache.org/jira/browse/ARROW-7272
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Java
>            Reporter: Francois Saint-Jacques
>            Priority: Major
>
> Given a C++ std::shared_ptr<RecordBatch>, retrieve it in Java as a
> VectorSchemaRoot class. Gandiva already offers a similar facility, but with raw
> buffers. It would be convenient if users could call C++ that yields a
> RecordBatch and retrieve it in a seamless fashion.
> This would remove one roadblock to using the C++ dataset facility in Java.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (ARROW-7368) [Ruby] Use :arrow_file and :arrow_stream for format name
[ https://issues.apache.org/jira/browse/ARROW-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7368: -- Labels: pull-request-available (was: ) > [Ruby] Use :arrow_file and :arrow_stream for format name > > > Key: ARROW-7368 > URL: https://issues.apache.org/jira/browse/ARROW-7368 > Project: Apache Arrow > Issue Type: Improvement > Components: Ruby >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7368) [Ruby] Use :arrow_file and :arrow_stream for format name
Kouhei Sutou created ARROW-7368: --- Summary: [Ruby] Use :arrow_file and :arrow_stream for format name Key: ARROW-7368 URL: https://issues.apache.org/jira/browse/ARROW-7368 Project: Apache Arrow Issue Type: Improvement Components: Ruby Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6965) [C++][Dataset] Optionally expose partition keys as materialized columns
[ https://issues.apache.org/jira/browse/ARROW-6965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman resolved ARROW-6965. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 5950 [https://github.com/apache/arrow/pull/5950] > [C++][Dataset] Optionally expose partition keys as materialized columns > --- > > Key: ARROW-6965 > URL: https://issues.apache.org/jira/browse/ARROW-6965 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Dataset >Reporter: Francois Saint-Jacques >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, pull-request-available > Fix For: 1.0.0 > > Time Spent: 7h 50m > Remaining Estimate: 0h > > This would be exposed in the DataSourceDiscovery as an option. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7361) [Rust] Build directory is not passed to ci/scripts/rust_test.sh
[ https://issues.apache.org/jira/browse/ARROW-7361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-7361. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 6004 [https://github.com/apache/arrow/pull/6004] > [Rust] Build directory is not passed to ci/scripts/rust_test.sh > --- > > Key: ARROW-7361 > URL: https://issues.apache.org/jira/browse/ARROW-7361 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > See build https://github.com/apache/arrow/runs/340751277 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7366) [C++][Dataset] Use PartitionSchemeDiscovery in DataSourceDiscovery
[ https://issues.apache.org/jira/browse/ARROW-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7366: -- Labels: dataset pull-request-available (was: dataset) > [C++][Dataset] Use PartitionSchemeDiscovery in DataSourceDiscovery > -- > > Key: ARROW-7366 > URL: https://issues.apache.org/jira/browse/ARROW-7366 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Dataset >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, pull-request-available > > https://github.com/apache/arrow/pull/5950 introduces > {{PartitionSchemeDiscovery}}, but ideally it would be supplied as an option > to data source discovery and the partition schema automatically discovered > based on the file paths accumulated then. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7367) [Python] Use np.full instead of np.array.repeat in ParquetDatasetPiece
[ https://issues.apache.org/jira/browse/ARROW-7367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7367: -- Labels: pull-request-available (was: ) > [Python] Use np.full instead of np.array.repeat in ParquetDatasetPiece > -- > > Key: ARROW-7367 > URL: https://issues.apache.org/jira/browse/ARROW-7367 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Xavier Lacroze >Priority: Trivial > Labels: pull-request-available > > For small tables (len < 100) execution time is slightly degraded (~ x1.4 at > len = 10), for large ones performance gain is huge (exec time ~ x0.04 at len > = 100_000) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7367) [Python] Use np.full instead of np.array.repeat in ParquetDatasetPiece
[ https://issues.apache.org/jira/browse/ARROW-7367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xavier Lacroze updated ARROW-7367: -- Summary: [Python] Use np.full instead of np.array.repeat in ParquetDatasetPiece (was: Use np.full instead of np.array.repeat in ParquetDatasetPiece) > [Python] Use np.full instead of np.array.repeat in ParquetDatasetPiece > -- > > Key: ARROW-7367 > URL: https://issues.apache.org/jira/browse/ARROW-7367 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Xavier Lacroze >Priority: Trivial > > For small tables (len < 100) execution time is slightly degraded (~ x1.4 at > len = 10), for large ones performance gain is huge (exec time ~ x0.04 at len > = 100_000) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7367) Use np.full instead of np.array.repeat in ParquetDatasetPiece
Xavier Lacroze created ARROW-7367: - Summary: Use np.full instead of np.array.repeat in ParquetDatasetPiece Key: ARROW-7367 URL: https://issues.apache.org/jira/browse/ARROW-7367 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Xavier Lacroze For small tables (len < 100) execution time is slightly degraded (~ x1.4 at len = 10), for large ones performance gain is huge (exec time ~ x0.04 at len = 100_000) -- This message was sent by Atlassian Jira (v8.3.4#803005)
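The change proposed above can be sketched in a few lines; the repeated value stands in for whatever scalar `ParquetDatasetPiece` broadcasts (the key used here is hypothetical):

```python
import numpy as np

n = 100_000
key = "2019-12-11"  # hypothetical partition key value

# older approach: wrap the scalar in a 0-d array, then repeat it n times
repeated = np.array(key).repeat(n)

# proposed replacement: allocate the filled array in one step
filled = np.full(n, key)

# both produce the same n-element array of the scalar
assert (repeated == filled).all()
```

The two calls are equivalent in output; `np.full` simply avoids the intermediate array and the per-element copy loop of `repeat`, which is where the large-table speedup reported above would come from.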
[jira] [Created] (ARROW-7366) [C++][Dataset] Use PartitionSchemeDiscovery in DataSourceDiscovery
Ben Kietzman created ARROW-7366: --- Summary: [C++][Dataset] Use PartitionSchemeDiscovery in DataSourceDiscovery Key: ARROW-7366 URL: https://issues.apache.org/jira/browse/ARROW-7366 Project: Apache Arrow Issue Type: New Feature Components: C++ - Dataset Reporter: Ben Kietzman Assignee: Ben Kietzman https://github.com/apache/arrow/pull/5950 introduces {{PartitionSchemeDiscovery}}, but ideally it would be supplied as an option to data source discovery and the partition schema automatically discovered based on the file paths accumulated then. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7343) [Java] Memory leak in Flight DoGet when client cancels
[ https://issues.apache.org/jira/browse/ARROW-7343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992701#comment-16992701 ] David Li commented on ARROW-7343: - I've implemented the fix in the linked PR. I also found that DoPut could leak for different reasons (we call gRPC methods that can throw, causing us to skip cleanup), which I've fixed. > [Java] Memory leak in Flight DoGet when client cancels > -- > > Key: ARROW-7343 > URL: https://issues.apache.org/jira/browse/ARROW-7343 > Project: Apache Arrow > Issue Type: Bug > Components: FlightRPC, Java >Affects Versions: 0.14.0 >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > I believe this causes things like ARROW-4765. > -If a stream is interrupted or otherwise not drained by the client, the > serialized form of the ArrowMessage (DrainableByteBufInputStream) will sit > around forever, leaking memory.- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7362) [Python] ListArray.flatten() should take care of slicing offsets
[ https://issues.apache.org/jira/browse/ARROW-7362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7362: -- Labels: pull-request-available (was: ) > [Python] ListArray.flatten() should take care of slicing offsets > > > Key: ARROW-7362 > URL: https://issues.apache.org/jira/browse/ARROW-7362 > Project: Apache Arrow > Issue Type: Bug >Reporter: Zhuo Peng >Assignee: Zhuo Peng >Priority: Minor > Labels: pull-request-available > > Currently ListArray.flatten() simply returns the child array. If a ListArray > is a slice of another ListArray, they will share the same child array, > however the expected behavior (I think) of flatten() should be returning an > Array that's a concatenation of all the sub-lists in the ListArray, so the > slicing offset should be taken into account. > > For example: > a = pa.array([[1], [2], [3]]) > assert a.flatten().equals(pa.array([1,2,3])) > # expected: > a.slice(1).flatten().equals(pa.array([2, 3])) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7362) [Python] ListArray.flatten() should take care of slicing offsets
[ https://issues.apache.org/jira/browse/ARROW-7362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992636#comment-16992636 ] Joris Van den Bossche commented on ARROW-7362: -- Another option could be to adjust the offsets so they point into the sliced values. But, this would then not be a zero-copy access of the offsets, which probably makes it a bad idea. > [Python] ListArray.flatten() should take care of slicing offsets > > > Key: ARROW-7362 > URL: https://issues.apache.org/jira/browse/ARROW-7362 > Project: Apache Arrow > Issue Type: Bug >Reporter: Zhuo Peng >Assignee: Zhuo Peng >Priority: Minor > > Currently ListArray.flatten() simply returns the child array. If a ListArray > is a slice of another ListArray, they will share the same child array, > however the expected behavior (I think) of flatten() should be returning an > Array that's a concatenation of all the sub-lists in the ListArray, so the > slicing offset should be taken into account. > > For example: > a = pa.array([[1], [2], [3]]) > assert a.flatten().equals(pa.array([1,2,3])) > # expected: > a.slice(1).flatten().equals(pa.array([2, 3])) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7362) [Python] ListArray.flatten() should take care of slicing offsets
[ https://issues.apache.org/jira/browse/ARROW-7362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992631#comment-16992631 ] Joris Van den Bossche commented on ARROW-7362: -- Yes, the main thing is that {{offsets}} and one of {{values}}/{{flatten()}} need to match. Currently I implemented {{offsets}} such that they are sliced themselves but point into the unsliced values. > [Python] ListArray.flatten() should take care of slicing offsets > > > Key: ARROW-7362 > URL: https://issues.apache.org/jira/browse/ARROW-7362 > Project: Apache Arrow > Issue Type: Bug >Reporter: Zhuo Peng >Assignee: Zhuo Peng >Priority: Minor > > Currently ListArray.flatten() simply returns the child array. If a ListArray > is a slice of another ListArray, they will share the same child array, > however the expected behavior (I think) of flatten() should be returning an > Array that's a concatenation of all the sub-lists in the ListArray, so the > slicing offset should be taken into account. > > For example: > a = pa.array([[1], [2], [3]]) > assert a.flatten().equals(pa.array([1,2,3])) > # expected: > a.slice(1).flatten().equals(pa.array([2, 3])) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7362) [Python] ListArray.flatten() should take care of slicing offsets
[ https://issues.apache.org/jira/browse/ARROW-7362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992628#comment-16992628 ]

Wes McKinney commented on ARROW-7362:
-------------------------------------

I can't remember what the argument was before (maybe I was making it, sorry), but I think it would be OK for {{flatten()}} to return the sliced values, while {{.values}} does need to return the unsliced values. As long as an appropriate caveat is added to the docstring saying that the offsets should not be used (for random-access purposes) with the result of {{flatten()}}.

> [Python] ListArray.flatten() should take care of slicing offsets
> ----------------------------------------------------------------
>
>                 Key: ARROW-7362
>                 URL: https://issues.apache.org/jira/browse/ARROW-7362
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Zhuo Peng
>            Assignee: Zhuo Peng
>            Priority: Minor
>
> Currently ListArray.flatten() simply returns the child array. If a ListArray
> is a slice of another ListArray, they will share the same child array,
> however the expected behavior (I think) of flatten() should be returning an
> Array that's a concatenation of all the sub-lists in the ListArray, so the
> slicing offset should be taken into account.
>
> For example:
> a = pa.array([[1], [2], [3]])
> assert a.flatten().equals(pa.array([1,2,3]))
> # expected:
> a.slice(1).flatten().equals(pa.array([2, 3]))

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
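The offset bookkeeping being discussed can be sketched in plain Python, with a list array modeled as a child-values buffer plus an offsets buffer (names are illustrative, not pyarrow's API):

```python
def flatten_slice(values, offsets, start, length):
    # A slice of a list array keeps the same child values buffer and only
    # shifts the logical window into the offsets; a correct flatten() must
    # honor that window instead of returning the whole child array.
    lo = offsets[start]
    hi = offsets[start + length]
    return values[lo:hi]

# model of pa.array([[1], [2], [3]])
values = [1, 2, 3]
offsets = [0, 1, 2, 3]

# the unsliced array flattens to the whole child array
assert flatten_slice(values, offsets, 0, 3) == [1, 2, 3]
# a.slice(1) should flatten to [2, 3], not to the whole child array
assert flatten_slice(values, offsets, 1, 2) == [2, 3]
```

This is the behavior the issue asks for: the slice offset is applied through the offsets buffer before the child values are taken.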
[jira] [Updated] (ARROW-7227) [Python] Provide wrappers for ConcatenateWithPromotion()
[ https://issues.apache.org/jira/browse/ARROW-7227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-7227: Component/s: Python > [Python] Provide wrappers for ConcatenateWithPromotion() > > > Key: ARROW-7227 > URL: https://issues.apache.org/jira/browse/ARROW-7227 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Zhuo Peng >Assignee: Zhuo Peng >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 3h 40m > Remaining Estimate: 0h > > [https://github.com/apache/arrow/pull/5534] Introduced > ConcatenateWithPromotion() to C++. Provide a Python wrapper for it. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7227) [Python] Provide wrappers for ConcatenateWithPromotion()
[ https://issues.apache.org/jira/browse/ARROW-7227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-7227. - Resolution: Fixed Issue resolved by pull request 5804 [https://github.com/apache/arrow/pull/5804] > [Python] Provide wrappers for ConcatenateWithPromotion() > > > Key: ARROW-7227 > URL: https://issues.apache.org/jira/browse/ARROW-7227 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Zhuo Peng >Assignee: Zhuo Peng >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > [https://github.com/apache/arrow/pull/5534] Introduced > ConcatenateWithPromotion() to C++. Provide a Python wrapper for it. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7305) [Python] High memory usage writing pyarrow.Table with large strings to parquet
[ https://issues.apache.org/jira/browse/ARROW-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992615#comment-16992615 ]

Wes McKinney commented on ARROW-7305:
-------------------------------------

There may be some things we could do about this. Do you have an example file we could use to help with profiling the internal memory allocations during the write process?

> [Python] High memory usage writing pyarrow.Table with large strings to parquet
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-7305
>                 URL: https://issues.apache.org/jira/browse/ARROW-7305
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: Mac OSX
>            Reporter: Bogdan Klichuk
>            Priority: Major
>              Labels: parquet
>
> My case of datasets stored is specific. I have large strings (1-100MB each).
> Let's take for example a single row.
> 43mb.csv is a 1-row CSV with 10 columns. One column is a 43mb string.
> When I read this csv with pandas and then dump to parquet, my script consumes
> 10x of the 43mb.
> With an increasing number of such rows the memory footprint overhead diminishes, but
> I want to focus on this specific case.
> Here's the footprint after running under memory_profiler:
> {code:java}
> Line #    Mem usage    Increment   Line Contents
> ================================================
>      4     48.9 MiB     48.9 MiB   @profile
>      5                             def test():
>      6    143.7 MiB     94.7 MiB       data = pd.read_csv('43mb.csv')
>      7    498.6 MiB    354.9 MiB       data.to_parquet('out.parquet')
> {code}
> Is this typical for parquet in case of big strings?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (ARROW-6775) Proposal for several Array utility functions
[ https://issues.apache.org/jira/browse/ARROW-6775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992607#comment-16992607 ]

Joris Van den Bossche commented on ARROW-6775:
----------------------------------------------

[~brillsp] thanks for opening the issue, and sorry for the slow reply. I would recommend opening specific issues for the different items you mention (or after some more feedback, if we think they would be good to add).

{quote}1/ ListLengthFromListArray(ListArray&): Returns lengths of lists in a ListArray, as a Int32Array (or Int64Array for large lists). For example:{quote}

This can be calculated relatively easily from the offsets, I think? (and the offsets are now exposed in Python)

{quote}3/ GetArrayNullBitmapAsByteArray(Array&): Returns the array's null bitmap as a UInt8Array (which can be efficiently converted to a bool numpy array){quote}

I think this is certainly something we want to add somehow. This is also related to exposing an "IsNull" that returns a BooleanArray from the bitmap; see ARROW-971 and the discussion in the PR. Maybe a utility to convert the bitmap to a BooleanArray is more general, as the conversion from a bitmap/BooleanArray to a bool/int8 numpy array is already implemented.

{quote}4/ GetFlattenedArrayParentIndices(ListArray&) Makes a int32 array of the same length as the flattened ListArray. returned_array[i] == j means i-th element in the flattened ListArray came from j-th list in the ListArray. For example [[1,2,3], [], None, [4,5]] => [0, 0, 0, 3, 3]{quote}

Can you explain this one a bit more?

> Proposal for several Array utility functions
> --------------------------------------------
>
>                 Key: ARROW-6775
>                 URL: https://issues.apache.org/jira/browse/ARROW-6775
>             Project: Apache Arrow
>          Issue Type: Wish
>            Reporter: Zhuo Peng
>            Priority: Minor
>
> Hi,
> We developed several utilities that compute / access certain properties of
> Arrays and wonder if they make sense to get into the upstream (into both
> the C++ API and pyarrow) and, assuming yes, where is the best place to put
> them?
> Maybe I have overlooked existing APIs that already do the same.. in that case
> please point out.
>
> 1/ ListLengthFromListArray(ListArray&)
> Returns lengths of lists in a ListArray, as an Int32Array (or Int64Array for
> large lists). For example:
> [[1, 2, 3], [], None] => [3, 0, 0] (or [3, 0, None], but we hope the returned
> array can be converted to numpy)
>
> 2/ GetBinaryArrayTotalByteSize(BinaryArray&)
> Returns the total byte size of a BinaryArray (basically offset[len - 1] -
> offset[0]).
> Alternatively, a BinaryArray::Flatten() -> Uint8Array would work.
>
> 3/ GetArrayNullBitmapAsByteArray(Array&)
> Returns the array's null bitmap as a UInt8Array (which can be efficiently
> converted to a bool numpy array)
>
> 4/ GetFlattenedArrayParentIndices(ListArray&)
> Makes an int32 array of the same length as the flattened ListArray.
> returned_array[i] == j means the i-th element in the flattened ListArray came
> from the j-th list in the ListArray.
> For example [[1,2,3], [], None, [4,5]] => [0, 0, 0, 3, 3]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
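Items 1/ and 4/ above can both be derived from the offsets buffer alone; a plain-Python sketch (null lists are modeled as zero-length at the offsets level, matching the `[3, 0, 0]` convention in the examples; function names mirror the proposal but the implementations are illustrative):

```python
def list_lengths(offsets):
    # 1/ the length of each list is the difference of adjacent offsets
    return [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]

def parent_indices(offsets):
    # 4/ repeat each list's index once per element that list contributes
    # to the flattened values
    out = []
    for i in range(len(offsets) - 1):
        out.extend([i] * (offsets[i + 1] - offsets[i]))
    return out

# offsets for [[1, 2, 3], [], None, [4, 5]] (None occupies no values)
offsets = [0, 3, 3, 3, 5]
assert list_lengths(offsets) == [3, 0, 0, 2]
assert parent_indices(offsets) == [0, 0, 0, 3, 3]
```

Both outputs match the examples given in the proposal, which supports the comment that these utilities are cheap to build on top of the already-exposed offsets.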
[jira] [Commented] (ARROW-5303) [Rust] Add SIMD vectorization of numeric casts
[ https://issues.apache.org/jira/browse/ARROW-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992598#comment-16992598 ]

Andy Thomason commented on ARROW-5303:
--------------------------------------

It can be quite daunting. I'm happy to help with understanding the asm. I spent seven years teaching it to game programmers! I'm also quite old and grew up in a time when you wrote the instructions out by hand in hex. Matt's website is a godsend (to use a horrible pun).

{code}
.LBB0_5:
        vpmovzxbd ymm0, qword ptr [rdx + rcx]
        vpmovzxbd ymm1, qword ptr [rdx + rcx + 8]
        vpmovzxbd ymm2, qword ptr [rdx + rcx + 16]
        vpmovzxbd ymm3, qword ptr [rdx + rcx + 24]
        vmovdqu ymmword ptr [rdi + 4*rcx], ymm0
        vmovdqu ymmword ptr [rdi + 4*rcx + 32], ymm1
        vmovdqu ymmword ptr [rdi + 4*rcx + 64], ymm2
        vmovdqu ymmword ptr [rdi + 4*rcx + 96], ymm3
        add rcx, 32
        cmp rax, rcx
        jne .LBB0_5
{code}

The first instruction, "vpmovzxbd", loads and zero-extends 8 bytes of u8 into 32 bytes of u32. The second instruction, "vmovdqu", does an unaligned store of the value to 32 bytes of memory. Note that the offsets go up by 8 on the load side and 32 on the store side. The last three instructions are just the loop management.

The instructions themselves have almost zero cost, but writing the data out through the cache could be very expensive. The thing to look for here is lots of ymm or zmm registers and counters going up in large increments. You don't need to know every instruction, but this kind of pattern (four loads, four stores, loop) is about as good as it gets.

The loads occur in groups of four because there is a large latency on every instruction. We can start lots of them per cycle, but it will take many cycles to get the data to RAM. Think of it as a production line with people fetching data from a warehouse and putting it on a conveyor belt, then taking it off and carrying it to another warehouse. The conveyor belts can be quite long, but we can put lots of data on the belt at the same time.
> [Rust] Add SIMD vectorization of numeric casts > -- > > Key: ARROW-5303 > URL: https://issues.apache.org/jira/browse/ARROW-5303 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 0.13.0 >Reporter: Neville Dipale >Priority: Minor > > To improve the performance of cast kernels, we need SIMD support in numeric > casts. > An initial exploration shows that we can't trivially add SIMD casts between > our Arrow T::Simd types, because `packed_simd` only supports a cast between > T::Simd types that have the same number of lanes. > This means that adding casts from f64 to i64 (same lane length) satisfies the > bound trait `where TO::Simd : packed_simd::FromCast`, but f64 to > i32 (different lane length) doesn't. > We would benefit from investigating work-arounds to this limitation. Please > see > [github::nevi_me::arrow/\{branch:simd-cast}/../kernels/cast.rs|[https://github.com/nevi-me/arrow/blob/simd-cast/rust/arrow/src/compute/kernels/cast.rs#L601]] > for an example implementation that's limited by the differences in lane > length. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7350) [Python] Parquet file metadata min and max statistics not decoded from bytes for Decimal data types
[ https://issues.apache.org/jira/browse/ARROW-7350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992597#comment-16992597 ]

Joris Van den Bossche commented on ARROW-7350:
----------------------------------------------

[~max.firman] Thanks for the report! Such a conversion would fit in the {{_box_logical_type_value}} function (https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L250-L294), which already handles conversion of raw values to Python types for e.g. timestamps. I would only need to check if we already have some conversion utility from bytes to Decimal.

> [Python] Parquet file metadata min and max statistics not decoded from bytes
> for Decimal data types
> ----------------------------------------------------------------------------
>
>                 Key: ARROW-7350
>                 URL: https://issues.apache.org/jira/browse/ARROW-7350
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>            Reporter: Max Firman
>            Priority: Major
>
> Parquet file metadata for Decimal type columns contain min and max values
> that are not decoded from bytes into Decimals. This causes issues in
> dependent libraries like Dask (see
> [https://github.com/dask/dask/issues/5647]).
>
> {code:python|title=Reproducible example|borderStyle=solid}
> from decimal import Decimal
> import random
>
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow as pa
>
> NUM_DATA_POINTS_PER_PARTITION = 25
>
> random.seed(0)
> data1 = [{"col1": Decimal(f"{random.randint(0, 999)}.{random.randint(0, 99)}")} for i in range(NUM_DATA_POINTS_PER_PARTITION)]
> df = pd.DataFrame(data1)
> table = pa.Table.from_pandas(df)
> pq.write_table(table, 'my_data.parquet')
>
> parquet_file = pq.ParquetFile('my_data.parquet')
> assert isinstance(parquet_file.metadata.row_group(0).column(0).statistics.min, Decimal)  # <-- AssertionError here because min has type bytes rather than Decimal
> assert isinstance(parquet_file.metadata.row_group(0).column(0).statistics.max, Decimal)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
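A minimal sketch of such a conversion, assuming the raw statistics bytes hold the big-endian two's-complement unscaled integer that Parquet's DECIMAL logical type specifies (the function name is hypothetical, not part of pyarrow):

```python
from decimal import Decimal

def decode_decimal_stat(raw: bytes, scale: int) -> Decimal:
    # Parquet stores DECIMAL min/max statistics as a big-endian
    # two's-complement unscaled integer; apply the column's scale
    # to recover the logical value.
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)

# 0x3039 == 12345, at scale 2 -> 123.45
assert decode_decimal_stat(b"\x30\x39", 2) == Decimal("123.45")
# the sign bit is honored: 0xff is -1 in two's complement
assert decode_decimal_stat(b"\xff", 0) == Decimal("-1")
```

In `_box_logical_type_value` the `scale` would come from the column's decimal type metadata, just as the timestamp branch consults the column's time unit.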
[jira] [Updated] (ARROW-7336) [C++] Implement MinMax options to not skip nulls
[ https://issues.apache.org/jira/browse/ARROW-7336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7336: - Summary: [C++] Implement MinMax options to not skip nulls (was: implement minmax options) > [C++] Implement MinMax options to not skip nulls > > > Key: ARROW-7336 > URL: https://issues.apache.org/jira/browse/ARROW-7336 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Compute >Reporter: Yuan Zhou >Assignee: Yuan Zhou >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > minmax kernel has MinMaxOptions but not used -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-5303) [Rust] Add SIMD vectorization of numeric casts
[ https://issues.apache.org/jira/browse/ARROW-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992584#comment-16992584 ]

Andy Thomason commented on ARROW-5303:
--------------------------------------

"if" statements (or match) in loops do not always have a happy ending. I would just do the cast regardless of the validity, and copy the null bitmap from the source to the destination. In theory, you should disregard the value of a null data item.

Vec<Option<T>> is never going to be efficient, as it takes at least 2n bytes per element because of alignment and has terrible access patterns.

> [Rust] Add SIMD vectorization of numeric casts
> ----------------------------------------------
>
>                 Key: ARROW-5303
>                 URL: https://issues.apache.org/jira/browse/ARROW-5303
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>    Affects Versions: 0.13.0
>            Reporter: Neville Dipale
>            Priority: Minor
>
> To improve the performance of cast kernels, we need SIMD support in numeric
> casts.
> An initial exploration shows that we can't trivially add SIMD casts between
> our Arrow T::Simd types, because `packed_simd` only supports a cast between
> T::Simd types that have the same number of lanes.
> This means that adding casts from f64 to i64 (same lane length) satisfies the
> bound trait `where TO::Simd : packed_simd::FromCast`, but f64 to
> i32 (different lane length) doesn't.
> We would benefit from investigating work-arounds to this limitation. Please
> see
> [github::nevi_me::arrow/\{branch:simd-cast}/../kernels/cast.rs|[https://github.com/nevi-me/arrow/blob/simd-cast/rust/arrow/src/compute/kernels/cast.rs#L601]]
> for an example implementation that's limited by the differences in lane
> length.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
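The suggestion above, cast every slot unconditionally and carry the validity bitmap over unchanged, can be sketched in Python as a stand-in for the Rust kernel (names and the list-based representation are illustrative, not Arrow's actual buffers):

```python
def cast_all_copy_validity(values, validity, cast):
    # Cast every slot unconditionally -- no per-element branch on
    # null-ness, which is what keeps the loop vectorizable.
    # Null slots may hold garbage after the cast, but because the
    # validity bitmap is copied as-is, those values are never observed.
    return [cast(v) for v in values], list(validity)

values = [1.5, 0.0, 3.9]        # 0.0 is a placeholder in a null slot
validity = [True, False, True]  # slot 1 is null

out, out_validity = cast_all_copy_validity(values, validity, int)
assert out == [1, 0, 3]
assert out_validity == validity
```

The middle output value is meaningless by construction; consumers consult the copied validity bitmap first, so the branch-free cast is safe.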
[jira] [Updated] (ARROW-7365) [Python] Support FixedSizeList type in conversion to numpy/pandas
[ https://issues.apache.org/jira/browse/ARROW-7365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7365: - Fix Version/s: 1.0.0 > [Python] Support FixedSizeList type in conversion to numpy/pandas > - > > Key: ARROW-7365 > URL: https://issues.apache.org/jira/browse/ARROW-7365 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Fix For: 1.0.0 > > > Follow-up on ARROW-7261, still need to add support for FixedSizeListType in > the arrow -> python conversion (arrow_to_pandas.cc) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7365) [Python] Support FixedSizeList type in conversion to numpy/pandas
Joris Van den Bossche created ARROW-7365: Summary: [Python] Support FixedSizeList type in conversion to numpy/pandas Key: ARROW-7365 URL: https://issues.apache.org/jira/browse/ARROW-7365 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Follow-up on ARROW-7261, still need to add support for FixedSizeListType in the arrow -> python conversion (arrow_to_pandas.cc) -- This message was sent by Atlassian Jira (v8.3.4#803005)
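For reference, the conversion the issue asks for is mechanical once the layout is known: a FixedSizeList array stores one flat child buffer plus a constant list size. A plain-Python sketch of the mapping (not the arrow_to_pandas.cc code, and ignoring nulls):

```python
# Row i of a fixed-size list array spans child[i*size : (i+1)*size].
def fixed_size_list_to_pylist(child_values, list_size):
    return [child_values[i:i + list_size]
            for i in range(0, len(child_values), list_size)]

assert fixed_size_list_to_pylist([1, 2, 3, 4, 5, 6], 2) == [[1, 2], [3, 4], [5, 6]]
```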
[jira] [Assigned] (ARROW-7261) [Python] Python support for fixed size list type
[ https://issues.apache.org/jira/browse/ARROW-7261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-7261: - Assignee: Joris Van den Bossche > [Python] Python support for fixed size list type > > > Key: ARROW-7261 > URL: https://issues.apache.org/jira/browse/ARROW-7261 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > I didn't see any issue about this, but {{FixedSizeListArray}} (ARROW-1280) is > not yet exposed in Python. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7261) [Python] Python support for fixed size list type
[ https://issues.apache.org/jira/browse/ARROW-7261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-7261. --- Resolution: Fixed Issue resolved by pull request 5906 [https://github.com/apache/arrow/pull/5906] > [Python] Python support for fixed size list type > > > Key: ARROW-7261 > URL: https://issues.apache.org/jira/browse/ARROW-7261 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > I didn't see any issue about this, but {{FixedSizeListArray}} (ARROW-1280) is > not yet exposed in Python. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7362) [Python] ListArray.flatten() should take care of slicing offsets
[ https://issues.apache.org/jira/browse/ARROW-7362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992524#comment-16992524 ] Joris Van den Bossche commented on ARROW-7362: -- There was some discussion about this in ARROW-7031: https://github.com/apache/arrow/pull/5759, where it was said to not slice the values. Personally, I think it would be nice to have easy python access to the sliced values as well, but I also find it somewhat confusing to have {{.flatten()}} and {{.values}} differ. > [Python] ListArray.flatten() should take care of slicing offsets > > > Key: ARROW-7362 > URL: https://issues.apache.org/jira/browse/ARROW-7362 > Project: Apache Arrow > Issue Type: Bug >Reporter: Zhuo Peng >Assignee: Zhuo Peng >Priority: Minor > > Currently ListArray.flatten() simply returns the child array. If a ListArray > is a slice of another ListArray, they will share the same child array, > however the expected behavior (I think) of flatten() should be returning an > Array that's a concatenation of all the sub-lists in the ListArray, so the > slicing offset should be taken into account. > > For example: > a = pa.array([[1], [2], [3]]) > assert a.flatten().equals(pa.array([1,2,3])) > # expected: > a.slice(1).flatten().equals(pa.array([2, 3])) -- This message was sent by Atlassian Jira (v8.3.4#803005)
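The offset bookkeeping the issue describes can be shown with a plain-Python model of a list array (offsets + values). This is an illustration of the expected semantics, not pyarrow's implementation:

```python
# A list-array slice keeps the same offsets/values buffers and just
# remembers (start, length); a correct flatten() must read the first
# and last offsets of the *slice*, not return the whole child array.
def flatten_sliced(offsets, values, slice_start, slice_len):
    return values[offsets[slice_start]:offsets[slice_start + slice_len]]

offsets, values = [0, 1, 2, 3], [1, 2, 3]   # models pa.array([[1], [2], [3]])
assert flatten_sliced(offsets, values, 0, 3) == [1, 2, 3]
assert flatten_sliced(offsets, values, 1, 2) == [2, 3]   # models a.slice(1)
```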
[jira] [Updated] (ARROW-7041) [Python] PythonLibs setting found by CMake uses wrong version of Python on macOS
[ https://issues.apache.org/jira/browse/ARROW-7041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7041: - Component/s: Python > [Python] PythonLibs setting found by CMake uses wrong version of Python on > macOS > > > Key: ARROW-7041 > URL: https://issues.apache.org/jira/browse/ARROW-7041 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Christian Hudon >Priority: Major > > I'm trying to build the Python library and run its tests, so to do that I > need to first build the C++ library. I'm going through the Python Development > Guide part of the docs. When invoking CMake to build the C++ library, it > claims to have found PythonLibs here: > -- Found PythonLibs: > /usr/local/Cellar/python@2/2.7.16_1/Frameworks/Python.framework/Versions/2.7/lib/libpython3.7m.dylib > Just by looking at the whole path, it doesn't look like a promising location. > And indeed, there's no libpython3.7*.dylib file in the Python 2.7 install > directory. So the C++ build fails. > I'm on macOS 10.14.6. I have Python 2.7 and 3.7 both installed via Homebrew. > (There is a libpython3.7*.dylib file in the Python 3.7 install of Homebrew.) > For the Python build dependencies, I have a Python 3.7 venv active and they > are installed there via pip. This happens with -DARROW_PYTHON=ON. > This definitely looks like whatever piece of CMake code that is trying to > find PythonLibs is grabbing the first directory it finds, and appending a > path to the dylib without looking if it exists. However, I don't know much of > anything about CMake. Any suggestion for a fix or at least a workaround to > point CMake to the PythonLibs directory that would make more sense? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7266) [Python] dictionary_encode() of a slice gives wrong result
[ https://issues.apache.org/jira/browse/ARROW-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7266: - Fix Version/s: 1.0.0 > [Python] dictionary_encode() of a slice gives wrong result > -- > > Key: ARROW-7266 > URL: https://issues.apache.org/jira/browse/ARROW-7266 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.15.1 > Environment: Docker on Linux 5.2.18-200.fc30.x86_64; Python 3.7.4 >Reporter: Adam Hooper >Priority: Major > Fix For: 1.0.0 > > > Steps to reproduce: > {code:python} > import pyarrow as pa > arr = pa.array(["a", "b", "b", "b"])[1:] > arr.dictionary_encode() > {code} > Expected results: > {code} > -- dictionary: > [ > "b" > ] > -- indices: > [ > 0, > 0, > 0 > ] > {code} > Actual results: > {code} > -- dictionary: > [ > "b", > "" > ] > -- indices: > [ > 0, > 0, > 1 > ] > {code} > I don't know a workaround. Converting to pylist and back is too slow. Is > there a way to copy the slice to a new offset-0 StringArray that I could then > dictionary-encode? Otherwise, I'm considering building buffers by hand -- This message was sent by Atlassian Jira (v8.3.4#803005)
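The expected behaviour can be pinned down with a pure-Python dictionary encoder over the logical (sliced) values; only values the slice actually contains should reach the dictionary. A sketch, not the C++ kernel:

```python
def dictionary_encode(values):
    # Assign each distinct value the next index, in first-seen order.
    dictionary, indices, seen = [], [], {}
    for v in values:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        indices.append(seen[v])
    return dictionary, indices

arr = ["a", "b", "b", "b"]
dictionary, indices = dictionary_encode(arr[1:])   # encode the [1:] slice
assert dictionary == ["b"]
assert indices == [0, 0, 0]
```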
[jira] [Commented] (ARROW-7363) [Python] flatten() doesn't work on ChunkedArray
[ https://issues.apache.org/jira/browse/ARROW-7363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992519#comment-16992519 ] Joris Van den Bossche commented on ARROW-7363: -- From looking at the code, I _think_ that the ChunkedArray {{flatten()}} method maps to the StructArray.flatten() method, and not to the ListArray.flatten() method. StructArray and ListArray implement (somewhat unfortunately, maybe) a different flatten method: for StructArray it returns a list of arrays (one individual array for each field in the struct), while ListArray returns a new Array with one level of nesting removed (list array -> array, or list of list array -> list array, ...). I am not fully sure how to deal with this. Should ChunkedArray.flatten do something different depending on the type? (But it's also not nice that the return type is then variable.) Should we rename the {{flatten()}} method for ListArray? > [Python] flatten() doesn't work on ChunkedArray > --- > > Key: ARROW-7363 > URL: https://issues.apache.org/jira/browse/ARROW-7363 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.1 >Reporter: marc abboud >Priority: Major > > Flatten() doesn't work on ChunkedArray. It returns only the ChunkedArray in a > list without flattening anything. > {code:python} > aa = pa.array([[1],[2]]) > bb = pa.chunked_array([aa,aa]) > > bb.flatten() > Out[15]: > [ [ [ [ 1 ], [ 2 ] ], [ [ 1 ], [ 2 ] ] ]] > Expected: > [ [ 1, 2 ], [ 1, 2 ] ] > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
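The two competing flatten semantics described in the comment, with plain-Python stand-ins (illustrative only, not the pyarrow classes):

```python
# StructArray.flatten(): one output array per field of the struct.
def struct_flatten(rows, fields):
    return [[row[f] for row in rows] for f in fields]

# ListArray.flatten(): strip exactly one level of nesting.
def list_flatten(rows):
    return [x for sub in rows for x in sub]

assert struct_flatten([{"x": 1, "y": 2}, {"x": 3, "y": 4}], ["x", "y"]) == [[1, 3], [2, 4]]
assert list_flatten([[1], [2]]) == [1, 2]
```

Note the return shapes differ: a list of arrays versus a single array, which is exactly why dispatching on type would give ChunkedArray.flatten a variable return type.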
[jira] [Issue Comment Deleted] (ARROW-7305) [Python] High memory usage writing pyarrow.Table with large strings to parquet
[ https://issues.apache.org/jira/browse/ARROW-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Klichuk updated ARROW-7305: -- Comment: was deleted (was: Seems like it's the transformation of pandas to pyarrow.Table. If you transform manually {code:python} table = pyarrow.Table.from_pandas(data) {code} you'll see it's the thing doing the memory spike; writing this table to parquet looks light.) > [Python] High memory usage writing pyarrow.Table with large strings to parquet > -- > > Key: ARROW-7305 > URL: https://issues.apache.org/jira/browse/ARROW-7305 > Project: Apache Arrow > Issue Type: Task > Components: Python >Affects Versions: 0.15.1 > Environment: Mac OSX >Reporter: Bogdan Klichuk >Priority: Major > Labels: parquet > > My case of stored datasets is specific: I have large strings (1-100MB each). > Let's take for example a single row. > 43mb.csv is a 1-row CSV with 10 columns. One column is a 43MB string. > When I read this csv with pandas and then dump to parquet, my script consumes > 10x the 43MB. > With an increasing number of such rows the memory footprint overhead diminishes, but > I want to focus on this specific case. > Here's the footprint after running under memory profiler: > {code} > Line # Mem usage Increment Line Contents > > 4 48.9 MiB 48.9 MiB @profile > 5 def test(): > 6 143.7 MiB 94.7 MiB data = pd.read_csv('43mb.csv') > 7 498.6 MiB 354.9 MiB data.to_parquet('out.parquet') > {code} > Is this typical for parquet in case of big strings? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7363) [Python] flatten() doesn't work on ChunkedArray
[ https://issues.apache.org/jira/browse/ARROW-7363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7363: - Component/s: Python > [Python] flatten() doesn't work on ChunkedArray > --- > > Key: ARROW-7363 > URL: https://issues.apache.org/jira/browse/ARROW-7363 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.1 >Reporter: marc abboud >Priority: Major > > Flatten() doesn't work on ChunkedArray. It returns only the ChunkedArray in a > list without flattening anything. > {code:python} > aa = pa.array([[1],[2]]) > bb = pa.chunked_array([aa,aa]) > > bb.flatten() > Out[15]: > [ [ [ [ 1 ], [ 2 ] ], [ [ 1 ], [ 2 ] ] ]] > Expected: > [ [ 1, 2 ], [ 1, 2 ] ] > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7363) [Python] flatten() doesn't work on ChunkedArray
[ https://issues.apache.org/jira/browse/ARROW-7363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7363: - Summary: [Python] flatten() doesn't work on ChunkedArray (was: flatten() doesn't work on ChunkedArray) > [Python] flatten() doesn't work on ChunkedArray > --- > > Key: ARROW-7363 > URL: https://issues.apache.org/jira/browse/ARROW-7363 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.15.1 >Reporter: marc abboud >Priority: Major > > Flatten() doesn't work on ChunkedArray. It returns only the ChunkedArray in a > list without flattening anything. > {code:python} > aa = pa.array([[1],[2]]) > bb = pa.chunked_array([aa,aa]) > > bb.flatten() > Out[15]: > [ [ [ [ 1 ], [ 2 ] ], [ [ 1 ], [ 2 ] ] ]] > Expected: > [ [ 1, 2 ], [ 1, 2 ] ] > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7364) [Rust] Add cast options to cast kernel
[ https://issues.apache.org/jira/browse/ARROW-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-7364: -- Component/s: Rust > [Rust] Add cast options to cast kernel > -- > > Key: ARROW-7364 > URL: https://issues.apache.org/jira/browse/ARROW-7364 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Neville Dipale >Priority: Major > > The cast kernels currently do not take explicit options, but instead convert > overflows and invalid utf8 to nulls. We can create options that customise the > behaviour, similarly to CastOptions in C++ > ([https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.h#L38]) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7364) [Rust] Add cast options to cast kernel
Neville Dipale created ARROW-7364: - Summary: [Rust] Add cast options to cast kernel Key: ARROW-7364 URL: https://issues.apache.org/jira/browse/ARROW-7364 Project: Apache Arrow Issue Type: Improvement Reporter: Neville Dipale The cast kernels currently do not take explicit options, but instead convert overflows and invalid utf8 to nulls. We can create options that customise the behaviour, similarly to CastOptions in C++ ([https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.h#L38]) -- This message was sent by Atlassian Jira (v8.3.4#803005)
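A hedged Python sketch of what such options could look like (the option name `safe` and the error behaviour are assumptions for illustration, not the eventual Rust API): a "safe" cast turns out-of-range values into nulls, while an unsafe one surfaces the overflow as an error.

```python
def cast_to_int8(values, safe=True):
    # safe=True mirrors the current behaviour (overflow -> null);
    # safe=False raises instead of silently nulling the value.
    out = []
    for v in values:
        if v is not None and -128 <= v <= 127:
            out.append(int(v))
        elif v is None or safe:
            out.append(None)
        else:
            raise OverflowError(v)
    return out

assert cast_to_int8([1, 300, None]) == [1, None, None]
```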
[jira] [Created] (ARROW-7363) flatten() doesn't work on ChunkedArray
marc abboud created ARROW-7363: -- Summary: flatten() doesn't work on ChunkedArray Key: ARROW-7363 URL: https://issues.apache.org/jira/browse/ARROW-7363 Project: Apache Arrow Issue Type: Bug Affects Versions: 0.15.1 Reporter: marc abboud Flatten() doesn't work on ChunkedArray. It returns only the ChunkedArray in a list without flattening anything. {code:python} aa = pa.array([[1],[2]]) bb = pa.chunked_array([aa,aa]) bb.flatten() Out[15]: [ [ [ [ 1 ], [ 2 ] ], [ [ 1 ], [ 2 ] ] ]] Expected: [ [ 1, 2 ], [ 1, 2 ] ] {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)