[jira] [Updated] (ARROW-7370) [C++] Old Protobuf with AUTO detection fails

2019-12-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7370:
--
Labels: pull-request-available  (was: )

> [C++] Old Protobuf with AUTO detection fails
> 
>
> Key: ARROW-7370
> URL: https://issues.apache.org/jira/browse/ARROW-7370
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>
> {noformat}
> -- Could NOT find Protobuf: Found unsuitable version "3.6.1", but required is 
> at least "3.7.0" (found /usr/lib/x86_64-linux-gnu/libprotobuf.so;-pthread)
> Building Protocol Buffers from source
> CMake Error at cmake_modules/ThirdpartyToolchain.cmake:1179 (add_library):
>   add_library cannot create imported target "protobuf::libprotobuf" because
>   another target with the same name already exists.
> Call Stack (most recent call first):
>   cmake_modules/ThirdpartyToolchain.cmake:147 (build_protobuf)
>   cmake_modules/ThirdpartyToolchain.cmake:178 (build_dependency)
>   cmake_modules/ThirdpartyToolchain.cmake:1204 
> (resolve_dependency_with_version)
>   CMakeLists.txt:428 (include)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7370) [C++] Old Protobuf with AUTO detection fails

2019-12-10 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-7370:
---

 Summary: [C++] Old Protobuf with AUTO detection fails
 Key: ARROW-7370
 URL: https://issues.apache.org/jira/browse/ARROW-7370
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


{noformat}
-- Could NOT find Protobuf: Found unsuitable version "3.6.1", but required is 
at least "3.7.0" (found /usr/lib/x86_64-linux-gnu/libprotobuf.so;-pthread)
Building Protocol Buffers from source
CMake Error at cmake_modules/ThirdpartyToolchain.cmake:1179 (add_library):
  add_library cannot create imported target "protobuf::libprotobuf" because
  another target with the same name already exists.
Call Stack (most recent call first):
  cmake_modules/ThirdpartyToolchain.cmake:147 (build_protobuf)
  cmake_modules/ThirdpartyToolchain.cmake:178 (build_dependency)
  cmake_modules/ThirdpartyToolchain.cmake:1204 (resolve_dependency_with_version)
  CMakeLists.txt:428 (include)
{noformat}





[jira] [Updated] (ARROW-7369) [GLib] Add garrow_table_combine_chunks

2019-12-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7369:
--
Labels: pull-request-available  (was: )

> [GLib] Add garrow_table_combine_chunks
> --
>
> Key: ARROW-7369
> URL: https://issues.apache.org/jira/browse/ARROW-7369
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: GLib
>Reporter: Kenta Murata
>Assignee: Kenta Murata
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Created] (ARROW-7369) [GLib] Add garrow_table_combine_chunks

2019-12-10 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7369:
---

 Summary: [GLib] Add garrow_table_combine_chunks
 Key: ARROW-7369
 URL: https://issues.apache.org/jira/browse/ARROW-7369
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Kenta Murata
Assignee: Kenta Murata








[jira] [Commented] (ARROW-7272) [C++][Java] JNI bridge between RecordBatch and VectorSchemaRoot

2019-12-10 Thread Hongze Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16993151#comment-16993151
 ] 

Hongze Zhang commented on ARROW-7272:
-

Hi guys, would you suggest just using the existing 
*org.apache.arrow.vector.ipc.ArrowReader*? We already have a similar approach 
in the ORC adapter and it works fine. As schemas in the Datasets API are always 
predefined, I think we don't have to convert the schema every time.

> [C++][Java] JNI bridge between RecordBatch and VectorSchemaRoot
> ---
>
> Key: ARROW-7272
> URL: https://issues.apache.org/jira/browse/ARROW-7272
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Java
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> Given a C++ std::shared_ptr<RecordBatch>, retrieve it in Java as a 
> VectorSchemaRoot. Gandiva already offers a similar facility, but with raw 
> buffers. It would be convenient if users could call C++ code that yields a 
> RecordBatch and retrieve it in a seamless fashion.
> This would remove one roadblock to using the C++ dataset facility in Java.





[jira] [Updated] (ARROW-7368) [Ruby] Use :arrow_file and :arrow_stream for format name

2019-12-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7368:
--
Labels: pull-request-available  (was: )

> [Ruby] Use :arrow_file and :arrow_stream for format name
> 
>
> Key: ARROW-7368
> URL: https://issues.apache.org/jira/browse/ARROW-7368
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Ruby
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Created] (ARROW-7368) [Ruby] Use :arrow_file and :arrow_stream for format name

2019-12-10 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-7368:
---

 Summary: [Ruby] Use :arrow_file and :arrow_stream for format name
 Key: ARROW-7368
 URL: https://issues.apache.org/jira/browse/ARROW-7368
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Ruby
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Resolved] (ARROW-6965) [C++][Dataset] Optionally expose partition keys as materialized columns

2019-12-10 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-6965.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5950
[https://github.com/apache/arrow/pull/5950]

> [C++][Dataset] Optionally expose partition keys as materialized columns
> ---
>
> Key: ARROW-6965
> URL: https://issues.apache.org/jira/browse/ARROW-6965
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Francois Saint-Jacques
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 7h 50m
>  Remaining Estimate: 0h
>
> This would be exposed in the DataSourceDiscovery as an option.





[jira] [Resolved] (ARROW-7361) [Rust] Build directory is not passed to ci/scripts/rust_test.sh

2019-12-10 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-7361.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 6004
[https://github.com/apache/arrow/pull/6004]

> [Rust] Build directory is not passed to ci/scripts/rust_test.sh
> ---
>
> Key: ARROW-7361
> URL: https://issues.apache.org/jira/browse/ARROW-7361
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See build https://github.com/apache/arrow/runs/340751277





[jira] [Updated] (ARROW-7366) [C++][Dataset] Use PartitionSchemeDiscovery in DataSourceDiscovery

2019-12-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7366:
--
Labels: dataset pull-request-available  (was: dataset)

> [C++][Dataset] Use PartitionSchemeDiscovery in DataSourceDiscovery
> --
>
> Key: ARROW-7366
> URL: https://issues.apache.org/jira/browse/ARROW-7366
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Dataset
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
>
> https://github.com/apache/arrow/pull/5950 introduces 
> {{PartitionSchemeDiscovery}}, but ideally it would be supplied as an option 
> to data source discovery and the partition schema automatically discovered 
> based on the file paths accumulated then.





[jira] [Updated] (ARROW-7367) [Python] Use np.full instead of np.array.repeat in ParquetDatasetPiece

2019-12-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7367:
--
Labels: pull-request-available  (was: )

> [Python] Use np.full instead of np.array.repeat in ParquetDatasetPiece
> --
>
> Key: ARROW-7367
> URL: https://issues.apache.org/jira/browse/ARROW-7367
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Xavier Lacroze
>Priority: Trivial
>  Labels: pull-request-available
>
> For small tables (len < 100) execution time is slightly degraded (~ x1.4 at 
> len = 10); for large ones the performance gain is huge (exec time ~ x0.04 at 
> len = 100_000)





[jira] [Updated] (ARROW-7367) [Python] Use np.full instead of np.array.repeat in ParquetDatasetPiece

2019-12-10 Thread Xavier Lacroze (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xavier Lacroze updated ARROW-7367:
--
Summary: [Python] Use np.full instead of np.array.repeat in 
ParquetDatasetPiece  (was: Use np.full instead of np.array.repeat in 
ParquetDatasetPiece)

> [Python] Use np.full instead of np.array.repeat in ParquetDatasetPiece
> --
>
> Key: ARROW-7367
> URL: https://issues.apache.org/jira/browse/ARROW-7367
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Xavier Lacroze
>Priority: Trivial
>
> For small tables (len < 100) execution time is slightly degraded (~ x1.4 at 
> len = 10); for large ones the performance gain is huge (exec time ~ x0.04 at 
> len = 100_000)





[jira] [Created] (ARROW-7367) Use np.full instead of np.array.repeat in ParquetDatasetPiece

2019-12-10 Thread Xavier Lacroze (Jira)
Xavier Lacroze created ARROW-7367:
-

 Summary: Use np.full instead of np.array.repeat in 
ParquetDatasetPiece
 Key: ARROW-7367
 URL: https://issues.apache.org/jira/browse/ARROW-7367
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Xavier Lacroze


For small tables (len < 100) execution time is slightly degraded (~ x1.4 at len 
= 10); for large ones the performance gain is huge (exec time ~ x0.04 at len = 
100_000)
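The change can be sketched in plain NumPy (the function names below are illustrative, not the actual ParquetDatasetPiece internals):

```python
import numpy as np

def partition_column_repeat(value, length):
    # Old approach: build a 0-d array, then repeat it `length` times.
    return np.array(value).repeat(length)

def partition_column_full(value, length):
    # New approach: allocate and fill in one step.
    return np.full(length, value)

# Both produce the same column of repeated partition-key values.
assert np.array_equal(partition_column_repeat(7, 5),
                      partition_column_full(7, 5))
```

np.full avoids the intermediate 0-d array and scales better with length, matching the reported speedup at large sizes.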





[jira] [Created] (ARROW-7366) [C++][Dataset] Use PartitionSchemeDiscovery in DataSourceDiscovery

2019-12-10 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7366:
---

 Summary: [C++][Dataset] Use PartitionSchemeDiscovery in 
DataSourceDiscovery
 Key: ARROW-7366
 URL: https://issues.apache.org/jira/browse/ARROW-7366
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Dataset
Reporter: Ben Kietzman
Assignee: Ben Kietzman


https://github.com/apache/arrow/pull/5950 introduces 
{{PartitionSchemeDiscovery}}, but ideally it would be supplied as an option to 
data source discovery and the partition schema automatically discovered based 
on the file paths accumulated then.





[jira] [Commented] (ARROW-7343) [Java] Memory leak in Flight DoGet when client cancels

2019-12-10 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992701#comment-16992701
 ] 

David Li commented on ARROW-7343:
-

I've implemented the fix in the linked PR. I also found that DoPut could leak 
for different reasons (we call gRPC methods that can throw, causing us to skip 
cleanup), which I've fixed.

> [Java] Memory leak in Flight DoGet when client cancels
> --
>
> Key: ARROW-7343
> URL: https://issues.apache.org/jira/browse/ARROW-7343
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Java
>Affects Versions: 0.14.0
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I believe this causes things like ARROW-4765.
> -If a stream is interrupted or otherwise not drained by the client, the 
> serialized form of the ArrowMessage (DrainableByteBufInputStream) will sit 
> around forever, leaking memory.-





[jira] [Updated] (ARROW-7362) [Python] ListArray.flatten() should take care of slicing offsets

2019-12-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7362:
--
Labels: pull-request-available  (was: )

> [Python] ListArray.flatten() should take care of slicing offsets
> 
>
> Key: ARROW-7362
> URL: https://issues.apache.org/jira/browse/ARROW-7362
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Zhuo Peng
>Assignee: Zhuo Peng
>Priority: Minor
>  Labels: pull-request-available
>
> Currently ListArray.flatten() simply returns the child array. If a ListArray 
> is a slice of another ListArray, they will share the same child array; 
> however, the expected behavior (I think) of flatten() is to return an 
> Array that is the concatenation of all the sub-lists in the ListArray, so the 
> slicing offset should be taken into account.
>  
> For example:
> a = pa.array([[1], [2], [3]])
> assert a.flatten().equals(pa.array([1,2,3]))
> # expected:
> a.slice(1).flatten().equals(pa.array([2, 3]))





[jira] [Commented] (ARROW-7362) [Python] ListArray.flatten() should take care of slicing offsets

2019-12-10 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992636#comment-16992636
 ] 

Joris Van den Bossche commented on ARROW-7362:
--

Another option could be to adjust the offsets so they point into the sliced 
values. But this would then not be a zero-copy access of the offsets, which 
probably makes it a bad idea.

> [Python] ListArray.flatten() should take care of slicing offsets
> 
>
> Key: ARROW-7362
> URL: https://issues.apache.org/jira/browse/ARROW-7362
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Zhuo Peng
>Assignee: Zhuo Peng
>Priority: Minor
>
> Currently ListArray.flatten() simply returns the child array. If a ListArray 
> is a slice of another ListArray, they will share the same child array; 
> however, the expected behavior (I think) of flatten() is to return an 
> Array that is the concatenation of all the sub-lists in the ListArray, so the 
> slicing offset should be taken into account.
>  
> For example:
> a = pa.array([[1], [2], [3]])
> assert a.flatten().equals(pa.array([1,2,3]))
> # expected:
> a.slice(1).flatten().equals(pa.array([2, 3]))





[jira] [Commented] (ARROW-7362) [Python] ListArray.flatten() should take care of slicing offsets

2019-12-10 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992631#comment-16992631
 ] 

Joris Van den Bossche commented on ARROW-7362:
--

Yes, the main thing is that {{offsets}} and one of {{values}}/{{flatten()}} 
need to match. Currently I implemented {{offsets}} such that they are sliced 
themselves but point into the unsliced values.

> [Python] ListArray.flatten() should take care of slicing offsets
> 
>
> Key: ARROW-7362
> URL: https://issues.apache.org/jira/browse/ARROW-7362
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Zhuo Peng
>Assignee: Zhuo Peng
>Priority: Minor
>
> Currently ListArray.flatten() simply returns the child array. If a ListArray 
> is a slice of another ListArray, they will share the same child array; 
> however, the expected behavior (I think) of flatten() is to return an 
> Array that is the concatenation of all the sub-lists in the ListArray, so the 
> slicing offset should be taken into account.
>  
> For example:
> a = pa.array([[1], [2], [3]])
> assert a.flatten().equals(pa.array([1,2,3]))
> # expected:
> a.slice(1).flatten().equals(pa.array([2, 3]))





[jira] [Commented] (ARROW-7362) [Python] ListArray.flatten() should take care of slicing offsets

2019-12-10 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992628#comment-16992628
 ] 

Wes McKinney commented on ARROW-7362:
-

I can't remember what the argument was before (maybe I was making it, sorry), 
but I think it would be OK for {{flatten()}} to return the sliced values, while 
{{.values}} does need to return the unsliced values, I think. As long as an 
appropriate caveat is added to the docstring saying that the offsets should not 
be used (for random access purposes) with the result of {{flatten()}}.

> [Python] ListArray.flatten() should take care of slicing offsets
> 
>
> Key: ARROW-7362
> URL: https://issues.apache.org/jira/browse/ARROW-7362
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Zhuo Peng
>Assignee: Zhuo Peng
>Priority: Minor
>
> Currently ListArray.flatten() simply returns the child array. If a ListArray 
> is a slice of another ListArray, they will share the same child array; 
> however, the expected behavior (I think) of flatten() is to return an 
> Array that is the concatenation of all the sub-lists in the ListArray, so the 
> slicing offset should be taken into account.
>  
> For example:
> a = pa.array([[1], [2], [3]])
> assert a.flatten().equals(pa.array([1,2,3]))
> # expected:
> a.slice(1).flatten().equals(pa.array([2, 3]))





[jira] [Updated] (ARROW-7227) [Python] Provide wrappers for ConcatenateWithPromotion()

2019-12-10 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7227:

Component/s: Python

> [Python] Provide wrappers for ConcatenateWithPromotion()
> 
>
> Key: ARROW-7227
> URL: https://issues.apache.org/jira/browse/ARROW-7227
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Zhuo Peng
>Assignee: Zhuo Peng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/arrow/pull/5534] Introduced 
> ConcatenateWithPromotion() to C++. Provide a Python wrapper for it.
>  





[jira] [Resolved] (ARROW-7227) [Python] Provide wrappers for ConcatenateWithPromotion()

2019-12-10 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-7227.
-
Resolution: Fixed

Issue resolved by pull request 5804
[https://github.com/apache/arrow/pull/5804]

> [Python] Provide wrappers for ConcatenateWithPromotion()
> 
>
> Key: ARROW-7227
> URL: https://issues.apache.org/jira/browse/ARROW-7227
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Zhuo Peng
>Assignee: Zhuo Peng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/arrow/pull/5534] Introduced 
> ConcatenateWithPromotion() to C++. Provide a Python wrapper for it.
>  





[jira] [Commented] (ARROW-7305) [Python] High memory usage writing pyarrow.Table with large strings to parquet

2019-12-10 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992615#comment-16992615
 ] 

Wes McKinney commented on ARROW-7305:
-

There may be some things we could do about this. Do you have an example file we 
could use to help with profiling the internal memory allocations during the 
write process? 

> [Python] High memory usage writing pyarrow.Table with large strings to parquet
> --
>
> Key: ARROW-7305
> URL: https://issues.apache.org/jira/browse/ARROW-7305
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Affects Versions: 0.15.1
> Environment: Mac OSX
>Reporter: Bogdan Klichuk
>Priority: Major
>  Labels: parquet
>
> My use case for stored datasets is specific: I have large strings (1-100MB 
> each).
> Let's take a single row as an example.
> 43mb.csv is a 1-row CSV with 10 columns; one column is a 43MB string.
> When I read this CSV with pandas and then dump it to parquet, my script 
> consumes 10x the 43MB.
> As the number of such rows increases, the memory footprint overhead 
> diminishes, but I want to focus on this specific case.
> Here's the footprint after running using memory profiler:
> {code:java}
> Line #    Mem usage    Increment   Line Contents
> ================================================
>      4     48.9 MiB     48.9 MiB   @profile
>      5                             def test():
>      6    143.7 MiB     94.7 MiB       data = pd.read_csv('43mb.csv')
>      7    498.6 MiB    354.9 MiB       data.to_parquet('out.parquet')
> {code}
> Is this typical for parquet in case of big strings?





[jira] [Commented] (ARROW-6775) Proposal for several Array utility functions

2019-12-10 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992607#comment-16992607
 ] 

Joris Van den Bossche commented on ARROW-6775:
--

[~brillsp] thanks for opening the issue, and sorry for the slow reply.

I would recommend opening specific issues for the different items you mention 
(or after some more feedback, if we think they would be good to add).

{quote}1/ ListLengthFromListArray(ListArray&): Returns the lengths of lists in 
a ListArray, as an Int32Array (or Int64Array for large lists). For example:{quote}

This can relatively easily be calculated from the offsets, I think? (And the 
offsets are now exposed in Python.)

{quote}3/ GetArrayNullBitmapAsByteArray(Array&): Returns the array's null 
bitmap as a UInt8Array (which can be efficiently converted to a bool numpy 
array){quote}

I think this is certainly something we want to add somehow. This is also 
related to exposing an "IsNull" that returns a BooleanArray from the bitmap; 
see ARROW-971 and the discussion in the PR. 
Maybe a utility to convert the bitmap to a BooleanArray is more general, as the 
conversion from a BooleanArray to a bool/int8 numpy array is already 
implemented.

{quote}4/ GetFlattenedArrayParentIndices(ListArray&)

Makes an int32 array of the same length as the flattened ListArray. 
returned_array[i] == j means i-th element in the flattened ListArray came from 
j-th list in the ListArray.

For example [[1,2,3], [], None, [4,5]] => [0, 0, 0, 3, 3]{quote}

Can you explain this one a bit more?
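For what it's worth, one reading of item 4 is a repeat over the per-list lengths; this NumPy sketch is illustrative only, not the proposed C++ API:

```python
import numpy as np

# [[1, 2, 3], [], None, [4, 5]] has per-list lengths [3, 0, 0, 2];
# repeating each list's index by its length gives the parent index
# of every element in the flattened array.
lengths = np.array([3, 0, 0, 2])
parent_indices = np.repeat(np.arange(len(lengths)), lengths)
print(parent_indices.tolist())  # [0, 0, 0, 3, 3]
```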

> Proposal for several Array utility functions
> 
>
> Key: ARROW-6775
> URL: https://issues.apache.org/jira/browse/ARROW-6775
> Project: Apache Arrow
>  Issue Type: Wish
>Reporter: Zhuo Peng
>Priority: Minor
>
> Hi,
> We developed several utilities that compute / access certain properties of 
> Arrays and wonder if it makes sense to get them into the upstream (into both 
> the C++ API and pyarrow) and, assuming yes, where is the best place to put 
> them?
> Maybe I have overlooked existing APIs that already do the same; in that case 
> please point them out.
>  
> 1/ ListLengthFromListArray(ListArray&)
> Returns the lengths of lists in a ListArray, as an Int32Array (or Int64Array for 
> large lists). For example:
> [[1, 2, 3], [], None] => [3, 0, 0] (or [3, 0, None], but we hope the returned 
> array can be converted to numpy)
>  
> 2/ GetBinaryArrayTotalByteSize(BinaryArray&)
> Returns the total byte size of a BinaryArray (basically offset[len - 1] - 
> offset[0]).
> Alternatively, a BinaryArray::Flatten() -> Uint8Array would work.
>  
> 3/ GetArrayNullBitmapAsByteArray(Array&)
> Returns the array's null bitmap as a UInt8Array (which can be efficiently 
> converted to a bool numpy array)
>  
> 4/ GetFlattenedArrayParentIndices(ListArray&)
> Makes an int32 array of the same length as the flattened ListArray. 
> returned_array[i] == j means i-th element in the flattened ListArray came 
> from j-th list in the ListArray.
> For example [[1,2,3], [], None, [4,5]] => [0, 0, 0, 3, 3]
>  





[jira] [Commented] (ARROW-5303) [Rust] Add SIMD vectorization of numeric casts

2019-12-10 Thread Andy Thomason (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992598#comment-16992598
 ] 

Andy Thomason commented on ARROW-5303:
--

It can be quite daunting. I'm happy to help with understanding the asm. I spent 
seven years teaching it to game programmers! I'm also quite old and grew up in 
a time when you wrote the instructions out by hand in hex.

Matt's website is a godsend (to use a horrible pun).
{code:java}
.LBB0_5:
vpmovzxbd   ymm0, qword ptr [rdx + rcx]
vpmovzxbd   ymm1, qword ptr [rdx + rcx + 8]
vpmovzxbd   ymm2, qword ptr [rdx + rcx + 16]
vpmovzxbd   ymm3, qword ptr [rdx + rcx + 24]
vmovdqu ymmword ptr [rdi + 4*rcx], ymm0
vmovdqu ymmword ptr [rdi + 4*rcx + 32], ymm1
vmovdqu ymmword ptr [rdi + 4*rcx + 64], ymm2
vmovdqu ymmword ptr [rdi + 4*rcx + 96], ymm3
add rcx, 32
cmp rax, rcx
jne .LBB0_5
{code}
The first instruction, "vpmovzxbd", loads and converts 8 bytes of u8 to 32 bytes 
of u32.

The "vmovdqu" instructions each do an unaligned store of the value to 32 
bytes of memory. Note that the index goes up by 8 and 32 in each case.

The last two instructions are just the loop management.

The instructions themselves have almost zero cost, but writing the data out 
through the cache could be very expensive.

The thing to look for here is lots of ymm or zmm registers and counters going 
up in large increments. You don't need to know every instruction, but this kind 
of pattern (four loads, four stores, loop) is about as good as it gets.

The loads occur in groups of four because there is a large latency on every 
instruction. We can start lots of them per cycle but it will take many cycles 
to get the data to RAM. Think of it as a production line with people fetching 
data from a warehouse and putting it on a conveyor belt and then taking it off 
and carrying it to another warehouse.

The conveyor belts can be quite long, but we can put lots of data on the belt 
at the same time.

 

 

> [Rust] Add SIMD vectorization of numeric casts
> --
>
> Key: ARROW-5303
> URL: https://issues.apache.org/jira/browse/ARROW-5303
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 0.13.0
>Reporter: Neville Dipale
>Priority: Minor
>
> To improve the performance of cast kernels, we need SIMD support in numeric 
> casts.
> An initial exploration shows that we can't trivially add SIMD casts between 
> our Arrow T::Simd types, because `packed_simd` only supports a cast between 
> T::Simd types that have the same number of lanes.
> This means that adding casts from f64 to i64 (same lane length) satisfies the 
> trait bound `where TO::Simd: packed_simd::FromCast<FROM::Simd>`, but f64 to 
> i32 (different lane length) doesn't.
> We would benefit from investigating work-arounds to this limitation. Please 
> see 
> [github::nevi_me::arrow/{branch:simd-cast}/../kernels/cast.rs|https://github.com/nevi-me/arrow/blob/simd-cast/rust/arrow/src/compute/kernels/cast.rs#L601]
>  for an example implementation that's limited by the differences in lane 
> length.





[jira] [Commented] (ARROW-7350) [Python] Parquet file metadata min and max statistics not decoded from bytes for Decimal data types

2019-12-10 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992597#comment-16992597
 ] 

Joris Van den Bossche commented on ARROW-7350:
--

[~max.firman] Thanks for the report!

Such a conversion would fit in the {{_box_logical_type_value}} function 
(https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L250-L294)
 that already handles conversion of raw values to Python types for, e.g., 
timestamps.

We would only need to check whether we already have a conversion utility from 
bytes to Decimal.
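For reference, Parquet stores DECIMAL statistics as big-endian two's-complement unscaled integers, so such a utility could be sketched as follows (the function name is made up, not an existing pyarrow helper):

```python
from decimal import Decimal

def decode_decimal_stat(raw: bytes, scale: int) -> Decimal:
    # Parquet DECIMAL min/max are big-endian two's-complement unscaled
    # integers; apply the column's scale to recover the value.
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)

# 123.45 at scale 2 is stored as the unscaled integer 12345.
raw = (12345).to_bytes(2, "big", signed=True)
print(decode_decimal_stat(raw, 2))  # 123.45
```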

> [Python] Parquet file metadata min and max statistics not decoded from bytes 
> for Decimal data types
> ---
>
> Key: ARROW-7350
> URL: https://issues.apache.org/jira/browse/ARROW-7350
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
>Reporter: Max Firman
>Priority: Major
>
> Parquet file metadata for Decimal type columns contain min and max values 
> that are not decoded from bytes into Decimals. This causes issues in 
> dependent libraries like Dask (see 
> [https://github.com/dask/dask/issues/5647]).
>  
> {code:python|title=Reproducible example|borderStyle=solid}
> from decimal import Decimal
> import random
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow as pa
> NUM_DATA_POINTS_PER_PARTITION = 25
> random.seed(0)
> data1 = [{"col1": Decimal(f"{random.randint(0, 999)}.{random.randint(0, 
> 99)}")} for i in range(NUM_DATA_POINTS_PER_PARTITION)]
> df = pd.DataFrame(data1)
> table = pa.Table.from_pandas(df)
> pq.write_table(table, 'my_data.parquet')
> parquet_file = pq.ParquetFile('my_data.parquet')
> assert 
> isinstance(parquet_file.metadata.row_group(0).column(0).statistics.min, 
> Decimal) # <-- AssertionError here because min has type bytes rather than 
> Decimal
> assert 
> isinstance(parquet_file.metadata.row_group(0).column(0).statistics.max, 
> Decimal)
> {code}
>  
>  
>  





[jira] [Updated] (ARROW-7336) [C++] Implement MinMax options to not skip nulls

2019-12-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7336:
-
Summary: [C++] Implement MinMax options to not skip nulls  (was: implement 
minmax options)

> [C++] Implement MinMax options to not skip nulls
> 
>
> Key: ARROW-7336
> URL: https://issues.apache.org/jira/browse/ARROW-7336
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Compute
>Reporter: Yuan Zhou
>Assignee: Yuan Zhou
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The minmax kernel has MinMaxOptions but does not use them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5303) [Rust] Add SIMD vectorization of numeric casts

2019-12-10 Thread Andy Thomason (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992584#comment-16992584
 ] 

Andy Thomason commented on ARROW-5303:
--

"if" statements (or match) in loops do not always have a happy ending.

I would just do the cast regardless of validity and copy the null bitmap 
from the source to the destination. In theory, the value of a null data item 
should be disregarded anyway.

Vec<Option<T>> is never going to be efficient, as it takes at least 2n 
bytes per element because of alignment and has terrible access patterns.
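A minimal model of that suggestion (pure Python, hypothetical helper name; plain lists stand in for Arrow value buffers and validity bitmaps):

```python
def cast_copy_validity(values, validity, to=float):
    # Cast every slot unconditionally -- no branch on validity inside
    # the hot loop -- then copy the validity bitmap from source to
    # destination. Values in null slots are defined but never observed,
    # matching Arrow's contract for null entries.
    return [to(v) for v in values], list(validity)
```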

> [Rust] Add SIMD vectorization of numeric casts
> --
>
> Key: ARROW-5303
> URL: https://issues.apache.org/jira/browse/ARROW-5303
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 0.13.0
>Reporter: Neville Dipale
>Priority: Minor
>
> To improve the performance of cast kernels, we need SIMD support in numeric 
> casts.
> An initial exploration shows that we can't trivially add SIMD casts between 
> our Arrow T::Simd types, because `packed_simd` only supports casts between 
> T::Simd types that have the same number of lanes.
> This means that casting from f64 to i64 (same lane count) satisfies the 
> trait bound `where TO::Simd: packed_simd::FromCast<FROM::Simd>`, but f64 to 
> i32 (different lane count) doesn't.
> We would benefit from investigating work-arounds to this limitation. Please 
> see 
> [https://github.com/nevi-me/arrow/blob/simd-cast/rust/arrow/src/compute/kernels/cast.rs#L601]
>  for an example implementation that's limited by the differences in lane 
> length.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7365) [Python] Support FixedSizeList type in conversion to numpy/pandas

2019-12-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7365:
-
Fix Version/s: 1.0.0

> [Python] Support FixedSizeList type in conversion to numpy/pandas
> -
>
> Key: ARROW-7365
> URL: https://issues.apache.org/jira/browse/ARROW-7365
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 1.0.0
>
>
> Follow-up on ARROW-7261, still need to add support for FixedSizeListType in 
> the arrow -> python conversion (arrow_to_pandas.cc)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7365) [Python] Support FixedSizeList type in conversion to numpy/pandas

2019-12-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7365:


 Summary: [Python] Support FixedSizeList type in conversion to 
numpy/pandas
 Key: ARROW-7365
 URL: https://issues.apache.org/jira/browse/ARROW-7365
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Follow-up on ARROW-7261, still need to add support for FixedSizeListType in the 
arrow -> python conversion (arrow_to_pandas.cc)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7261) [Python] Python support for fixed size list type

2019-12-10 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-7261:
-

Assignee: Joris Van den Bossche

> [Python] Python support for fixed size list type
> 
>
> Key: ARROW-7261
> URL: https://issues.apache.org/jira/browse/ARROW-7261
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> I didn't see any issue about this, but {{FixedSizeListArray}} (ARROW-1280) is 
> not yet exposed in Python.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7261) [Python] Python support for fixed size list type

2019-12-10 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-7261.
---
Resolution: Fixed

Issue resolved by pull request 5906
[https://github.com/apache/arrow/pull/5906]

> [Python] Python support for fixed size list type
> 
>
> Key: ARROW-7261
> URL: https://issues.apache.org/jira/browse/ARROW-7261
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> I didn't see any issue about this, but {{FixedSizeListArray}} (ARROW-1280) is 
> not yet exposed in Python.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7362) [Python] ListArray.flatten() should take care of slicing offsets

2019-12-10 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992524#comment-16992524
 ] 

Joris Van den Bossche commented on ARROW-7362:
--

There was some discussion about this in ARROW-7031: 
https://github.com/apache/arrow/pull/5759, where it was said to not slice the 
values. 

Personally, I think it would be nice to have easy python access to the sliced 
values as well, but I also find it somewhat confusing to have {{.flatten()}} 
and {{.values}} differ.

> [Python] ListArray.flatten() should take care of slicing offsets
> 
>
> Key: ARROW-7362
> URL: https://issues.apache.org/jira/browse/ARROW-7362
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Zhuo Peng
>Assignee: Zhuo Peng
>Priority: Minor
>
> Currently ListArray.flatten() simply returns the child array. If a ListArray 
> is a slice of another ListArray, they will share the same child array, 
> however the expected behavior (I think) of flatten() should be returning an 
> Array that's a concatenation of all the sub-lists in the ListArray, so the 
> slicing offset should be taken into account.
>  
> For example:
> a = pa.array([[1], [2], [3]])
> assert a.flatten().equals(pa.array([1,2,3]))
> # expected:
> assert a.slice(1).flatten().equals(pa.array([2, 3]))
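The offset-aware behavior can be sketched against the raw ListArray layout (hypothetical helper; plain lists stand in for Arrow buffers):

```python
def flatten_list_array(offsets, values, start, stop):
    # A ListArray stores len+1 offsets into a shared child `values`
    # array. A slice [start, stop) of the list array logically covers
    # values[offsets[start]:offsets[stop]], so flatten() must apply
    # the slice offset instead of returning the whole child array.
    return values[offsets[start]:offsets[stop]]
```

For `pa.array([[1], [2], [3]])` the offsets are `[0, 1, 2, 3]`; a slice starting at 1 should flatten to `[2, 3]`, not `[1, 2, 3]`.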



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7041) [Python] PythonLibs setting found by CMake uses wrong version of Python on macOS

2019-12-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7041:
-
Component/s: Python

> [Python] PythonLibs setting found by CMake uses wrong version of Python on 
> macOS
> 
>
> Key: ARROW-7041
> URL: https://issues.apache.org/jira/browse/ARROW-7041
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Christian Hudon
>Priority: Major
>
> I'm trying to build the Python library and run its tests, so to do that I 
> need to first build the C++ library. I'm going through the Python Development 
> Guide part of the docs. When invoking CMake to build the C++ library, it 
> claims to have found PythonLibs here:
> -- Found PythonLibs: 
> /usr/local/Cellar/python@2/2.7.16_1/Frameworks/Python.framework/Versions/2.7/lib/libpython3.7m.dylib
> Just by looking at the whole path, it doesn't look like a promising location. 
> And indeed, there's no libpython3.7*.dylib file in the Python 2.7 install 
> directory. So the C++ build fails.
> I'm on macOS 10.14.6. I have Python 2.7 and 3.7 both installed via Homebrew. 
> (There is a libpython3.7*.dylib file in the Python 3.7 install of Homebrew.) 
> For the Python build dependencies, I have a Python 3.7 venv active and they 
> are installed there via pip. This happens with -DARROW_PYTHON=ON.
> This definitely looks like whatever piece of CMake code is trying to 
> find PythonLibs is grabbing the first directory it finds and appending the 
> dylib path without checking whether it exists. However, I don't know much 
> about CMake. Any suggestion for a fix, or at least a workaround to point 
> CMake at a PythonLibs directory that makes more sense?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7266) [Python] dictionary_encode() of a slice gives wrong result

2019-12-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7266:
-
Fix Version/s: 1.0.0

> [Python] dictionary_encode() of a slice gives wrong result
> --
>
> Key: ARROW-7266
> URL: https://issues.apache.org/jira/browse/ARROW-7266
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.1
> Environment: Docker on Linux 5.2.18-200.fc30.x86_64; Python 3.7.4
>Reporter: Adam Hooper
>Priority: Major
> Fix For: 1.0.0
>
>
> Steps to reproduce:
> {code:python}
> import pyarrow as pa
> arr = pa.array(["a", "b", "b", "b"])[1:]
> arr.dictionary_encode()
> {code}
> Expected results:
> {code}
> -- dictionary:
>   [
> "b"
>   ]
> -- indices:
>   [
> 0,
> 0,
> 0
>   ]
> {code}
> Actual results:
> {code}
> -- dictionary:
>   [
> "b",
> ""
>   ]
> -- indices:
>   [
> 0,
> 0,
> 1
>   ]
> {code}
> I don't know a workaround. Converting to a pylist and back is too slow. Is 
> there a way to copy the slice to a new offset-0 StringArray that I could then 
> dictionary-encode? Otherwise, I'm considering building the buffers by hand.
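For reference, the expected slice-aware behavior can be modeled in plain Python (a hypothetical sketch over the logical values only, not the C++ kernel):

```python
def dictionary_encode(values):
    # Encode only the logical (post-slice) values: the dictionary must
    # not leak entries from outside the slice, and indices must refer
    # to this dictionary.
    dictionary, indices, seen = [], [], {}
    for v in values:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        indices.append(seen[v])
    return dictionary, indices
```

On the sliced values `["b", "b", "b"]` this yields dictionary `["b"]` and indices `[0, 0, 0]`, matching the expected result above.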



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7363) [Python] flatten() doesn't work on ChunkedArray

2019-12-10 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992519#comment-16992519
 ] 

Joris Van den Bossche commented on ARROW-7363:
--

From looking at the code, I _think_ that the ChunkedArray {{flatten()}} method 
maps to the StructArray.flatten() method, and not to the ListArray.flatten() 
method. 

StructArray and ListArray implement (somewhat unfortunately, maybe) different 
flatten methods: for StructArray it returns a list of arrays (one 
individual array for each field in the struct), while for ListArray it returns a new 
Array with one level of nesting removed (list array -> array, or list of list 
array -> list array, ..). 

I am not fully sure how to deal with this. Should ChunkedArray.flatten do 
something different depending on the type? (But it's also not nice that the 
return type is then variable.) Should we rename the {{flatten()}} method for 
ListArrays?
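The two flatten semantics described above can be sketched side by side (hypothetical pure-Python models; dicts and lists stand in for Arrow arrays):

```python
def flatten_struct(arrays_by_field):
    # StructArray.flatten(): returns one child array per struct field.
    return list(arrays_by_field.values())

def flatten_list(list_values):
    # ListArray.flatten(): removes one level of nesting.
    return [x for sub in list_values for x in sub]
```

The return types differ: a list of arrays versus a single array, which is the core of the naming problem.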

> [Python] flatten() doesn't work on ChunkedArray
> ---
>
> Key: ARROW-7363
> URL: https://issues.apache.org/jira/browse/ARROW-7363
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
>Reporter: marc abboud
>Priority: Major
>
> Flatten() doesn't work on ChunkedArray. It returns only the ChunkedArray in a 
> list without flattening anything.
> {code:python}
> aa = pa.array([[1], [2]])
> bb = pa.chunked_array([aa, aa])
> bb.flatten()
> Out[15]:
> [ [ [ [ 1 ], [ 2 ] ], [ [ 1 ], [ 2 ] ] ]]
> Expected:
> [ [ 1, 2 ], [ 1, 2 ] ]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (ARROW-7305) [Python] High memory usage writing pyarrow.Table with large strings to parquet

2019-12-10 Thread Bogdan Klichuk (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Klichuk updated ARROW-7305:
--
Comment: was deleted

(was: Seems like it's the transformation of pandas to pyarrow.Table.

If you transform manually
{code:python}
table = pyarrow.Table.from_pandas(data) {code}
you'll see that this is the step causing the memory spike; writing this table to parquet 
looks light.)

> [Python] High memory usage writing pyarrow.Table with large strings to parquet
> --
>
> Key: ARROW-7305
> URL: https://issues.apache.org/jira/browse/ARROW-7305
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Affects Versions: 0.15.1
> Environment: Mac OSX
>Reporter: Bogdan Klichuk
>Priority: Major
>  Labels: parquet
>
> My use case for stored datasets is specific: I have large strings (1-100MB each).
> Let's take for example a single row.
> 43mb.csv is a 1-row CSV with 10 columns. One column is a 43MB string.
> When I read this CSV with pandas and then dump to parquet, my script consumes 
> 10x the 43MB.
> As the number of such rows increases, the memory footprint overhead diminishes, but 
> I want to focus on this specific case.
> Here's the footprint after running using memory profiler:
> {code:java}
> Line #Mem usageIncrement   Line Contents
> 
>  4 48.9 MiB 48.9 MiB   @profile
>  5 def test():
>  6143.7 MiB 94.7 MiB   data = pd.read_csv('43mb.csv')
>  7498.6 MiB354.9 MiB   data.to_parquet('out.parquet')
>  {code}
> Is this typical for parquet in case of big strings?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7363) [Python] flatten() doesn't work on ChunkedArray

2019-12-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7363:
-
Component/s: Python

> [Python] flatten() doesn't work on ChunkedArray
> ---
>
> Key: ARROW-7363
> URL: https://issues.apache.org/jira/browse/ARROW-7363
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
>Reporter: marc abboud
>Priority: Major
>
> Flatten() doesn't work on ChunkedArray. It returns only the ChunkedArray in a 
> list without flattening anything.
> {code:python}
> aa = pa.array([[1], [2]])
> bb = pa.chunked_array([aa, aa])
> bb.flatten()
> Out[15]:
> [ [ [ [ 1 ], [ 2 ] ], [ [ 1 ], [ 2 ] ] ]]
> Expected:
> [ [ 1, 2 ], [ 1, 2 ] ]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7363) [Python] flatten() doesn't work on ChunkedArray

2019-12-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7363:
-
Summary: [Python] flatten() doesn't work on ChunkedArray  (was: flatten() 
doesn't work on ChunkedArray)

> [Python] flatten() doesn't work on ChunkedArray
> ---
>
> Key: ARROW-7363
> URL: https://issues.apache.org/jira/browse/ARROW-7363
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.15.1
>Reporter: marc abboud
>Priority: Major
>
> Flatten() doesn't work on ChunkedArray. It returns only the ChunkedArray in a 
> list without flattening anything.
> {code:python}
> aa = pa.array([[1], [2]])
> bb = pa.chunked_array([aa, aa])
> bb.flatten()
> Out[15]:
> [ [ [ [ 1 ], [ 2 ] ], [ [ 1 ], [ 2 ] ] ]]
> Expected:
> [ [ 1, 2 ], [ 1, 2 ] ]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7364) [Rust] Add cast options to cast kernel

2019-12-10 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-7364:
--
Component/s: Rust

> [Rust] Add cast options to cast kernel
> --
>
> Key: ARROW-7364
> URL: https://issues.apache.org/jira/browse/ARROW-7364
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Neville Dipale
>Priority: Major
>
> The cast kernels currently do not take explicit options, but instead convert 
> overflows and invalid UTF-8 to nulls. We can create options that customise the 
> behaviour, similar to CastOptions in C++ 
> ([https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.h#L38])
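The safe/unsafe distinction could look roughly like this in a pure-Python sketch (hypothetical function name, not the Rust or C++ API; `None` stands in for a null):

```python
def cast_strings_to_int(values, safe=True):
    # safe=True raises on invalid input; safe=False converts invalid
    # entries to null (None) instead -- the two behaviours a
    # CastOptions-style flag would select between.
    out = []
    for v in values:
        try:
            out.append(int(v))
        except (TypeError, ValueError):
            if safe:
                raise
            out.append(None)
    return out
```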



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7364) [Rust] Add cast options to cast kernel

2019-12-10 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-7364:
-

 Summary: [Rust] Add cast options to cast kernel
 Key: ARROW-7364
 URL: https://issues.apache.org/jira/browse/ARROW-7364
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Neville Dipale


The cast kernels currently do not take explicit options, but instead convert 
overflows and invalid UTF-8 to nulls. We can create options that customise the 
behaviour, similar to CastOptions in C++ 
([https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.h#L38])



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7363) flatten() doesn't work on ChunkedArray

2019-12-10 Thread marc abboud (Jira)
marc abboud created ARROW-7363:
--

 Summary: flatten() doesn't work on ChunkedArray
 Key: ARROW-7363
 URL: https://issues.apache.org/jira/browse/ARROW-7363
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.15.1
Reporter: marc abboud


Flatten() doesn't work on ChunkedArray. It returns only the ChunkedArray in a 
list without flattening anything.
{code:python}
aa = pa.array([[1], [2]])
bb = pa.chunked_array([aa, aa])

bb.flatten()

Out[15]:
[ [ [ [ 1 ], [ 2 ] ], [ [ 1 ], [ 2 ] ] ]]

Expected:
[ [ 1, 2 ], [ 1, 2 ] ]
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)