[jira] [Commented] (ARROW-360) C++: Add method to shrink PoolBuffer using realloc
[ https://issues.apache.org/jira/browse/ARROW-360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15805217#comment-15805217 ]

Uwe L. Korn commented on ARROW-360:
-----------------------------------

PR: https://github.com/apache/arrow/pull/272

> C++: Add method to shrink PoolBuffer using realloc
> --------------------------------------------------
>
> Key: ARROW-360
> URL: https://issues.apache.org/jira/browse/ARROW-360
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
>
> In the case where we have optimistically allocated a large PoolBuffer, we
> could shrink it again later using a call to {{realloc}}. This should free the
> excess memory while avoiding an actual {{memcpy}}.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
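A minimal sketch of the shrink-with-realloc idea from ARROW-360, for readers who want to see it in code. `SimpleBuffer`, `Reserve` and `ShrinkToFit` are hypothetical names chosen for illustration; this is not Arrow's `PoolBuffer` and not the code from the pull request. It also assumes plain `std::realloc`: the C standard does not promise that a shrinking `realloc` stays in place, although common allocators usually do so.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for a growable byte buffer; this is NOT Arrow's PoolBuffer.
class SimpleBuffer {
 public:
  SimpleBuffer() = default;
  SimpleBuffer(const SimpleBuffer&) = delete;
  SimpleBuffer& operator=(const SimpleBuffer&) = delete;
  ~SimpleBuffer() { std::free(data_); }

  // Optimistically grow the allocation to at least new_capacity bytes.
  bool Reserve(std::size_t new_capacity) {
    if (new_capacity <= capacity_) return true;
    void* p = std::realloc(data_, new_capacity);
    if (p == nullptr) return false;  // the old block is still valid
    data_ = static_cast<std::uint8_t*>(p);
    capacity_ = new_capacity;
    return true;
  }

  // Hand the bytes beyond used_bytes back to the allocator.
  bool ShrinkToFit(std::size_t used_bytes) {
    assert(used_bytes <= capacity_);
    if (used_bytes == capacity_) return true;
    if (used_bytes == 0) {
      // realloc(ptr, 0) is implementation-defined, so free explicitly.
      std::free(data_);
      data_ = nullptr;
      capacity_ = 0;
      return true;
    }
    void* p = std::realloc(data_, used_bytes);
    if (p == nullptr) return false;  // shrinking failed, buffer unchanged
    data_ = static_cast<std::uint8_t*>(p);
    capacity_ = used_bytes;
    return true;
  }

  std::uint8_t* data() { return data_; }
  std::size_t capacity() const { return capacity_; }

 private:
  std::uint8_t* data_ = nullptr;
  std::size_t capacity_ = 0;
};
```

Typical use would be `Reserve` with a generous upper bound, writing the data, then `ShrinkToFit` with the number of bytes actually written. The first `used_bytes` bytes are preserved either way; whether a copy happens internally is up to the allocator.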
[jira] [Commented] (ARROW-96) C++: API documentation using Doxygen
[ https://issues.apache.org/jira/browse/ARROW-96?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15805024#comment-15805024 ]

Uwe L. Korn commented on ARROW-96:
----------------------------------

PR: https://github.com/apache/arrow/pull/271

> C++: API documentation using Doxygen
> ------------------------------------
>
> Key: ARROW-96
> URL: https://issues.apache.org/jira/browse/ARROW-96
> Project: Apache Arrow
> Issue Type: Task
> Components: C++
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
>
> For the developers using Arrow via C++, we should provide automatically
> generated API documentation via Doxygen.
[jira] [Updated] (ARROW-462) [C++] Implement in-memory conversions between non-nested primitive types and DictionaryArray equivalent
[ https://issues.apache.org/jira/browse/ARROW-462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-462:
-------------------------------
    Description:
We use a hash table to extract unique values and dictionary indices. There may be an opportunity to consolidate common code from the dictionary encoding implementation in parquet-cpp (but the dictionary indices will not be run-length encoded in Arrow):
https://github.com/apache/parquet-cpp/blob/master/src/parquet/encodings/dictionary-encoding.h
This functionality also needs to permit encoding split across multiple record batches -- so the hash table would be a stateful entity, and we can continue to hash more chunks of data to dictionary-encode multiple arrays with a shared dictionary at the end.

  was:
We use a hash table to extract unique values and dictionary indices. There may be an opportunity to consolidate common code from the dictionary encoding implementation in parquet-cpp (but the dictionary indices will not be run-length encoded in Arrow):
https://github.com/apache/parquet-cpp/blob/master/src/parquet/encodings/dictionary-encoding.h

> [C++] Implement in-memory conversions between non-nested primitive types and
> DictionaryArray equivalent
> ----------------------------------------------------------------------------
>
> Key: ARROW-462
> URL: https://issues.apache.org/jira/browse/ARROW-462
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
>
> We use a hash table to extract unique values and dictionary indices. There
> may be an opportunity to consolidate common code from the dictionary encoding
> implementation in parquet-cpp (but the dictionary indices will
> not be run-length encoded in Arrow):
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/encodings/dictionary-encoding.h
> This functionality also needs to permit encoding split across multiple record
> batches -- so the hash table would be a stateful entity, and we can continue
> to hash more chunks of data to dictionary-encode multiple arrays with a
> shared dictionary at the end.
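To make the "stateful hash table" requirement concrete, here is a rough sketch of an encoder that keeps its table between calls, so that several chunks end up sharing one dictionary. It assumes `std::unordered_map` as the hash table and `std::string` values; `ChunkedDictionaryEncoder` and its methods are hypothetical names, not the parquet-cpp or Arrow implementation, and no performance claim is intended.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical encoder for illustration only; not Arrow or parquet-cpp code.
class ChunkedDictionaryEncoder {
 public:
  // Encode one chunk. Values already seen in earlier chunks reuse their index,
  // new values are appended to the shared dictionary.
  std::vector<std::int32_t> Encode(const std::vector<std::string>& chunk) {
    std::vector<std::int32_t> indices;
    indices.reserve(chunk.size());
    for (const std::string& value : chunk) {
      auto it = lookup_.find(value);
      if (it == lookup_.end()) {
        // Indices are int32 here, so the dictionary must stay below that limit.
        assert(dictionary_.size() <
               static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()));
        const auto index = static_cast<std::int32_t>(dictionary_.size());
        it = lookup_.emplace(value, index).first;
        dictionary_.push_back(value);
      }
      indices.push_back(it->second);
    }
    return indices;
  }

  // The dictionary built so far, in order of first appearance.
  const std::vector<std::string>& dictionary() const { return dictionary_; }

 private:
  std::unordered_map<std::string, std::int32_t> lookup_;  // value -> index
  std::vector<std::string> dictionary_;                   // index -> value
};
```

Calling `Encode` once per record batch and reading `dictionary()` only after the last batch gives every batch indices into the same final dictionary. Note that this sketch stores each key twice (once in the map, once in the vector), which is one of the costs a custom hash table might avoid.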
[jira] [Commented] (ARROW-462) [C++] Implement in-memory conversions between non-nested primitive types and DictionaryArray equivalent
[ https://issues.apache.org/jira/browse/ARROW-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804932#comment-15804932 ]

Uwe L. Korn commented on ARROW-462:
-----------------------------------

Ah, that makes sense. It may be possible to provide this with {{std::unordered_map}}, but maybe not in a simple way.

> [C++] Implement in-memory conversions between non-nested primitive types and
> DictionaryArray equivalent
> ----------------------------------------------------------------------------
>
> Key: ARROW-462
> URL: https://issues.apache.org/jira/browse/ARROW-462
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
>
> We use a hash table to extract unique values and dictionary indices. There
> may be an opportunity to consolidate common code from the dictionary encoding
> implementation in parquet-cpp (but the dictionary indices will
> not be run-length encoded in Arrow):
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/encodings/dictionary-encoding.h
[jira] [Commented] (ARROW-462) [C++] Implement in-memory conversions between non-nested primitive types and DictionaryArray equivalent
[ https://issues.apache.org/jira/browse/ARROW-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804922#comment-15804922 ]

Wes McKinney commented on ARROW-462:
------------------------------------

One issue is the handling of the hash keys (e.g. strings). After performing the hash table pass, you would like to minimize the time to create the final dictionary and indices arrays. We can run various performance experiments and choose whatever yields the best performance for simplicity.

> [C++] Implement in-memory conversions between non-nested primitive types and
> DictionaryArray equivalent
> ----------------------------------------------------------------------------
>
> Key: ARROW-462
> URL: https://issues.apache.org/jira/browse/ARROW-462
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
>
> We use a hash table to extract unique values and dictionary indices. There
> may be an opportunity to consolidate common code from the dictionary encoding
> implementation in parquet-cpp (but the dictionary indices will
> not be run-length encoded in Arrow):
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/encodings/dictionary-encoding.h
[jira] [Commented] (ARROW-462) [C++] Implement in-memory conversions between non-nested primitive types and DictionaryArray equivalent
[ https://issues.apache.org/jira/browse/ARROW-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804914#comment-15804914 ]

Uwe L. Korn commented on ARROW-462:
-----------------------------------

This might also be a good point to reconsider whether a custom hash-table implementation is worth having, or whether {{std::unordered_map}} gives us the same performance.

> [C++] Implement in-memory conversions between non-nested primitive types and
> DictionaryArray equivalent
> ----------------------------------------------------------------------------
>
> Key: ARROW-462
> URL: https://issues.apache.org/jira/browse/ARROW-462
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
>
> We use a hash table to extract unique values and dictionary indices. There
> may be an opportunity to consolidate common code from the dictionary encoding
> implementation in parquet-cpp (but the dictionary indices will
> not be run-length encoded in Arrow):
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/encodings/dictionary-encoding.h
[jira] [Created] (ARROW-462) [C++] Implement in-memory conversions between non-nested primitive types and DictionaryArray equivalent
Wes McKinney created ARROW-462:
-------------------------------

Summary: [C++] Implement in-memory conversions between non-nested primitive types and DictionaryArray equivalent
Key: ARROW-462
URL: https://issues.apache.org/jira/browse/ARROW-462
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Reporter: Wes McKinney

We use a hash table to extract unique values and dictionary indices. There may be an opportunity to consolidate common code from the dictionary encoding implementation in parquet-cpp (but the dictionary indices will not be run-length encoded in Arrow):
https://github.com/apache/parquet-cpp/blob/master/src/parquet/encodings/dictionary-encoding.h
[jira] [Resolved] (ARROW-427) [C++] Implement dictionary-encoded array container
[ https://issues.apache.org/jira/browse/ARROW-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-427.
--------------------------------
Resolution: Fixed

Issue resolved by pull request 268
[https://github.com/apache/arrow/pull/268]

> [C++] Implement dictionary-encoded array container
> --------------------------------------------------
>
> Key: ARROW-427
> URL: https://issues.apache.org/jira/browse/ARROW-427
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Assignee: Wes McKinney
>
> This will compose an array of dictionary indices (int32 type currently per
> the format doc) and a dictionary array
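As an illustration of what "composing" indices and a dictionary means, the following sketch pairs an int32 index vector with a dictionary vector and resolves logical values through the indices. `DictionaryEncoded` is a hypothetical class written for this note; it is not the `arrow::DictionaryArray` API from pull request 268, and it ignores nulls entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container for illustration; NOT the arrow::DictionaryArray API.
template <typename T>
class DictionaryEncoded {
 public:
  DictionaryEncoded(std::vector<std::int32_t> indices, std::vector<T> dictionary)
      : indices_(std::move(indices)), dictionary_(std::move(dictionary)) {}

  // Number of logical values (one per index, not per dictionary entry).
  std::size_t length() const { return indices_.size(); }

  // Logical value at position i, resolved through the dictionary.
  const T& Value(std::size_t i) const {
    assert(i < indices_.size());
    const std::int32_t index = indices_[i];
    assert(index >= 0 && static_cast<std::size_t>(index) < dictionary_.size());
    return dictionary_[static_cast<std::size_t>(index)];
  }

  // Materialize the dense, non-dictionary form.
  std::vector<T> Decode() const {
    std::vector<T> out;
    out.reserve(indices_.size());
    for (std::size_t i = 0; i < indices_.size(); ++i) out.push_back(Value(i));
    return out;
  }

 private:
  std::vector<std::int32_t> indices_;  // one entry per logical value
  std::vector<T> dictionary_;          // unique values
};
```

The dictionary is stored once however often a value repeats, which is where the space saving for low-cardinality data comes from.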
[jira] [Created] (ARROW-461) [Python] Implement conversion between arrow::DictionaryArray and pandas.Categorical
Wes McKinney created ARROW-461:
-------------------------------

Summary: [Python] Implement conversion between arrow::DictionaryArray and pandas.Categorical
Key: ARROW-461
URL: https://issues.apache.org/jira/browse/ARROW-461
Project: Apache Arrow
Issue Type: New Feature
Components: Python
Reporter: Wes McKinney
[jira] [Created] (ARROW-460) [C++] Implement JSON round trip for DictionaryArray
Wes McKinney created ARROW-460:
-------------------------------

Summary: [C++] Implement JSON round trip for DictionaryArray
Key: ARROW-460
URL: https://issues.apache.org/jira/browse/ARROW-460
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Reporter: Wes McKinney
[jira] [Created] (ARROW-459) [C++] Implement IPC round trip for DictionaryArray, dictionaries shared across record batches
Wes McKinney created ARROW-459:
-------------------------------

Summary: [C++] Implement IPC round trip for DictionaryArray, dictionaries shared across record batches
Key: ARROW-459
URL: https://issues.apache.org/jira/browse/ARROW-459
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Reporter: Wes McKinney
[jira] [Resolved] (ARROW-456) C++: Add jemalloc based MemoryPool
[ https://issues.apache.org/jira/browse/ARROW-456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe L. Korn resolved ARROW-456.
-------------------------------
Resolution: Fixed

Issue resolved by pull request 270
[https://github.com/apache/arrow/pull/270]

> C++: Add jemalloc based MemoryPool
> ----------------------------------
>
> Key: ARROW-456
> URL: https://issues.apache.org/jira/browse/ARROW-456
> Project: Apache Arrow
> Issue Type: New Feature
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
>
> This should allow us to do aligned realloc calls.
> As we don't want to enforce jemalloc as the only allocator, this will be an
> additional (small) leaf library.
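For context on why an aligned realloc is attractive: the standard `realloc` gives no way to request a specific alignment, so a pool built on the standard or POSIX allocation functions has to emulate an aligned reallocation with allocate, copy and free. The sketch below shows that fallback. `AlignedPool`, `FallbackPool` and `kAlignment` are hypothetical names for illustration, not Arrow's `MemoryPool` interface, and `posix_memalign` assumes a POSIX platform. jemalloc's non-standard API offers `rallocx` with the `MALLOCX_ALIGN` flag, which is presumably the kind of call the issue refers to; it is not used here so that the example builds without jemalloc.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdlib.h>  // posix_memalign, free (POSIX; not standard C++)

constexpr std::size_t kAlignment = 64;  // assumed alignment for this sketch

// Hypothetical interface for illustration; NOT Arrow's MemoryPool API.
class AlignedPool {
 public:
  virtual ~AlignedPool() = default;
  virtual bool Allocate(std::size_t size, std::uint8_t** out) = 0;
  virtual bool Reallocate(std::size_t old_size, std::size_t new_size,
                          std::uint8_t** ptr) = 0;
  virtual void Free(std::uint8_t* ptr) = 0;
};

// Fallback without an aligned realloc: every resize allocates a new block.
class FallbackPool : public AlignedPool {
 public:
  bool Allocate(std::size_t size, std::uint8_t** out) override {
    void* p = nullptr;
    if (posix_memalign(&p, kAlignment, size) != 0) return false;
    *out = static_cast<std::uint8_t*>(p);
    return true;
  }

  bool Reallocate(std::size_t old_size, std::size_t new_size,
                  std::uint8_t** ptr) override {
    assert(ptr != nullptr);
    std::uint8_t* fresh = nullptr;
    if (!Allocate(new_size, &fresh)) return false;  // *ptr stays valid
    const std::size_t to_copy = std::min(old_size, new_size);
    if (to_copy > 0) std::memcpy(fresh, *ptr, to_copy);  // the copy we would like to avoid
    Free(*ptr);
    *ptr = fresh;
    return true;
  }

  void Free(std::uint8_t* ptr) override { free(ptr); }
};
```

A jemalloc-backed subclass could override `Reallocate` with a single aligned resize call and let the allocator decide whether the block has to move.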
[jira] [Created] (ARROW-458) Python: Expose jemalloc MemoryPool
Uwe L. Korn created ARROW-458:
------------------------------

Summary: Python: Expose jemalloc MemoryPool
Key: ARROW-458
URL: https://issues.apache.org/jira/browse/ARROW-458
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn

Expose the {{jemalloc::MemoryPool}} to Python users as a separate {{pyarrow.jemalloc}} module.
[jira] [Created] (ARROW-457) Python: Better control over memory pool
Uwe L. Korn created ARROW-457:
------------------------------

Summary: Python: Better control over memory pool
Key: ARROW-457
URL: https://issues.apache.org/jira/browse/ARROW-457
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Uwe L. Korn

Currently we have a separate {{PyArrowMemoryPool}} implemented in {{src/pyarrow/common.cc/h}}. Instead, we should use the default memory pool from Arrow-C++ as often as possible. Furthermore, the user should be able to configure which MemoryPool is actually used in the cases where one can select a custom MemoryPool. For ease of use, there should also be a way to switch the default MemoryPool in Python to a user-selected one, e.g. the {{jemalloc::MemoryPool}}.
[jira] [Commented] (ARROW-456) C++: Add jemalloc based MemoryPool
[ https://issues.apache.org/jira/browse/ARROW-456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804283#comment-15804283 ]

Uwe L. Korn commented on ARROW-456:
-----------------------------------

PR: https://github.com/apache/arrow/pull/270

> C++: Add jemalloc based MemoryPool
> ----------------------------------
>
> Key: ARROW-456
> URL: https://issues.apache.org/jira/browse/ARROW-456
> Project: Apache Arrow
> Issue Type: New Feature
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
>
> This should allow us to do aligned realloc calls.
> As we don't want to enforce jemalloc as the only allocator, this will be an
> additional (small) leaf library.