[jira] [Commented] (ARROW-360) C++: Add method to shrink PoolBuffer using realloc

2017-01-06 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15805217#comment-15805217
 ] 

Uwe L. Korn commented on ARROW-360:
---

PR: https://github.com/apache/arrow/pull/272

> C++: Add method to shrink PoolBuffer using realloc
> --
>
> Key: ARROW-360
> URL: https://issues.apache.org/jira/browse/ARROW-360
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>
> In the case where we have optimistically allocated a large PoolBuffer, we
> could shrink it again later using a call to {{realloc}}. This should free the
> excess memory while avoiding an actual {{memcpy}}.
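
The shrink step described above could look roughly like the following. This is a minimal sketch, not the code from the linked PR: {{SketchBuffer}}, {{Reserve}}, {{ShrinkToFit}} and {{Release}} are hypothetical names, and plain {{malloc}}/{{realloc}} stand in for whatever the memory pool actually uses. Note that the C standard does not promise an in-place shrink; allocators usually do it without copying, but the block may still move.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-in for a pool-backed buffer; this is not Arrow's PoolBuffer.
struct SketchBuffer {
  uint8_t* data = nullptr;
  size_t capacity = 0;  // bytes owned by the buffer
  size_t size = 0;      // bytes actually in use
};

// Optimistically allocate `capacity` bytes. Returns false if allocation fails.
bool Reserve(SketchBuffer* buf, size_t capacity) {
  assert(buf->data == nullptr && capacity > 0);
  buf->data = static_cast<uint8_t*>(std::malloc(capacity));
  if (buf->data == nullptr) return false;
  buf->capacity = capacity;
  return true;
}

// Give back the unused tail. realloc may shrink in place or move the block;
// either way the first `size` bytes are preserved.
bool ShrinkToFit(SketchBuffer* buf) {
  assert(buf->size > 0 && buf->size <= buf->capacity);
  void* shrunk = std::realloc(buf->data, buf->size);
  if (shrunk == nullptr) return false;  // the old block is still valid
  buf->data = static_cast<uint8_t*>(shrunk);
  buf->capacity = buf->size;
  return true;
}

// Free the memory and reset the bookkeeping.
void Release(SketchBuffer* buf) {
  std::free(buf->data);
  *buf = SketchBuffer();
}
```

A caller would reserve generously, fill in the bytes it needs, set {{size}}, and then call {{ShrinkToFit}} once the final length is known.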



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ARROW-96) C++: API documentation using Doxygen

2017-01-06 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-96?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15805024#comment-15805024
 ] 

Uwe L. Korn commented on ARROW-96:
--

PR: https://github.com/apache/arrow/pull/271

> C++: API documentation using Doxygen 
> -
>
> Key: ARROW-96
> URL: https://issues.apache.org/jira/browse/ARROW-96
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>
> For developers using Arrow via C++, we should provide automatically
> generated API documentation via Doxygen.





[jira] [Updated] (ARROW-462) [C++] Implement in-memory conversions between non-nested primitive types and DictionaryArray equivalent

2017-01-06 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-462:
---
Description: 
We use a hash table to extract unique values and dictionary indices. There may
be an opportunity to consolidate common code from the dictionary encoding
implementation in parquet-cpp (although the dictionary indices will not be
run-length encoded in Arrow):

https://github.com/apache/parquet-cpp/blob/master/src/parquet/encodings/dictionary-encoding.h

This functionality also needs to permit encoding split across multiple record 
batches -- so the hash table would be a stateful entity, and we can continue to 
hash more chunks of data to dictionary-encode multiple arrays with a shared 
dictionary at the end. 

  was:
We use a hash table to extract unique values and dictionary indices. There may
be an opportunity to consolidate common code from the dictionary encoding
implementation in parquet-cpp (although the dictionary indices will not be
run-length encoded in Arrow):

https://github.com/apache/parquet-cpp/blob/master/src/parquet/encodings/dictionary-encoding.h


> [C++] Implement in-memory conversions between non-nested primitive types and 
> DictionaryArray equivalent
> ---
>
> Key: ARROW-462
> URL: https://issues.apache.org/jira/browse/ARROW-462
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>
> We use a hash table to extract unique values and dictionary indices. There
> may be an opportunity to consolidate common code from the dictionary encoding
> implementation in parquet-cpp (although the dictionary indices will not be
> run-length encoded in Arrow):
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/encodings/dictionary-encoding.h
> This functionality also needs to permit encoding split across multiple record 
> batches -- so the hash table would be a stateful entity, and we can continue 
> to hash more chunks of data to dictionary-encode multiple arrays with a 
> shared dictionary at the end. 
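
To illustrate the stateful hash table described above, here is a minimal sketch. It assumes string values and uses {{std::unordered_map}} purely for brevity; {{DictionaryEncoderSketch}} and its methods are hypothetical names, not Arrow or parquet-cpp APIs. The point is only that the table outlives a single call, so later chunks reuse the indices assigned to values seen in earlier chunks.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stateful encoder: the hash table persists across chunks.
class DictionaryEncoderSketch {
 public:
  // Encode one chunk and return its dictionary indices. Values first seen
  // in this chunk are appended to the shared dictionary.
  std::vector<int32_t> Encode(const std::vector<std::string>& chunk) {
    std::vector<int32_t> indices;
    indices.reserve(chunk.size());
    for (const std::string& value : chunk) {
      assert(dictionary_.size() <
             static_cast<size_t>(std::numeric_limits<int32_t>::max()));
      const int32_t next = static_cast<int32_t>(dictionary_.size());
      // emplace leaves the table untouched if the key already exists.
      auto result = index_of_.emplace(value, next);
      if (result.second) dictionary_.push_back(value);
      indices.push_back(result.first->second);
    }
    return indices;
  }

  // Unique values in first-seen order, shared by every encoded chunk.
  const std::vector<std::string>& dictionary() const { return dictionary_; }

 private:
  std::unordered_map<std::string, int32_t> index_of_;
  std::vector<std::string> dictionary_;
};
```

Each call to {{Encode}} would correspond to one record batch, and {{dictionary()}} is read once at the end to obtain the dictionary shared by all of the index arrays.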





[jira] [Commented] (ARROW-462) [C++] Implement in-memory conversions between non-nested primitive types and DictionaryArray equivalent

2017-01-06 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804932#comment-15804932
 ] 

Uwe L. Korn commented on ARROW-462:
---

Ah, that makes sense. It may be possible to provide this with
{{std::unordered_map}}, though perhaps not in a simple way.

> [C++] Implement in-memory conversions between non-nested primitive types and 
> DictionaryArray equivalent
> ---
>
> Key: ARROW-462
> URL: https://issues.apache.org/jira/browse/ARROW-462
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>
> We use a hash table to extract unique values and dictionary indices. There
> may be an opportunity to consolidate common code from the dictionary encoding
> implementation in parquet-cpp (although the dictionary indices will not be
> run-length encoded in Arrow):
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/encodings/dictionary-encoding.h





[jira] [Commented] (ARROW-462) [C++] Implement in-memory conversions between non-nested primitive types and DictionaryArray equivalent

2017-01-06 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804922#comment-15804922
 ] 

Wes McKinney commented on ARROW-462:


One issue is the handling of the hash keys (e.g. strings). After performing the
hash table pass, you would like to minimize the time needed to create the final
dictionary and indices arrays. We can run various performance experiments and
choose whatever yields the best trade-off between performance and simplicity.

> [C++] Implement in-memory conversions between non-nested primitive types and 
> DictionaryArray equivalent
> ---
>
> Key: ARROW-462
> URL: https://issues.apache.org/jira/browse/ARROW-462
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>
> We use a hash table to extract unique values and dictionary indices. There
> may be an opportunity to consolidate common code from the dictionary encoding
> implementation in parquet-cpp (although the dictionary indices will not be
> run-length encoded in Arrow):
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/encodings/dictionary-encoding.h





[jira] [Commented] (ARROW-462) [C++] Implement in-memory conversions between non-nested primitive types and DictionaryArray equivalent

2017-01-06 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804914#comment-15804914
 ] 

Uwe L. Korn commented on ARROW-462:
---

This might also be a good point to reconsider whether a custom hash-table
implementation is worthwhile, or whether {{std::unordered_map}} gives us the
same performance.

> [C++] Implement in-memory conversions between non-nested primitive types and 
> DictionaryArray equivalent
> ---
>
> Key: ARROW-462
> URL: https://issues.apache.org/jira/browse/ARROW-462
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>
> We use a hash table to extract unique values and dictionary indices. There
> may be an opportunity to consolidate common code from the dictionary encoding
> implementation in parquet-cpp (although the dictionary indices will not be
> run-length encoded in Arrow):
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/encodings/dictionary-encoding.h





[jira] [Created] (ARROW-462) [C++] Implement in-memory conversions between non-nested primitive types and DictionaryArray equivalent

2017-01-06 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-462:
--

 Summary: [C++] Implement in-memory conversions between non-nested 
primitive types and DictionaryArray equivalent
 Key: ARROW-462
 URL: https://issues.apache.org/jira/browse/ARROW-462
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


We use a hash table to extract unique values and dictionary indices. There may
be an opportunity to consolidate common code from the dictionary encoding
implementation in parquet-cpp (although the dictionary indices will not be
run-length encoded in Arrow):

https://github.com/apache/parquet-cpp/blob/master/src/parquet/encodings/dictionary-encoding.h





[jira] [Resolved] (ARROW-427) [C++] Implement dictionary-encoded array container

2017-01-06 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-427.

Resolution: Fixed

Issue resolved by pull request 268
[https://github.com/apache/arrow/pull/268]

> [C++] Implement dictionary-encoded array container
> --
>
> Key: ARROW-427
> URL: https://issues.apache.org/jira/browse/ARROW-427
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>
> This will compose an array of dictionary indices (currently int32, per the
> format doc) and a dictionary array.
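
As a rough picture of that composition, the sketch below pairs an int32 index array with a dictionary of unique values and shows how the dense values are recovered. {{DictArraySketch}} and {{Decode}} are hypothetical names chosen for illustration; the container merged in the pull request may be structured quite differently, and string values are just an assumption here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical container; this is not the arrow::DictionaryArray class.
struct DictArraySketch {
  std::vector<int32_t> indices;         // one entry per logical element
  std::vector<std::string> dictionary;  // unique values
};

// Expand back to dense values by looking up each index in the dictionary.
std::vector<std::string> Decode(const DictArraySketch& arr) {
  std::vector<std::string> out;
  out.reserve(arr.indices.size());
  for (int32_t index : arr.indices) {
    assert(index >= 0 && static_cast<size_t>(index) < arr.dictionary.size());
    out.push_back(arr.dictionary[static_cast<size_t>(index)]);
  }
  return out;
}
```

Repeated values cost one 4-byte index each instead of a full copy, which is the main appeal of the layout for low-cardinality data.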





[jira] [Created] (ARROW-461) [Python] Implement conversion between arrow::DictionaryArray and pandas.Categorical

2017-01-06 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-461:
--

 Summary: [Python] Implement conversion between 
arrow::DictionaryArray and pandas.Categorical
 Key: ARROW-461
 URL: https://issues.apache.org/jira/browse/ARROW-461
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney








[jira] [Created] (ARROW-460) [C++] Implement JSON round trip for DictionaryArray

2017-01-06 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-460:
--

 Summary: [C++] Implement JSON round trip for DictionaryArray
 Key: ARROW-460
 URL: https://issues.apache.org/jira/browse/ARROW-460
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney








[jira] [Created] (ARROW-459) [C++] Implement IPC round trip for DictionaryArray, dictionaries shared across record batches

2017-01-06 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-459:
--

 Summary: [C++] Implement IPC round trip for DictionaryArray, 
dictionaries shared across record batches
 Key: ARROW-459
 URL: https://issues.apache.org/jira/browse/ARROW-459
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney








[jira] [Resolved] (ARROW-456) C++: Add jemalloc based MemoryPool

2017-01-06 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-456.
---
Resolution: Fixed

Issue resolved by pull request 270
[https://github.com/apache/arrow/pull/270]

> C++: Add jemalloc based MemoryPool
> --
>
> Key: ARROW-456
> URL: https://issues.apache.org/jira/browse/ARROW-456
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>
> This should allow us to do aligned realloc calls.
> As we don't want to enforce jemalloc as the only allocator, this will be an 
> additional (small) leaf library.
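
For context on why an aligned realloc matters: standard {{realloc}} gives no way to request a particular alignment, so a portable implementation has to allocate a fresh aligned block, copy, and free the old one. The sketch below shows that fallback. It assumes a POSIX system ({{posix_memalign}}) and a 64-byte alignment chosen only for illustration; {{AlignedReallocate}} is a hypothetical helper, not code from the pull request. jemalloc's non-standard {{rallocx}} with the {{MALLOCX_ALIGN}} flag is the kind of call that can resize while keeping the alignment, potentially without the copy.

```cpp
#include <stdlib.h>  // posix_memalign (POSIX)

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

constexpr size_t kAlignment = 64;  // assumed alignment, for illustration only

// Portable fallback: allocate a new aligned block, copy the overlapping
// bytes, and free the old block. Returns nullptr on failure, in which case
// `old_ptr` is left untouched.
uint8_t* AlignedReallocate(uint8_t* old_ptr, size_t old_size, size_t new_size) {
  assert(new_size > 0);
  void* fresh = nullptr;
  if (posix_memalign(&fresh, kAlignment, new_size) != 0) return nullptr;
  if (old_ptr != nullptr) {
    std::memcpy(fresh, old_ptr, std::min(old_size, new_size));
    std::free(old_ptr);
  }
  return static_cast<uint8_t*>(fresh);
}
```

The {{memcpy}} in the middle is exactly the cost an allocator-level aligned realloc can often avoid.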





[jira] [Created] (ARROW-458) Python: Expose jemalloc MemoryPool

2017-01-06 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-458:
-

 Summary: Python: Expose jemalloc MemoryPool
 Key: ARROW-458
 URL: https://issues.apache.org/jira/browse/ARROW-458
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn


Expose the {{jemalloc::MemoryPool}} to Python users as a separate 
{{pyarrow.jemalloc}} module.





[jira] [Created] (ARROW-457) Python: Better control over memory pool

2017-01-06 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-457:
-

 Summary: Python: Better control over memory pool
 Key: ARROW-457
 URL: https://issues.apache.org/jira/browse/ARROW-457
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Uwe L. Korn


Currently we have a separate {{PyArrowMemoryPool}} implemented in
{{src/pyarrow/common.cc/h}}. Instead we should use the default memory pool
from Arrow-C++ as often as possible.

Furthermore the user should be able to configure which MemoryPool is actually 
used in the cases where one can select a custom MemoryPool. For ease of use, 
there should also be a way to switch the default MemoryPool in Python to a 
user-selected one, e.g. the {{jemalloc::MemoryPool}}.
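
One way to picture the switchable default is a process-wide slot that holds the current pool. The C++ sketch below is only an illustration of that mechanism, which a Python binding could wrap; {{Pool}}, {{MallocPool}}, {{GetDefaultPool}} and {{SetDefaultPool}} are hypothetical names, not the Arrow or pyarrow API, and the slot is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical interface, loosely modelled on the idea of a memory pool.
class Pool {
 public:
  virtual ~Pool() = default;
  virtual void* Allocate(size_t size) = 0;
  virtual void Free(void* ptr, size_t size) = 0;
  virtual int64_t bytes_allocated() const = 0;
};

// Pool backed by malloc/free that tracks the outstanding bytes.
class MallocPool : public Pool {
 public:
  void* Allocate(size_t size) override {
    void* ptr = std::malloc(size);
    if (ptr != nullptr) bytes_ += static_cast<int64_t>(size);
    return ptr;
  }
  void Free(void* ptr, size_t size) override {
    std::free(ptr);
    bytes_ -= static_cast<int64_t>(size);
  }
  int64_t bytes_allocated() const override { return bytes_; }

 private:
  int64_t bytes_ = 0;
};

// Process-wide slot holding the current default pool (not thread-safe).
Pool*& DefaultPoolSlot() {
  static MallocPool system_pool;
  static Pool* current = &system_pool;
  return current;
}

Pool* GetDefaultPool() { return DefaultPoolSlot(); }

// Swap the default pool; the caller keeps ownership of `pool`.
void SetDefaultPool(Pool* pool) {
  assert(pool != nullptr);
  DefaultPoolSlot() = pool;
}
```

Code that accepts an explicit pool argument would fall back to {{GetDefaultPool()}} when none is given, so switching the default changes behaviour everywhere without touching call sites.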





[jira] [Commented] (ARROW-456) C++: Add jemalloc based MemoryPool

2017-01-06 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804283#comment-15804283
 ] 

Uwe L. Korn commented on ARROW-456:
---

PR: https://github.com/apache/arrow/pull/270

> C++: Add jemalloc based MemoryPool
> --
>
> Key: ARROW-456
> URL: https://issues.apache.org/jira/browse/ARROW-456
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>
> This should allow us to do aligned realloc calls.
> As we don't want to enforce jemalloc as the only allocator, this will be an 
> additional (small) leaf library.


