wjones127 commented on issue #34755:
URL: https://github.com/apache/arrow/issues/34755#issuecomment-1489183918

   Maybe this is a tangent, but I think this question gets at how complex we 
want arrays to be. I sometimes wish whether an array is chunked or not were an 
implementation detail, rather than a top-level type. This is especially true when 
chunking is considered in combination with other array differences. A good example 
is string arrays: between chunked and contiguous, index size, and encodings, 
there are 36 possible string array data types, which are represented by 5 
possible classes (ChunkedArray, StringArray, LargeStringArray, RunEndArray, 
DictionaryArray).
   
   <details>
   <summary>All 36 string arrays in PyArrow</summary>
   
   ```python
   import pyarrow as pa

   strings = ["hello", "world"]
   # Can have i32 or i64 indices:
   pa.array(strings, pa.utf8())
   pa.array(strings, pa.large_utf8())
   # Can also be chunked (chunked_array takes a list of chunks)
   pa.chunked_array([strings], pa.utf8())
   pa.chunked_array([strings], pa.large_utf8())
   # Can be dictionary encoded (with different indices width)
   pa.array(strings, pa.dictionary(pa.int32(), pa.utf8()))
   pa.array(strings, pa.dictionary(pa.int8(), pa.large_utf8()))
   # Can be run-end encoded (pa.ree is shorthand for a run-end encoded type,
   # not an existing PyArrow function)
   pa.array(strings, pa.ree(pa.utf8()))
   # Can be any combination of the above
   pa.chunked_array([strings], pa.ree(pa.dictionary(pa.int8(), pa.utf8())))
   ```
   
   ```python
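   num_possible_indices = 2  # i32 or i64
   num_possible_chunking = 2  # contiguous or chunked
   num_possible_encodings = 3  # dictionary, ree, or ree + dictionary
   num_possible_dictionary_index = 4
   
   # dictionary and ree + dictionary each have 4 index widths; plain ree adds 1
   # 2 * 2 * ((2 * 4) + 1) = 36 possible string arrays in arrow
   total = num_possible_indices * num_possible_chunking * ((2 * num_possible_dictionary_index) + 1)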
   ```
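
To make the tally concrete, here is a rough sketch that enumerates the combinations with the standard library instead of multiplying them out. The labels are only illustrative strings, and the four dictionary index widths (i8 through i64) are an assumption about which four were meant in the count.

```python
from itertools import product

index_sizes = ["i32", "i64"]  # utf8 vs. large_utf8
chunkings = ["contiguous", "chunked"]
# Assumption: the four dictionary index widths counted are i8, i16, i32, i64.
dictionary_index_widths = ["i8", "i16", "i32", "i64"]

# Encoded variants as tallied in the arithmetic: dictionary and
# ree + dictionary for each index width, plus plain ree.
encodings = (
    [f"dictionary<{w}>" for w in dictionary_index_widths]
    + [f"ree<dictionary<{w}>>" for w in dictionary_index_widths]
    + ["ree"]
)

# Every (index size, chunking, encoding) triple is one distinct string array.
variants = list(product(index_sizes, chunkings, encodings))
print(len(variants))  # 36
```

Each tuple in `variants` is one of the string array flavors a caller could be handed.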
   
   These can be one of the following Python classes:
   ```
   ChunkedArray
   StringArray
   LargeStringArray
   RunEndArray
   DictionaryArray
   ```
   </details>
   
   I'd be in favor of `pa.array()` continuing to return either an Array or a 
ChunkedArray, since it's a high-level function, and I'd rather our 
higher-level APIs not care as much about the buffer layout.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.