wjones127 commented on issue #34755: URL: https://github.com/apache/arrow/issues/34755#issuecomment-1489183918
Maybe this is a tangent, but I think this question gets at how complex we want arrays to be. I sometimes wish that whether an array is chunked or not were an implementation detail, rather than a top-level type. This is especially true when chunking is considered in combination with other array differences. A good example is string arrays: between chunked and contiguous layouts, offset sizes, and encodings, there are 36 possible string array data types, which are represented as 5 possible classes (ChunkedArray, StringArray, LargeStringArray, RunEndEncodedArray, DictionaryArray).

<details>
<summary>All 36 string arrays in PyArrow</summary>

```python
import pyarrow as pa

strings = ["hello", "world"]

# Can have i32 or i64 offsets:
pa.array(strings, pa.utf8())
pa.array(strings, pa.large_utf8())

# Can also be chunked (chunked_array takes a list of chunks)
pa.chunked_array([strings], pa.utf8())
pa.chunked_array([strings], pa.large_utf8())

# Can be dictionary encoded (with different index widths)
pa.array(strings, pa.dictionary(pa.int32(), pa.utf8()))
pa.array(strings, pa.dictionary(pa.int8(), pa.large_utf8()))

# Can be run-end encoded
# (pa.ree(...) is shorthand for a run-end encoded type; the PyArrow
# type factory is pa.run_end_encoded(run_end_type, value_type))
pa.array(strings, pa.ree(pa.utf8()))

# Can be any combination of the above
pa.chunked_array([strings], pa.ree(pa.dictionary(pa.int8(), pa.utf8())))
```

```python
num_possible_indices = 2  # i32 or i64
num_possible_chunking = 2  # contiguous or chunked
num_possible_encodings = 3  # dictionary, ree, or ree + dictionary
num_possible_dictionary_index = 4

# 2 * 2 * ((2 * 4) + 1) = 36 possible string arrays in arrow
```

These can be one of the following Python classes:

```
ChunkedArray
StringArray
LargeStringArray
RunEndEncodedArray
DictionaryArray
```

</details>

I'd be in favor of keeping `pa.array()` returning either an Array or a ChunkedArray, since it's a high-level function and I'd rather our higher-level APIs not care as much about the buffer layout.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
